Enhancing Data Integrity for a National Charitable Initiative
Summary
TL;DR: The national non-profit needed to review existing data from thousands of schools and districts across the U.S., checking which sites were still operating and which had closed. We accomplished this by enriching the non-profit's database with the Google Places API and using heuristics to flag closed programs. The project took only a few days instead of months of manual research and phone calls. This non-profit works with families and communities to evaluate literacy programs across 39 states and 150 communities, reaching millions of parents and kids through innovative partnerships, with the goal of increasing literacy and eliminating education gaps.
Background: The national education non-profit had accumulated a vast database of programs. As the data grew, so did the challenges of ensuring its accuracy and relevance. To address these challenges, they turned to Ruffin Galactic.
Challenge: While the primary database was extensive, not all records were up-to-date or accurate. The organization's primary concern was to ensure that individuals seeking educational resources could access accurate and relevant information. The task was to update a significant portion of these records and ensure their validity.
Solution: Ruffin Galactic devised a comprehensive strategy:
- Read Replica Creation: To safeguard the primary database during the enhancement process, a read replica was established.
- Data Enrichment: Using the Google Places API as an external data source, we updated 7,000 records in the database. This enriched the program records with current information and eliminated the need for the organization to manually ask program managers for updates, streamlining the entire process (see the sketch after this list).
- Database Cleaning: The primary objective of this phase was to ensure the accuracy of the records, especially given the importance of the educational resources they represented. Roughly 2,500 records were manually reviewed; invalid entries were identified and corrected, ensuring that those seeking educational resources encountered no barriers in accessing them.
- Reporting & Dashboarding: Comprehensive reporting and dashboarding tools were set up to give stakeholders a multi-dimensional view of the data, supporting historical analysis and trend identification. Interactive dashboards offered visual representations of the data, making it easier to understand the reach and impact of each program. The reporting tools also surfaced key metrics and insights, revealing patterns and areas for improvement that had previously gone unnoticed. Combined with the ability to customize reports against specific criteria, this in-depth analysis empowered the organization to make informed decisions and strategize more effectively for the future.
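To illustrate the enrichment step, here is a minimal sketch of how a record might be checked against the Google Places API and flagged. The function names, the flagging heuristic, and the environment variable are hypothetical; the actual pipeline processed records in bulk with additional matching logic.

```python
import os
import requests

# Hypothetical setup: the actual pipeline details were not published.
PLACES_URL = "https://maps.googleapis.com/maps/api/place/findplacefromtext/json"
API_KEY = os.environ["GOOGLE_PLACES_API_KEY"]  # assumed env var

def lookup_business_status(name: str, address: str) -> str | None:
    """Query the Places API "Find Place" endpoint and return the matched
    site's business_status: OPERATIONAL, CLOSED_TEMPORARILY, or
    CLOSED_PERMANENTLY (None if no candidate matched)."""
    resp = requests.get(
        PLACES_URL,
        params={
            "input": f"{name} {address}",
            "inputtype": "textquery",
            "fields": "business_status,formatted_address",
            "key": API_KEY,
        },
        timeout=10,
    )
    resp.raise_for_status()
    candidates = resp.json().get("candidates", [])
    if not candidates:
        return None
    return candidates[0].get("business_status")

def flag_record(record: dict) -> str:
    """Illustrative flagging heuristic: closed if Places says so,
    manual review if the site could not be matched at all."""
    status = lookup_business_status(record["name"], record["address"])
    if status == "CLOSED_PERMANENTLY":
        return "closed"
    if status is None:
        return "needs_manual_review"
    return "open"
```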
Outcome: The collaboration between the national education non-profit and Ruffin Galactic resulted in a more streamlined, accurate, and enriched database. The reporting provided by Ruffin Galactic offered invaluable insights into the programs collected over the past 20 years, marking the first time the organization had access to such comprehensive information. The cleanup effort meticulously corrected the details of thousands of programs, further solidifying the database's reliability. Above all, this collaboration ensured the organization's mission of seamlessly connecting learners with the educational resources they need was not only maintained but enhanced, guaranteeing a lasting positive impact on the communities they serve.
Technical Notes
PostgreSQL
PostgreSQL is a powerful and flexible relational database management system, but it is optimized for transactional workloads rather than for aggregating and analyzing large volumes of data. Its built-in support for aggregation and analysis is not as efficient or user-friendly as specialized big-data tools.
Google BigQuery
BigQuery is a serverless, cost-effective enterprise data warehouse that works across clouds and scales with your data. It provides a single, unified interface for all data practitioners, simplifying analytics workflows from data ingestion and preparation through exploration and visualization to ML model creation and use. One of BigQuery's key advantages is its easy integration with a wide variety of analytical environments and business intelligence (BI) tools.
Benefits
- Data Analytics: Datastream, Google Cloud's change data capture service, can keep the replicated BigQuery data continuously up to date, enabling near-real-time analytics and reporting.
- Auto-scaling: BigQuery automatically scales both up and down based on data volume and query complexity.
- Cost-Effective Storage: There is no infrastructure to provision or pay for; you pay only for storing data and for queries, billed per terabyte of data analyzed (the first 1 TB per month is free under on-demand pricing).
- Simplified Data Management: Replicating data to BigQuery eliminates the need for manual ETL processes, reducing data management complexity.
- Data Security: All data in BigQuery is encrypted at rest by default, with no user configuration required.
- ML Capabilities: BigQuery ML allows machine learning models to be created and trained using standard SQL (see the example below).
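As a quick illustration of that last point, the snippet below trains a logistic regression model with a single SQL statement submitted through the BigQuery Python client. The dataset, table, and column names are hypothetical placeholders, not the non-profit's actual schema.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project credentials

# CREATE MODEL is plain SQL in BigQuery ML. The dataset, table, and
# column names below are hypothetical placeholders.
sql = """
CREATE OR REPLACE MODEL `my_dataset.program_activity_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['is_active']) AS
SELECT state, enrollment_count, years_operating, is_active
FROM `my_dataset.programs`
"""
client.query(sql).result()  # blocks until training finishes
```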
Implementation
High-level ETL (Batch) Process to Set Up PostgreSQL to BigQuery Integration
- Extract Data from PostgreSQL
- Transform Data - convert to BigQuery's format (encoding, key constraints, and column types)
- Upload to Google Cloud Storage (GCS)
- Upload to the BigQuery Table from GCS
- Update Target Table in BigQuery
- Automate these steps using Docker and Cloud Composer
A Docker container runs the extraction, transformation, and upload scripts on a schedule (as a cron job) and uploads the results to GCS as CSV or JSON files, as sketched below.
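A minimal sketch of such an extraction-and-upload script follows. The DSN, bucket name, table, and columns are illustrative assumptions, not the actual production values.

```python
import csv
import psycopg2
from google.cloud import storage

# Hypothetical connection string, bucket, and schema for illustration.
PG_DSN = "dbname=programs user=etl host=replica.internal"
BUCKET = "nonprofit-etl-staging"
LOCAL_FILE = "/tmp/programs.csv"

def extract_to_csv() -> None:
    """Extract program records from the PostgreSQL read replica to a CSV file."""
    conn = psycopg2.connect(PG_DSN)
    with conn, conn.cursor() as cur:
        cur.execute("SELECT id, name, address, status FROM programs")
        with open(LOCAL_FILE, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col.name for col in cur.description])  # header row
            writer.writerows(cur)  # stream rows straight from the cursor
    conn.close()

def upload_to_gcs() -> None:
    """Stage the CSV in Cloud Storage, where the Cloud Function below
    picks it up and loads it into BigQuery."""
    bucket = storage.Client().bucket(BUCKET)
    bucket.blob("exports/programs.csv").upload_from_filename(LOCAL_FILE)

if __name__ == "__main__":
    extract_to_csv()
    upload_to_gcs()
```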
A Cloud Function listens for an object-finalize (put) event on the GCS bucket, which triggers it to load the new file into BigQuery.
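A first-generation Cloud Function subscribed to the bucket's finalize event could perform that load. The sketch below assumes a CSV extract whose header row matches the target table; the table ID is a placeholder.

```python
from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.programs"  # hypothetical target table

def load_to_bigquery(event, context):
    """Entry point for a 1st-gen Cloud Function triggered by a
    google.storage.object.finalize event on the staging bucket."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    # Load the staged file directly from GCS into the target table.
    bigquery.Client().load_table_from_uri(
        uri, TABLE_ID, job_config=job_config
    ).result()
```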