Air pollution in Mexican cities routinely exceeds WHO guidelines, yet measurement data from 300+ monitoring stations across 121 cities is scattered and inconsistently formatted. This pipeline centralizes, cleans, and transforms those readings into analytics-ready tables, enabling analysts and public health stakeholders to identify pollution hotspots, track compliance against WHO thresholds, and act on near-real-time air quality trends.
Architecture & Stack
- Ingestion → Dual-path: batch loading from OpenAQ S3 archive (historical CSV.gz, 192K+ rows) + near-real-time streaming via Kafka-compatible Redpanda broker
- Storage → Google BigQuery: 3-layer warehouse (
air_quality_raw→air_quality_staging→air_quality_marts), date-partitioned and clustered by pollutant parameter - Transformation → Declarative SQL assets via Bruin CLI: deduplication, unit normalization (ppm → µg/m³), outlier filtering, WHO AQI category classification, city-day aggregation
- Orchestration → Bruin CLI DAG scheduler with 17 automated inline data quality checks
- Infrastructure → Terraform (GCS bucket + BigQuery datasets), Docker Compose (Redpanda broker)
- Visualization → Looker Studio dashboard with KPI scorecards, bubble map, city rankings, and time-series trends
Key Technical Achievements
Incremental & idempotent batch loading — A configurable lookback window (LOOKBACK_DAYS=7) combined with a delete-then-append pattern ensures each daily run processes only recent data without duplicates, reducing BigQuery compute costs versus a full-table refresh.
Dual ingestion architecture (batch + streaming) — Kafka producer/consumer pair using Redpanda and confluent-kafka, with micro-batching (200 records or 15-second flush interval) to balance latency against BigQuery insertion costs.
Dimensional star schema — Fact table (fct_city_daily_aqi) at city-day-parameter granularity with pre-computed WHO compliance flags and AQI category labels, alongside a dim_stations dimension table with reliability tier scoring.
Automated data quality enforcement — 17 Bruin-native quality checks (not_null, unique, non_negative, positive, accepted_values) declared inline with each SQL asset, combined with outlier filtering and idempotent Kafka offset commits.