Adrián López Rendón · projects

Mexico Air Quality Monitoring Pipeline

292 words 2 min read #Data Engineering#BigQuery#Kafka#Terraform

Air pollution in Mexican cities routinely exceeds WHO guidelines, yet measurement data from 300+ monitoring stations across 121 cities is scattered and inconsistently formatted. This pipeline centralizes, cleans, and transforms those readings into analytics-ready tables, enabling analysts and public health stakeholders to identify pollution hotspots, track compliance against WHO thresholds, and act on near-real-time air quality trends.

Architecture & Stack

  • Ingestion → Dual-path: batch loading from OpenAQ S3 archive (historical CSV.gz, 192K+ rows) + near-real-time streaming via Kafka-compatible Redpanda broker
  • Storage → Google BigQuery: 3-layer warehouse (air_quality_rawair_quality_stagingair_quality_marts), date-partitioned and clustered by pollutant parameter
  • Transformation → Declarative SQL assets via Bruin CLI: deduplication, unit normalization (ppm → µg/m³), outlier filtering, WHO AQI category classification, city-day aggregation
  • Orchestration → Bruin CLI DAG scheduler with 17 automated inline data quality checks
  • Infrastructure → Terraform (GCS bucket + BigQuery datasets), Docker Compose (Redpanda broker)
  • Visualization → Looker Studio dashboard with KPI scorecards, bubble map, city rankings, and time-series trends

Key Technical Achievements

Incremental & idempotent batch loading — A configurable lookback window (LOOKBACK_DAYS=7) combined with a delete-then-append pattern ensures each daily run processes only recent data without duplicates, reducing BigQuery compute costs versus a full-table refresh.

Dual ingestion architecture (batch + streaming) — Kafka producer/consumer pair using Redpanda and confluent-kafka, with micro-batching (200 records or 15-second flush interval) to balance latency against BigQuery insertion costs.

Dimensional star schema — Fact table (fct_city_daily_aqi) at city-day-parameter granularity with pre-computed WHO compliance flags and AQI category labels, alongside a dim_stations dimension table with reliability tier scoring.

Automated data quality enforcement — 17 Bruin-native quality checks (not_null, unique, non_negative, positive, accepted_values) declared inline with each SQL asset, combined with outlier filtering and idempotent Kafka offset commits.

Repository

github.com/sargent-mg/air-quality-bruin