Centralization of NYC taxi trip data from a paginated REST API into a structured local warehouse, enabling interactive analytical reporting on fare patterns, tip behavior, and temporal demand — reducing the friction between raw data availability and business insight consumption.
Architecture & Stack
- Ingestion →
dltwith a customQueryParamPaginatorclass fetching paginated JSON records from a REST API endpoint - Storage → DuckDB (local file
taxi_pipeline.duckdb, datasetnyc_taxi_data) - Transformation & Analysis → Ibis (DuckDB query engine) and Pandas for column-level aggregations, filtering, and derived metrics (tip percentage, hourly bucketing)
- Visualization / BI → Altair + Marimo reactive notebook (5 interactive charts: histogram, scatter, bar, line, heatmap)
- Environment →
uvwith frozenuv.lock; Python 3.14 pinned via.python-version
Key Technical Achievements
Custom Pagination Engine — Implemented a QueryParamPaginator class extending dlt’s BasePaginator, handling page-based API pagination with graceful termination on empty responses — preventing infinite loops and ensuring reliable full-dataset extraction across API pages.
Modular Codebase & Reproducible Environments — Separation of concerns enforced across distinct modules (taxi_pipeline.py for ingestion, taxi_insights_marimo.py for analytics); dependencies managed with uv and a committed lock file guaranteeing bit-for-bit reproducible builds.
Append-Only Fact Table with Schema Inference — dlt automatically infers and evolves the nyc_taxi_trips schema from raw JSON payloads, writing to a single denormalized fact table ready to evolve into a dimensional model.
In-Query Data Quality Controls — Value-level validation at the transformation layer: fare_amt > 0 and tip_amt >= 0 filters exclude corrupt records; trip_distance <= 20 caps outliers for histogram accuracy; scatter plots limited to 2,000 sampled rows for rendering performance.