NYC Taxi ELT Pipeline

Centralization of NYC taxi trip data from a paginated REST API into a structured local warehouse, enabling interactive analytical reporting on fare patterns, tip behavior, and temporal demand — reducing the friction between raw data availability and business insight consumption.

Architecture & Stack

Ingestion → dlt with a custom QueryParamPaginator class fetching paginated JSON records from a REST API endpoint
Storage → DuckDB (local file taxi_pipeline.duckdb, dataset nyc_taxi_data)
Transformation & Analysis → Ibis (DuckDB query engine) and Pandas for column-level aggregations, filtering, and derived metrics (tip percentage, hourly bucketing)
Visualization / BI → Altair + Marimo reactive notebook (5 interactive charts: histogram, scatter, bar, line, heatmap)
Environment → uv with frozen uv.lock; Python 3.14 pinned via .python-version

Key Technical Achievements

Custom Pagination Engine — Implemented a QueryParamPaginator class extending dlt’s BasePaginator, handling page-based API pagination with graceful termination on empty responses — preventing infinite loops and ensuring reliable full-dataset extraction across API pages.

Modular Codebase & Reproducible Environments — Separation of concerns enforced across distinct modules (taxi_pipeline.py for ingestion, taxi_insights_marimo.py for analytics); dependencies managed with uv and a committed lock file guaranteeing bit-for-bit reproducible builds.

Append-Only Fact Table with Schema Inference — dlt automatically infers and evolves the nyc_taxi_trips schema from raw JSON payloads, writing to a single denormalized fact table ready to evolve into a dimensional model.

In-Query Data Quality Controls — Value-level validation at the transformation layer: fare_amt > 0 and tip_amt >= 0 filters exclude corrupt records; trip_distance <= 20 caps outliers for histogram accuracy; scatter plots limited to 2,000 sampled rows for rendering performance.

Repository

github.com/sargent-mg/taxi-dlt-pipeline