An ELT pipeline that centralizes bibliographic data from the Open Library public API in a local analytical database, enabling structured exploration of book and author metadata to support data-driven decisions and downstream analytics.
Architecture & Stack
- Ingestion → dlt REST API source with rest_api_resources()
- Storage → DuckDB (embedded, file-based analytical database)
- Transformation & Querying → Ibis SQL abstraction layer
- Visualization → Altair inside a Marimo interactive notebook
- Environment → uv with a deterministic lock file
Key Technical Achievements
Declarative REST API Ingestion — Configured a dlt REST API source using @dlt.source and rest_api_resources(), with a replace write disposition that guarantees idempotent full-refresh loads and automatic schema inference from nested JSON responses (e.g., books__authors child table auto-generated by dlt).
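A minimal sketch of what such a declarative source could look like. The endpoint path (`subjects/science_fiction.json`), the `works` data selector, the `single_page` paginator, and the pipeline and dataset names are illustrative assumptions, not taken from the project; only `@dlt.source`, `rest_api_resources()`, the `replace` write disposition, and the `books` table name come from the description above.

```python
# Declarative REST API configuration (a plain dict, so it can be inspected on its own).
OPEN_LIBRARY_CONFIG = {
    "client": {"base_url": "https://openlibrary.org/"},
    # "replace" drops and reloads the tables on every run (idempotent full refresh).
    "resource_defaults": {"write_disposition": "replace"},
    "resources": [
        {
            "name": "books",  # becomes the `books` table in DuckDB
            "endpoint": {
                "path": "subjects/science_fiction.json",  # assumed endpoint
                "params": {"limit": 100},
                "data_selector": "works",  # assumed key holding the record list
                "paginator": "single_page",  # assumed: fetch one page only
            },
        },
    ],
}


def open_library_source():
    """Build the dlt source from the configuration above."""
    # Imported lazily so the config dict stays usable without dlt installed.
    import dlt
    from dlt.sources.rest_api import rest_api_resources

    @dlt.source(name="open_library")
    def _source():
        # rest_api_resources() turns the dict into dlt resources.
        yield from rest_api_resources(OPEN_LIBRARY_CONFIG)

    return _source()


def run_pipeline():
    """Load the source into a local DuckDB file and return dlt's load info."""
    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="open_library",       # hypothetical name
        destination="duckdb",
        dataset_name="open_library_data",   # hypothetical dataset (schema) name
    )
    return pipeline.run(open_library_source())
```

Calling `run_pipeline()` performs the network request and the load. With a nested `authors` list in each record, dlt would be expected to unpack it into a `books__authors` child table, as described above.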
Dependency Isolation & Modern Tooling — Project dependencies declared in pyproject.toml and locked with uv.lock via uv, ensuring fully reproducible environments. Secrets and runtime configuration are isolated in .dlt/secrets.toml and .dlt/config.toml (excluded from version control), following security best practices for credential management.
Structured Query Layer — Raw nested API responses are normalized by dlt into relational DuckDB tables (books, books__authors). An Ibis query layer sits on top to perform author-level aggregations (book count per author, top-10 ranking), decoupling raw storage from analytical consumption and providing a reusable SQL abstraction ready to evolve into a dimensional model.
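The author-level aggregation could be expressed along these lines. This is a sketch rather than the project's actual query: it assumes the `books__authors` table exposes a `name` column for the author and dlt's `_dlt_parent_id` column linking each child row back to its `books` row. The function takes an Ibis table expression, so it does not depend on how the connection is opened.

```python
def top_authors(authors, limit=10):
    """Rank authors by the number of distinct books they appear on.

    `authors` is an Ibis table expression over the books__authors child table.
    Assumed columns: `name` (author name) and `_dlt_parent_id` (dlt's link
    from a child row to its parent row in `books`).
    """
    counts = authors.group_by("name").aggregate(
        # Count distinct parent rows so a duplicated author entry counts once.
        book_count=authors["_dlt_parent_id"].nunique()
    )
    # Highest counts first; ties are broken alphabetically for a stable ranking.
    return counts.order_by(
        [counts["book_count"].desc(), counts["name"]]
    ).limit(limit)
```

The table expression would typically come from an Ibis DuckDB connection opened with `ibis.duckdb.connect()` on the database file that dlt wrote. Because the result is itself an Ibis expression, it can be executed to a DataFrame or composed further, for example as the input to an Altair chart.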