Open Library DLT Pipeline

An ELT pipeline that loads bibliographic data from the Open Library public API into a local analytical database, so that book and author metadata can be explored in structured form and consumed by downstream analytics.

Architecture & Stack

Ingestion → dlt REST API source with rest_api_resources()
Storage → DuckDB (embedded, file-based analytical database)
Transformation & Querying → Ibis (Python expressions compiled to DuckDB SQL)
Visualization → Altair inside a Marimo interactive notebook
Environment → uv with a deterministic lock file

Key Technical Achievements

Declarative REST API Ingestion: Configured a dlt REST API source using @dlt.source and rest_api_resources(). A replace write disposition makes every run an idempotent full-refresh load, and dlt infers the schema automatically from the nested JSON responses (e.g., the books__authors child table is generated by dlt from the nested author lists).
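A minimal sketch of what such a declarative source could look like, not the repository's actual code. The subjects endpoint, the subject name, the pipeline and dataset names, and the single-page paginator are all assumptions chosen for illustration; the import path assumes a dlt release that ships the REST API source under dlt.sources.rest_api.

```python
"""Sketch of a declarative dlt REST API source for Open Library."""

# Assumption: the subjects endpoint is used here because each entry under
# "works" carries a nested "authors" list, which dlt would unnest into a
# books__authors child table. The real project may use another endpoint.
OPEN_LIBRARY_CONFIG = {
    "client": {"base_url": "https://openlibrary.org/"},
    # Full refresh: the destination tables are rebuilt on every run.
    "resource_defaults": {"write_disposition": "replace"},
    "resources": [
        {
            "name": "books",
            "endpoint": {
                "path": "subjects/science_fiction.json",  # hypothetical subject
                "params": {"limit": 100},
                "data_selector": "works",  # records live under the "works" key
                "paginator": "single_page",  # keep the sample to one request
            },
        }
    ],
}


def build_source():
    """Wrap the config in a dlt source (dlt is imported only when called)."""
    import dlt
    from dlt.sources.rest_api import rest_api_resources

    @dlt.source(name="open_library")
    def open_library_source():
        yield from rest_api_resources(OPEN_LIBRARY_CONFIG)

    return open_library_source()


def run_pipeline():
    """Load the source into a local DuckDB file and return the load info."""
    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="open_library_pipeline",  # hypothetical name
        destination="duckdb",
        dataset_name="open_library_data",  # hypothetical name
    )
    return pipeline.run(build_source())
```

Calling run_pipeline() twice should leave the same rows in place rather than duplicating them, which is the practical meaning of the replace disposition.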

Dependency Isolation & Modern Tooling: Project dependencies are declared in pyproject.toml and pinned in uv.lock by uv, which keeps environments reproducible across machines. Secrets and runtime configuration are isolated in .dlt/secrets.toml and .dlt/config.toml (excluded from version control), in line with common practice for credential management.

Structured Query Layer: dlt normalizes the raw nested API responses into relational DuckDB tables (books, books__authors). An Ibis query layer on top performs author-level aggregations (book count per author, top-10 ranking). This decouples raw storage from analytical consumption and provides a reusable query abstraction that can later evolve into a dimensional model.
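The author-level aggregation could be expressed roughly as follows. This is a sketch under assumptions: the author column is taken to be called name, the DuckDB file and dataset names are hypothetical, and the database= argument of con.table() assumes a recent Ibis release (older releases used schema= instead).

```python
"""Sketch of a top-N authors query with Ibis over the dlt-loaded tables."""

AUTHORS_TABLE = "books__authors"  # child table name mentioned in this README
AUTHOR_COLUMN = "name"  # assumed column name; depends on the API payload


def top_authors(authors, n=10):
    """Build an Ibis expression ranking authors by number of books.

    `authors` is an Ibis table expression with one row per (book, author).
    """
    counts = authors.group_by(AUTHOR_COLUMN).aggregate(
        book_count=authors.count()
    )
    # Most prolific authors first, truncated to the top n.
    return counts.order_by(counts.book_count.desc()).limit(n)


def load_top_authors(
    db_path="open_library_pipeline.duckdb",  # hypothetical file name
    dataset="open_library_data",  # hypothetical dlt dataset (DuckDB schema)
    n=10,
):
    """Run the ranking against the DuckDB file and return a pandas DataFrame."""
    import ibis  # imported lazily so the module loads without Ibis installed

    con = ibis.duckdb.connect(db_path, read_only=True)
    authors = con.table(AUTHORS_TABLE, database=dataset)
    return top_authors(authors, n).to_pandas()
```

Keeping top_authors() free of any connection logic means the same expression can be reused in the notebook or pointed at a different backend, which is the decoupling described above.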

Repository

github.com/sargent-mg/open-library-dlt-pipeline