Adrián López Rendón · projects

Open Library DLT Pipeline

#Data Engineering #DuckDB #dlt #Python

An ELT pipeline that loads bibliographic data from the Open Library public API into a local DuckDB analytical database, so that book and author metadata can be explored with structured queries and reused by downstream analytics.

Architecture & Stack

  • Ingestion → dlt REST API source with rest_api_resources()
  • Storage → DuckDB (embedded, file-based analytical database)
  • Transformation & Querying → Ibis SQL abstraction layer
  • Visualization → Altair inside a Marimo interactive notebook
  • Environment → uv with a deterministic lock file

Key Technical Achievements

Declarative REST API Ingestion: Configured a dlt REST API source using @dlt.source and rest_api_resources(). A replace write disposition makes every run an idempotent full refresh, and dlt infers the schema automatically from the nested JSON responses (for example, it generates the books__authors child table on its own).
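As a rough illustration of the ingestion described above, here is a minimal sketch of a declarative dlt REST API source. The endpoint (subjects/{subject}.json with its works list), the subject value, and the pipeline and dataset names are assumptions made for the example, not details taken from the repository.

```python
def open_library_config(subject="science_fiction", limit=50):
    """Build a declarative config dict for dlt's REST API source.

    The endpoint, subject and limit are illustrative assumptions.
    """
    return {
        "client": {"base_url": "https://openlibrary.org/"},
        # "replace" reloads each table on every run (full refresh).
        "resource_defaults": {"write_disposition": "replace"},
        "resources": [
            {
                "name": "books",
                "endpoint": {
                    "path": f"subjects/{subject}.json",
                    "params": {"limit": limit},
                    # Records live under the "works" key of the response.
                    "data_selector": "works",
                    # Fetch one page only; no pagination is followed.
                    "paginator": {"type": "single_page"},
                },
            }
        ],
    }


def run_pipeline():
    """Load the configured resources into a local DuckDB file."""
    # Imported lazily so the config above can be inspected without dlt installed.
    import dlt
    from dlt.sources.rest_api import rest_api_resources

    @dlt.source(name="open_library")
    def open_library_source():
        yield from rest_api_resources(open_library_config())

    pipeline = dlt.pipeline(
        pipeline_name="open_library",      # hypothetical name
        destination="duckdb",
        dataset_name="open_library_data",  # hypothetical schema name
    )
    return pipeline.run(open_library_source())
```

Calling run_pipeline() would hit the live API and write a DuckDB file next to the script. Under these assumptions, each work's nested authors list is what dlt would unpack into the books__authors child table.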

Dependency Isolation & Modern Tooling: Project dependencies are declared in pyproject.toml and pinned by uv in uv.lock, so environments are reproducible. Secrets and runtime configuration live in .dlt/secrets.toml and .dlt/config.toml, both excluded from version control, in line with common practice for credential management.

Structured Query Layer: dlt normalizes the raw nested API responses into relational DuckDB tables (books, books__authors). An Ibis query layer on top performs author-level aggregations (book count per author, top-10 ranking). This decouples raw storage from analytical consumption and provides a reusable SQL abstraction that can later evolve into a dimensional model.
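The author-level aggregation could look roughly like the sketch below. It assumes the books__authors table exposes a name column, and that the DuckDB file and schema are called open_library.duckdb and open_library_data; all three are guesses for illustration and may differ from the actual project.

```python
def top_authors(authors, n=10):
    """Rank authors by book count, given an Ibis table with a `name` column."""
    counts = authors.group_by("name").aggregate(book_count=authors.count())
    # Sort by count descending; break ties by name so the ranking is stable.
    return counts.order_by([counts["book_count"].desc(), counts["name"]]).limit(n)


def load_top_authors(db_path="open_library.duckdb", schema="open_library_data"):
    """Run the ranking against the DuckDB file written by the pipeline."""
    import ibis  # imported lazily; requires ibis-framework with the duckdb extra

    con = ibis.duckdb.connect(db_path)
    # `database=` selects the schema in recent Ibis releases (assumption).
    authors = con.table("books__authors", database=schema)
    return top_authors(authors).to_pandas()
```

Because top_authors only builds an expression, the same function can be reused against any Ibis backend or an in-memory table, and nothing executes until the result is materialized.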

Repository

github.com/sargent-mg/open-library-dlt-pipeline