Adrián López Rendón · projects

Media Hit Prediction ML

#MLOps #Python #BentoML #MLflow

Film studios take on high financial risk when they allocate marketing budgets and production resources before a movie’s release. This project delivers an end-to-end MLOps pipeline that predicts whether a film will be a commercial and critical “Hit”, defined as a critic score (vote_average) ≥ 7.5 and audience engagement (vote_count) ≥ 1,000 votes, so that resource prioritization and campaign targeting can be decided from data before theatrical release.
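The label definition above can be written as a two-condition rule. The following is a minimal sketch, assuming the critic score and vote count come from the vote_average and vote_count columns named under Key Technical Achievements; the function name, constant names, and example records are hypothetical and not taken from the repository.

```python
HIT_MIN_SCORE = 7.5    # threshold on vote_average, from the definition above
HIT_MIN_VOTES = 1_000  # threshold on vote_count, from the definition above


def is_hit(vote_average: float, vote_count: int) -> int:
    """Return 1 for a "Hit" and 0 otherwise; both conditions must hold."""
    return int(vote_average >= HIT_MIN_SCORE and vote_count >= HIT_MIN_VOTES)


# Illustrative records, not taken from the dataset.
examples = [(8.1, 12_000), (8.1, 300), (6.9, 12_000), (7.5, 1_000)]
labels = [is_hit(score, votes) for score, votes in examples]
print(labels)  # [1, 0, 0, 1]
```

A highly rated film with few votes is still labeled a flop, which keeps obscure titles with a handful of enthusiastic ratings out of the positive class.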

Architecture & Stack

| Layer | Tool / Technology | Role |
|---|---|---|
| Data Source | TMDB 5000 Movies Dataset (CSV) | Raw movie metadata for training |
| Feature Engineering | Python · Pandas · NumPy | Preprocessing, JSON parsing, log transforms, imputation |
| ML Training & Evaluation | scikit-learn (Logistic Regression · DictVectorizer) | Model training, one-hot encoding, AUC-ROC evaluation |
| Experiment Tracking | MLflow (SQLite backend) | Parameter logging, metric tracking, model versioning |
| Model Serving | BentoML · Uvicorn | REST API (/predict) serving HIT/FLOP probability |
| Containerization | Docker (via BentoML) | Reproducible deployment on python:3.12-slim |
| Dependency Management | uv | Locked, reproducible environment across machines |
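The JSON parsing listed under Feature Engineering is needed because the TMDB 5000 CSV stores columns such as genres as JSON-encoded lists of objects. The sketch below shows one way to flatten such a cell using only the standard library. The helper name parse_names and the choice to keep only the first genre are assumptions for illustration; the repository may handle these columns differently.

```python
import json


def parse_names(raw: object) -> list[str]:
    """Extract the "name" field from a JSON-encoded list of objects.

    Returns an empty list for missing, empty, or malformed input so that a
    single bad cell never aborts preprocessing.
    """
    if not isinstance(raw, str) or not raw:
        return []
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]


raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
genres = parse_names(raw_genres)

# Hypothetical choice: keep the first genre as a single categorical feature.
primary_genre = genres[0] if genres else "Unknown"
print(genres, primary_genre)
```

Reducing each list to one string keeps the feature compatible with dictionary-based one-hot encoding, at the cost of discarding secondary genres.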

Key Technical Achievements

Leakage-safe feature engineering — Eight input features carefully selected to exclude the target-defining columns (vote_average, vote_count), preventing data leakage and ensuring valid out-of-sample evaluation via AUC-ROC on an 80/20 split.
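To make the leakage guard and the evaluation metric concrete, here is a dependency-free sketch. It is not the project's code: the project uses scikit-learn, while this version re-implements the 80/20 split and AUC-ROC by hand so the mechanics are visible. The function names, the seed, and the runtime field are assumptions for illustration.

```python
import random

TARGET_COLUMNS = ("vote_average", "vote_count")  # used only to build the label


def to_features(record: dict) -> dict:
    """Copy a record without the target-defining columns."""
    return {k: v for k, v in record.items() if k not in TARGET_COLUMNS}


def split_80_20(rows: list, seed: int = 42) -> tuple[list, list]:
    """Shuffle deterministically, then cut at 80% for train and 20% for test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def auc_roc(labels: list[int], scores: list[float]) -> float:
    """AUC-ROC as the share of (hit, flop) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one hit and one flop")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical record: the label columns are present in the raw row only.
record = {"budget": 1.5e8, "runtime": 120, "vote_average": 8.0, "vote_count": 5000}
print(sorted(to_features(record)))  # ['budget', 'runtime']

train, test = split_80_20(list(range(10)))
print(len(train), len(test))  # 8 2

print(auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

If vote_average or vote_count were left in the feature dictionary, a model could reconstruct the label directly and report a near-perfect AUC that would not hold for unreleased films.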

Modular, clean-architecture codebase — Responsibilities separated across single-purpose modules: data_utils.py (feature engineering), train.py (pipeline orchestration), and service.py (API definition), with shared constants defined once and imported where needed.

Reproducible MLOps environment — Dependency management via uv with a committed uv.lock and a pinned .python-version (Python 3.12). MLflow tracks every training run — hyperparameters (C), metrics (test_auc), and BentoML model tags — enabling full experiment lineage.

Distribution-aware preprocessing: budget and revenue features are log-transformed via log1p(x) to compress their heavily right-skewed distributions. Missing numerical values are imputed with the column median, and missing categorical values default to "Unknown", so no training record is silently dropped.
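The preprocessing steps described above can be sketched with the standard library alone. This is an illustration under assumptions, not the project's data_utils.py: the order (impute, then transform), the helper name impute_median, the sample values, and the languages column are all hypothetical.

```python
import math
import statistics


def impute_median(values: list) -> list:
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


# Illustrative budgets in dollars; None marks a missing value.
budgets = [0, 1_000_000, None, 250_000_000]
filled = impute_median(budgets)  # median of the three observed values is 1_000_000

# log1p(x) = log(1 + x), so a budget of 0 maps to 0.0 instead of failing.
log_budgets = [math.log1p(b) for b in filled]

# Hypothetical categorical column; gaps become the literal "Unknown".
languages = ["en", None, "fr"]
languages = [lang if lang else "Unknown" for lang in languages]

print(filled)
print([round(x, 2) for x in log_budgets])
print(languages)
```

Using log1p rather than a plain logarithm matters here because a zero in these columns would otherwise produce a math domain error.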

Repository

github.com/sargent-mg/media-hit-prediction-ml