Media Hit Prediction ML

Film studios face high financial risk when allocating marketing budgets and production resources before a movie’s release. This project delivers an end-to-end MLOps pipeline that predicts whether a film will be a commercial and critical “Hit”, defined as a critic score (vote_average) ≥ 7.5 together with audience engagement (vote_count) ≥ 1,000 votes. The prediction supports data-driven decisions on resource prioritization and campaign targeting before theatrical release.
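
The Hit definition can be sketched as a small labeling function. This is a minimal illustration, not the project's actual code: the function name `is_hit` and the constant names are hypothetical, and it assumes the critic score and vote total correspond to the `vote_average` and `vote_count` columns.

```python
# Thresholds taken from the Hit definition stated in the text.
CRITIC_SCORE_MIN = 7.5   # minimum vote_average
VOTE_COUNT_MIN = 1000    # minimum vote_count


def is_hit(vote_average: float, vote_count: int) -> bool:
    """Return True when both thresholds are met (hypothetical helper)."""
    return vote_average >= CRITIC_SCORE_MIN and vote_count >= VOTE_COUNT_MIN


if __name__ == "__main__":
    print(is_hit(8.1, 12000))  # both thresholds met
    print(is_hit(8.1, 400))    # high score, too few votes
```

Both conditions must hold, so a highly rated film with few votes is still labeled a flop under this definition.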

Architecture & Stack

Layer	Tool / Technology	Role
Data Source	TMDB 5000 Movies Dataset (CSV)	Raw movie metadata for training
Feature Engineering	Python · Pandas · NumPy	Preprocessing, JSON parsing, log transforms, imputation
ML Training & Evaluation	scikit-learn (Logistic Regression · DictVectorizer)	Model training, one-hot encoding, AUC-ROC evaluation
Experiment Tracking	MLflow (SQLite backend)	Parameter logging, metric tracking, model versioning
Model Serving	BentoML · Uvicorn	REST API (`/predict`) serving HIT/FLOP probability
Containerization	Docker (via BentoML)	Reproducible deployment on `python:3.12-slim`
Dependency Management	uv	Locked, reproducible environment across machines

Key Technical Achievements

Leakage-safe feature engineering: The eight input features exclude the target-defining columns (vote_average, vote_count), which prevents the label from leaking into the inputs. The model is evaluated out of sample via AUC-ROC on the held-out portion of an 80/20 split.
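
The leakage guard and the evaluation setup can be illustrated without any dependencies. The sketch below is an assumption-laden stand-in, not the project's scikit-learn code: the helper names (`drop_target_columns`, `split_80_20`, `auc_roc`) are hypothetical, the split seed is arbitrary, and AUC is computed by direct pairwise comparison rather than by a library call.

```python
import random

# Columns that define the Hit label; they must never reach the model as inputs.
TARGET_COLUMNS = {"vote_average", "vote_count"}


def drop_target_columns(record: dict) -> dict:
    """Return a copy of the record without the target-defining columns."""
    return {k: v for k, v in record.items() if k not in TARGET_COLUMNS}


def split_80_20(rows, seed: int = 42):
    """Shuffle deterministically and hold out the last 20% of rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def auc_roc(labels, scores) -> float:
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because AUC depends only on how scores rank positives against negatives, it stays informative when Hits are a minority class, which is one plausible reason to prefer it over accuracy here.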

Modular, clean-architecture codebase: Responsibilities are separated across single-purpose modules: data_utils.py (feature engineering), train.py (pipeline orchestration), and service.py (API definition). Shared constants are defined once and imported where needed.

Reproducible MLOps environment: Dependencies are managed with uv, using a committed uv.lock and a pinned .python-version (Python 3.12). MLflow tracks every training run, recording hyperparameters (C), metrics (test_auc), and BentoML model tags, which gives full experiment lineage.

Distribution-aware preprocessing: The budget and revenue features are log-transformed via log1p(x) to compress their heavily right-skewed distributions. Missing numerical values are imputed with the column median, and missing categorical values default to "Unknown", so no training record is silently dropped.
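
A minimal sketch of these preprocessing steps using only the standard library follows. It rests on several assumptions: missing values are represented as `None`, imputation happens before the log transform (the text does not state the order), and the helper names `impute_and_log` and `fill_category` are hypothetical rather than taken from data_utils.py.

```python
import math
from statistics import median


def impute_and_log(values):
    """Fill None with the median of observed values, then apply log1p."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    # log1p(x) = log(1 + x), so zero budgets or revenues map to 0.0.
    return [math.log1p(fill if v is None else v) for v in values]


def fill_category(value):
    """Replace a missing or empty categorical value with "Unknown"."""
    return value if value else "Unknown"


if __name__ == "__main__":
    budgets = [0, None, 100]
    print(impute_and_log(budgets))
    print(fill_category(None), fill_category("en"))
```

Using log1p rather than a plain logarithm keeps zero-valued entries well defined, which matters if some films report a budget or revenue of 0.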

Repository

github.com/sargent-mg/media-hit-prediction-ml