Film studios face high financial risk when allocating marketing budgets and production resources before a movie’s release. This project delivers an end-to-end MLOps pipeline that predicts whether a film will be a commercial and critical “Hit”, defined as a critic score (vote_average) ≥ 7.5 and audience engagement (vote_count) ≥ 1,000 votes. The prediction supports data-driven decisions on resource prioritization and campaign targeting ahead of the theatrical release.
Architecture & Stack
| Layer | Tool / Technology | Role |
|---|---|---|
| Data Source | TMDB 5000 Movies Dataset (CSV) | Raw movie metadata for training |
| Feature Engineering | Python · Pandas · NumPy | Preprocessing, JSON parsing, log transforms, imputation |
| ML Training & Evaluation | scikit-learn (Logistic Regression · DictVectorizer) | Model training, one-hot encoding, AUC-ROC evaluation |
| Experiment Tracking | MLflow (SQLite backend) | Parameter logging, metric tracking, model versioning |
| Model Serving | BentoML · Uvicorn | REST API (/predict) serving HIT/FLOP probability |
| Containerization | Docker (via BentoML) | Reproducible deployment on python:3.12-slim |
| Dependency Management | uv | Locked, reproducible environment across machines |
Key Technical Achievements
Leakage-safe feature engineering: Eight input features are selected to exclude the target-defining columns (vote_average, vote_count). This prevents data leakage and keeps the out-of-sample evaluation (AUC-ROC on an 80/20 split) valid.
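The labeling rule and the leakage guard can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the helper names `is_hit` and `drop_leaky_columns`, the constant names, and the sample record are hypothetical. Only the thresholds and the column names vote_average and vote_count come from the description above.

```python
# Hypothetical sketch of the label definition and the leakage guard.
HIT_MIN_SCORE = 7.5   # threshold on vote_average, as stated in the description
HIT_MIN_VOTES = 1000  # threshold on vote_count, as stated in the description
TARGET_COLUMNS = ("vote_average", "vote_count")  # columns that define the label


def is_hit(record: dict) -> int:
    """Return 1 if the record meets both Hit thresholds, else 0."""
    return int(
        record["vote_average"] >= HIT_MIN_SCORE
        and record["vote_count"] >= HIT_MIN_VOTES
    )


def drop_leaky_columns(record: dict) -> dict:
    """Return a copy of the record without the target-defining columns."""
    return {k: v for k, v in record.items() if k not in TARGET_COLUMNS}


# Illustrative record; the values and the choice of feature columns are made up.
movie = {
    "budget": 50_000_000,
    "original_language": "en",
    "vote_average": 7.8,
    "vote_count": 2400,
}
label = is_hit(movie)                 # computed from the target columns
features = drop_leaky_columns(movie)  # what the model is allowed to see
```

Computing the label first and then stripping the two columns means the model never sees the values that define its own target. A dictionary such as `features` is also the per-record shape that scikit-learn's DictVectorizer accepts as input.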
Modular, clean-architecture codebase: Responsibilities are separated across single-purpose modules: data_utils.py (feature engineering), train.py (pipeline orchestration), and service.py (API definition). Shared constants are defined once and imported where needed.
Reproducible MLOps environment: Dependencies are managed with uv, using a committed uv.lock and a pinned .python-version (Python 3.12). MLflow records the hyperparameters (C), metrics (test_auc), and BentoML model tags of every training run, enabling full experiment lineage.
Distribution-aware preprocessing: Budget and revenue features are log-transformed via log1p(x) to stabilize their heavily right-skewed distributions. Missing numerical values are imputed with medians, and missing categorical values default to "Unknown", so no training record is silently dropped.
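A dependency-free sketch of this preprocessing is shown below. The function names `impute_and_log` and `fill_category` and the sample values are hypothetical, and the order of operations (impute first, then transform) is an assumption; the description states only that both steps happen.

```python
import math
import statistics


def impute_and_log(values):
    """Fill missing entries with the median of the observed ones, then apply log1p.

    Assumption: imputation happens before the log transform.
    """
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [math.log1p(median if v is None else v) for v in values]


def fill_category(value):
    """Replace a missing or empty categorical value with "Unknown"."""
    return value if value else "Unknown"


# Illustrative budgets, one of them missing; the numbers are made up.
budgets = [0, 1_000_000, None, 250_000_000]
log_budgets = impute_and_log(budgets)

language = fill_category(None)
```

Because log1p(0) is 0, a zero-valued budget stays finite after the transform, which a plain logarithm would not allow. With pandas and NumPy, the numeric step could plausibly be written as `np.log1p(df["budget"].fillna(df["budget"].median()))`; the column name here is illustrative.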