Airline Passenger Satisfaction ML

Airlines face intense competition where customer retention directly impacts revenue. Identifying dissatisfied passengers before they churn is critical. This project builds an end-to-end ML pipeline to predict passenger satisfaction in real time, enabling airline operations teams to trigger proactive customer service interventions and reduce churn through data-driven decision-making.

Architecture & Stack

Python 3.12 — core language for all training and serving logic
XGBoost + Scikit-Learn — gradient boosting champion model and logistic regression baseline
MLflow — experiment tracking; logs AUC metrics, hyperparameters, and model artifacts to a local SQLite store
BentoML — model packaging and production REST API serving with auto-generated Swagger UI (POST /predict)
Docker — containerization using a multi-stage build on python:3.11-slim, with libgomp1 installed because XGBoost needs the OpenMP runtime that slim images omit
uv — deterministic dependency management via uv.lock

Key Technical Achievements

Train-Serve Consistency — The DictVectorizer preprocessor is bundled directly into the BentoML model store as a custom object (custom_objects={"dv": dv}), guaranteeing that the same categorical encoding learned at training time is applied identically at inference time, eliminating training-serving skew.
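The idea behind bundling the preprocessor can be sketched in a few lines. This is a minimal illustration, not the project's train.py: the feature names and values are made up, and a pickle round-trip stands in for the BentoML model store so the example runs with scikit-learn alone.

```python
import pickle

from sklearn.feature_extraction import DictVectorizer

# Hypothetical training records; the real dataset has 24 features.
train_rows = [
    {"customer_type": "loyal", "class": "business", "flight_distance": 1200},
    {"customer_type": "disloyal", "class": "eco", "flight_distance": 300},
]

# Training time: learn the categorical encoding once.
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_rows)

# Stand-in for saving the vectorizer next to the model
# (the project does this via custom_objects={"dv": dv}).
blob = pickle.dumps({"dv": dv})

# Serving time: restore the very same fitted object and only call transform().
restored_dv = pickle.loads(blob)["dv"]
request = {"customer_type": "loyal", "class": "eco plus", "flight_distance": 800}
X_req = restored_dv.transform([request])

# The column layout is identical to training; the unseen category
# "eco plus" simply activates no column instead of shifting the others.
feature_names = list(restored_dv.feature_names_)
```

Because the service never refits the vectorizer, a request always maps onto the column order the model was trained on, even when it carries a category that did not appear in the training data.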

Modular Code & Reproducible Environments — Code is cleanly separated by responsibility: train.py handles the full training pipeline, service.py defines the production API, and bentofile.yaml governs the build contract. Random seeds are pinned, Python version is locked via .python-version, and all dependencies are frozen in uv.lock.

Feature Preprocessing & Experiment Tracking — A 24-feature dataset (service ratings on a 0–5 scale, flight metadata, and passenger demographics) is preprocessed via column normalization and median imputation, then encoded with DictVectorizer. All training runs are logged to MLflow, enabling metric comparison (AUC) and model selection with full traceability.
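To make the comparison metric concrete: AUC is the probability that a randomly chosen satisfied passenger receives a higher score than a randomly chosen dissatisfied one. The sketch below computes it from that definition and picks the better of two candidates. The labels, scores, and the names roc_auc and candidates are toy values invented for illustration, not results from this project; in practice the number would come from a library routine and be read back from the MLflow run history.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of rows,
    so suitable only for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy validation labels and model scores (hypothetical).
y_val = [1, 0, 1, 0, 1]
candidates = {
    "logreg_baseline": [0.6, 0.5, 0.4, 0.3, 0.7],
    "xgboost": [0.9, 0.2, 0.8, 0.4, 0.7],
}

# One AUC per run, then select the run with the highest value.
aucs = {name: roc_auc(y_val, scores) for name, scores in candidates.items()}
best = max(aucs, key=aucs.get)
```

Since AUC depends only on how examples are ranked, it lets a logistic regression and a gradient-boosted model be compared directly even though their raw scores are calibrated differently.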

Defensive Data Handling — The training pipeline wraps data loading in an explicit try/except FileNotFoundError, type-checks the target variable before binary encoding, and fills missing arrival_delay_in_minutes values with the median, which is less sensitive than the mean to skewed delay distributions.
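The three safeguards can be sketched with the standard library alone. This is an illustration under assumptions rather than the project's code: the function names, the "satisfied" positive label, and the sample delays are hypothetical, and the actual pipeline may well do the same thing with pandas.

```python
import csv
import statistics


def load_rows(path):
    """Read a CSV into a list of dicts, exiting with a clear message if it is missing."""
    try:
        with open(path, newline="") as f:
            return list(csv.DictReader(f))
    except FileNotFoundError:
        raise SystemExit(f"Training data not found at {path!r}")


def encode_target(values, positive_label="satisfied"):
    """Map string labels to 0/1; values that are already numeric are only cast to int."""
    if all(isinstance(v, str) for v in values):
        return [int(v.strip().lower() == positive_label) for v in values]
    return [int(v) for v in values]


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical delays in minutes: one extreme value, one missing value.
delays = [0.0, 5.0, None, 12.0, 600.0]
filled = impute_median(delays)

# The single 600-minute delay drags the mean far from a typical flight,
# while the median stays close to the bulk of the data.
observed = [d for d in delays if d is not None]
mean_fill = statistics.mean(observed)
median_fill = statistics.median(observed)
```

The type check makes encode_target safe to run twice: a target column that has already been converted to 0/1 passes through unchanged instead of being mapped to all zeros by a string comparison.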

Repository

github.com/sargent-mg/airline-passenger-satisfaction-ml