Churn Prevention via Survival Analysis & Next-Best-Action
Most churn prediction systems flag which customers might leave but say nothing about when they will leave or what to do about it. This system addresses both gaps: it treats churn as a timing problem rather than a yes/no question, estimates when each customer is likely to leave, and automatically recommends a retention action for each one. It runs as a daily batch job over the full customer base and as a FastAPI service for on-demand individual predictions, so retention teams can act on current scores.
Problem statement
Classification answers whether a customer will churn within a fixed window. Survival analysis answers when churn becomes likely for each individual. Knowing the timing enables proactive outreach, better-timed incentives, and better budget allocation.
Solution overview
The product converts a telecom churn dataset into survival format, explores segment-level retention with Kaplan-Meier curves, trains two models, and operationalizes the best model for both batch campaigns and real-time calls. A decision matrix maps risk and value to the next best action so teams know exactly what to do and when.
Data & survival transformation
Source: IBM Telco Customer Churn. Tenure in months is the time variable. The event flag is 1 for churned customers and 0 for right-censored ones (still active at last observation). Categorical fields are one-hot encoded, and a compact autopay flag is derived from the payment method. Senior citizen status, contract type, internet service type, support add-ons, and billing fields are retained.
This structure keeps the information carried by active customers and avoids the bias a plain classifier introduces when it discards censored rows or treats them as customers who never churn.
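The transformation described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the column names (`tenure`, `Churn`, `PaymentMethod`) and the payment-method labels are assumed from the public IBM Telco dataset, and `to_survival_row` is a hypothetical helper.

```python
# Payment methods treated as autopay (labels assumed from the public dataset).
AUTOPAY_METHODS = {"Bank transfer (automatic)", "Credit card (automatic)"}


def to_survival_row(row):
    """Map one raw Telco record to survival format plus a derived autopay flag."""
    return {
        "duration": int(row["tenure"]),                # months observed so far
        "event": 1 if row["Churn"] == "Yes" else 0,    # 0 means right-censored
        "autopay": 1 if row["PaymentMethod"] in AUTOPAY_METHODS else 0,
    }


# Two made-up records: one churner, one customer who is still active.
raw = [
    {"tenure": "1", "Churn": "Yes", "PaymentMethod": "Electronic check"},
    {"tenure": "34", "Churn": "No", "PaymentMethod": "Bank transfer (automatic)"},
]
survival_rows = [to_survival_row(r) for r in raw]
```

The active customer is kept with `event = 0` rather than dropped, which is what lets the models learn from the 34 months they have already survived.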
Exploratory survival analysis
Kaplan-Meier curves show how retention changes with time. Segment curves reveal where risk concentrates:
- Contract: month-to-month customers drop fast; one- and two-year contracts stay longer.
- Internet service: fiber has higher churn than DSL, both higher than no internet.
- Autopay: non-autopay customers churn more than autopay users.
- Senior status: seniors churn faster than non-seniors.
Log-rank tests confirm that the differences between these curves are statistically significant. These insights shape the features and the action plan.
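To make the curves concrete, here is a small pure-Python sketch of the Kaplan-Meier estimator. It is an illustration only (a project like this would more likely rely on a survival library), and the five-customer sample is made up.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct churn time."""
    churned_at = Counter(t for t, e in zip(durations, events) if e == 1)
    leaving_at = Counter(durations)  # churned or censored, both leave the risk set
    at_risk = len(durations)
    survival, curve = 1.0, []
    for t in sorted(leaving_at):
        d = churned_at.get(t, 0)
        if d:
            # Multiply by the share of at-risk customers who survive time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving_at[t]
    return curve


# Tenure in months and event flags (1 = churned, 0 = censored) for a toy segment.
durations = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)  # survival is about 0.8, 0.6, 0.3 at months 1, 2, 3
```

Running the same function once per segment (for example, per contract type) gives the segment curves compared in the list above.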
Modeling approach
Cox Proportional Hazards
The Cox model estimates how each feature scales the hazard of churn. Hazard ratios give clear levers: a month-to-month contract raises risk relative to a two-year contract, non-autopay raises risk relative to autopay, and fiber raises risk relative to DSL.
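For readers who want the formula behind the hazard ratios, the standard Cox model can be written as follows. The symbols are the usual textbook notation rather than anything defined in this write-up: h_0(t) is the baseline hazard, x the feature vector, and beta the fitted coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\mathrm{HR}_j
  = \frac{h(t \mid x_j = 1,\, x_{-j})}{h(t \mid x_j = 0,\, x_{-j})}
  = \exp(\beta_j)
```

A hazard ratio above 1 for a binary feature means that customers with the feature churn at a higher rate at every point in time, holding the other features fixed; a ratio below 1 means the feature is protective.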
Gradient Boosting Survival
A gradient-boosted survival model captures non-linearities and interactions while optimizing a Cox-style objective. It usually ranks risk better than Cox, while Cox remains easier to explain.
Evaluation protocol
Concordance index
The C-index measures how well the model orders customers by time to churn: 0.5 corresponds to random ordering and 1.0 to perfect ordering, so values well above 0.5 show useful discrimination. The boosted model typically edges out Cox.
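The ordering idea can be shown with a simplified version of Harrell's concordance index. This sketch ignores tied durations and uses made-up scores; it is meant to show what the metric counts, not to replace a library implementation.

```python
def concordance_index(durations, events, risk_scores):
    """Share of comparable pairs in which the higher-risk customer churned first."""
    concordant, comparable = 0.0, 0
    n = len(durations)
    for i in range(n):
        if events[i] != 1:
            continue  # a pair is only comparable if the earlier exit is an observed churn
        for j in range(n):
            if durations[i] < durations[j]:  # i churned before j was last seen
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable


durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]  # higher means the model expects earlier churn
c_index = concordance_index(durations, events, risk)  # 4 of 5 comparable pairs are ordered correctly
```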
Time-dependent Brier score
The time-dependent Brier score evaluates the accuracy of predicted survival probabilities at chosen horizons, using inverse-probability-of-censoring weights to account for censored customers. Lower is better. Curves across 6 to 24 months help judge calibration.
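A rough sketch of the censoring-weighted Brier score at a single horizon is shown below. It assumes the common inverse-probability-of-censoring formulation, estimates the censoring distribution with a simple Kaplan-Meier product, and uses made-up predictions. The horizon should lie inside the observed follow-up range, otherwise the weights can become undefined.

```python
def censoring_survival(durations, events, t, strict=False):
    """Kaplan-Meier estimate of G(t) = P(censoring time > t); strict=True gives G just before t."""
    g = 1.0
    for u in sorted(set(durations)):
        if u > t or (strict and u == t):
            break
        at_risk = sum(1 for d in durations if d >= u)
        censored = sum(1 for d, e in zip(durations, events) if d == u and e == 0)
        g *= 1.0 - censored / at_risk
    return g


def brier_score(durations, events, surv_at_h, horizon):
    """Weighted squared error of predicted survival probabilities at one horizon."""
    g_h = censoring_survival(durations, events, horizon)
    total = 0.0
    for d, e, s in zip(durations, events, surv_at_h):
        if d <= horizon and e == 1:
            # Churned by the horizon: the true survival status is 0.
            total += s ** 2 / censoring_survival(durations, events, d, strict=True)
        elif d > horizon:
            # Still active at the horizon: the true survival status is 1.
            total += (1.0 - s) ** 2 / g_h
        # Customers censored before the horizon enter only through the weights.
    return total / len(durations)


durations = [2, 3, 5, 8, 10]
events = [1, 0, 1, 1, 1]
predicted_surv_6m = [0.2, 0.5, 0.4, 0.9, 0.8]  # hypothetical model outputs for S(6)
bs_6m = brier_score(durations, events, predicted_surv_6m, horizon=6)
```

Repeating the calculation over a grid of horizons produces the curve used to compare models.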
Lift at top K%
If we target the top decile by risk, how many more churners do we capture compared to random targeting? Strong lift means efficient spend.
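As a sketch, lift can be computed by ranking customers on risk and comparing the churn rate in the targeted slice with the overall rate. The function name and the sample data are illustrative, and `churned` is assumed to mean "churned within the evaluation horizon".

```python
def lift_at_top_k(risk_scores, churned, k=0.10):
    """Churn rate in the top-k fraction by risk divided by the overall churn rate."""
    n_top = max(1, int(len(risk_scores) * k))
    ranked = sorted(zip(risk_scores, churned), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(flag for _, flag in ranked[:n_top]) / n_top
    base_rate = sum(churned) / len(churned)
    return top_rate / base_rate


risk = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
churned = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
lift = lift_at_top_k(risk, churned, k=0.20)
```

In this toy sample the top 20% contains only churners while the base rate is 30%, so targeting by score captures about 3.3 times as many churners per contact as random targeting.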
Results
- Kaplan-Meier confirmed contract length, autopay, internet type, and senior status as strong drivers.
- Cox delivered clear hazard ratios that match the survival curves, which makes it the model for explaining drivers to stakeholders.
- Gradient boosting improved rank ordering and reduced the Brier score at common horizons, which makes it the production scorer.
- Lift improved in the top decile, so campaigns cost less for the same save rate.
| Aspect | CoxPH | Boosted Survival |
|---|---|---|
| Interpretability | High — hazard ratios | Lower — feature importance |
| Discrimination | Good | Very good |
| Calibration | Good | Good to very good |
| When to use | Explain drivers & set policy | Score customers for campaigns |
Next-Best-Action strategy
Prediction has value only when it leads to action. The system segments by churn risk and customer value, then maps each cell to a concrete play: high-value + high-risk gets a high-touch offer and early renewal path; medium risk gets automated education or usage nudges; low risk is monitored to avoid wasted spend.
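The decision matrix could be expressed as a small rule function along these lines. Every threshold here is a placeholder assumption, the action labels are hypothetical names, and the high-risk, lower-value cell is filled in with an assumed play because the description above does not specify one.

```python
def next_best_action(churn_prob_12m, monthly_value,
                     high_risk=0.5, medium_risk=0.2, high_value=70.0):
    """Map a (risk, value) pair to a retention play. Thresholds are placeholders."""
    if churn_prob_12m >= high_risk:
        if monthly_value >= high_value:
            return "high_touch_offer_and_early_renewal"
        return "standard_retention_offer"  # assumed cell, not defined in the text
    if churn_prob_12m >= medium_risk:
        return "automated_education_or_usage_nudge"
    return "monitor_only"


actions = [
    next_best_action(0.65, 95.0),  # high risk, high value
    next_best_action(0.30, 95.0),  # medium risk
    next_best_action(0.05, 40.0),  # low risk
]
```

Keeping the thresholds as parameters makes it straightforward to tune them per region or plan type later.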
Deployment paths
Batch scoring
A scheduled job loads the latest snapshot, applies the preprocessing pipeline and trained model, computes survival probabilities at chosen horizons, and exports a scored file with a recommended next action for each customer. The output feeds marketing tools and CRM lists.
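The shape of such a batch job might look like the sketch below. The model is replaced by a hypothetical constant-hazard stand-in (`predict_survival`), the hazards, horizons, customer IDs, and action rule are invented for illustration, and an in-memory buffer stands in for the exported file.

```python
import csv
import io
import math

HORIZONS = (6, 12)  # months; assumed horizons


def predict_survival(row, horizon):
    """Hypothetical stand-in for the trained model: a constant-hazard survival curve."""
    hazard = 0.08 if row["Contract"] == "Month-to-month" else 0.01
    return math.exp(-hazard * horizon)


def recommend(churn_prob):
    """Toy rule; a real job would apply the full risk-by-value decision matrix."""
    return "retention_offer" if churn_prob >= 0.5 else "monitor"


def score_snapshot(rows):
    """Attach survival probabilities at each horizon and a recommended action."""
    scored = []
    for row in rows:
        out = {"customerID": row["customerID"]}
        for h in HORIZONS:
            out[f"surv_{h}m"] = round(predict_survival(row, h), 4)
        out["action"] = recommend(1.0 - out["surv_12m"])
        scored.append(out)
    return scored


snapshot = [
    {"customerID": "A-001", "Contract": "Month-to-month"},
    {"customerID": "B-002", "Contract": "Two year"},
]
scored = score_snapshot(snapshot)

buffer = io.StringIO()  # stands in for the exported CSV file
writer = csv.DictWriter(buffer, fieldnames=list(scored[0]))
writer.writeheader()
writer.writerows(scored)
```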
Real-time API
A FastAPI service loads the serialized pipeline and model on startup, exposes a POST endpoint, validates payloads with Pydantic, runs the pipeline, and returns a risk score with survival probabilities. It can be queried by agent desktops or web apps.
Engineering details
- Preprocessing: ColumnTransformer for numeric scaling and one-hot encoding; same transformer reused in training and inference.
- Validation: train-test split stratified by churn status, with optional cross-validation for tuning the boosted model.
- Serialization: joblib for the full pipeline + model bundle to keep inference consistent.
- Monitoring: track C-index drift and Brier at 6 and 12 months, track campaign lift by decile, audit calibration with reliability plots.
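The preprocessing, validation, and serialization points can be illustrated together, assuming `pandas`, `scikit-learn`, and `joblib` are available. The eight rows are made up, the column names are assumed from the public dataset, and only the preprocessor is serialized here; the project bundles it with the trained model.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with assumed column names; "event" is the churn flag.
df = pd.DataFrame({
    "MonthlyCharges": [20.0, 35.5, 50.0, 65.0, 70.5, 80.0, 95.0, 105.5],
    "TotalCharges": [240.0, 100.0, 1500.0, 65.0, 900.0, 160.0, 4000.0, 300.0],
    "Contract": ["Two year", "Month-to-month", "One year", "Month-to-month",
                 "One year", "Month-to-month", "Two year", "Month-to-month"],
    "InternetService": ["No", "DSL", "DSL", "Fiber optic",
                        "DSL", "Fiber optic", "Fiber optic", "Fiber optic"],
    "event": [0, 1, 0, 1, 0, 1, 0, 1],
})
numeric_cols = ["MonthlyCharges", "TotalCharges"]
categorical_cols = ["Contract", "InternetService"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Stratify on the event flag so both splits keep the same churn rate.
train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=df["event"], random_state=0
)
preprocess.fit(train_df)

# Round-trip through joblib to mimic what the inference service would load.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocess.joblib")
    joblib.dump(preprocess, path)
    reloaded = joblib.load(path)

X_test = preprocess.transform(test_df)
X_test_reloaded = reloaded.transform(test_df)
```

Because the fitted transformer is saved and reloaded as one object, training and inference apply identical scaling and category encodings.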
Limits & future work
- Static features only — add time-varying covariates like monthly usage and ticket counts.
- No direct uplift modeling yet — introduce controlled experiments and train meta-learners.
- Single event focus — explore competing risks to separate price churn from service churn.
- Localization — tune actions and thresholds by region and plan type.
FAQ
Why survival analysis instead of a classifier?
It uses the censored information from active customers and gives time-based probabilities. That supports proactive timing, which a plain classifier cannot do well.
Which model should I trust day-to-day?
Use the boosted survival model for scoring and Cox for explanations. They complement each other.
How do I act on the scores?
Use the decision matrix. Pick actions that match both risk and value. Schedule contact before the customer's risk spike.
What does a good result look like?
Higher lift in the top decile and well-calibrated probabilities at the horizons that matter to your business — fewer lost customers for the same budget.