ML · Imbalanced Classification
Credit Card Fraud Detection.
An end-to-end fraud pipeline on a 284k-transaction dataset with a 0.17 per cent fraud rate, where the threshold is the real work.
Challenge
The dataset has 284,807 transactions and a 0.17 per cent fraud rate. A naive model scores 99.8 per cent accuracy by predicting that nothing is ever fraud, so accuracy is a trap and the whole problem lives in how you handle the imbalance.
Approach
A pipeline that takes class imbalance seriously: SMOTE oversampling on the training fold only, a head-to-head comparison of XGBoost, Random Forest and Logistic Regression, and precision-recall threshold tuning rather than chasing accuracy.
Outcome
A model evaluated on the metric that actually matters for fraud, the precision-recall trade-off at a usable operating threshold, with the full analysis open for review on Hugging Face and GitHub.
Key decisions
- SMOTE applied inside the training fold only, avoiding the leakage that inflates naive imbalanced-data results.
- Three models compared head to head: XGBoost, Random Forest and Logistic Regression.
- Operating point chosen on the precision-recall curve, not on accuracy, because accuracy is meaningless at a 0.17 per cent base rate.
- Reproducible notebook plus an interactive demo so the trade-offs are inspectable, not just claimed.