Analysis Walkthrough: Supervised Classification with Bank Churn Data
Published:
This post provides a walkthrough demonstrating how to use the sklearn
package in Python to tune and evaluate multiple supervised classification methods, such as logistic regression and extreme gradient boosting (XGBoost) to predict whether bank customers will close their account. The dataset comes from a past Kaggle competition and contains several variables, including credit score, gender, and age.
Code to produce this blog post can be found in this GitHub repository.
Data description
Data for this analysis comes from a previous Kaggle playground competition titled “Binary Classification with a Bank Churn Dataset”.
Bank churn, which is also known as customer attrition, is when customers end their relationship with the bank (close their accounts). Predicting churn is essential to allow the bank to take action to retain customers. The cost of acquiring a new customer is almost always higher than retaining an existing customer1.
Analysis Goal: Predict the probability that a customer Exited
(probability of churn).
Train and test datsets are provided by Kaggle, and we want to minimize the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, which is also known as the “ROC AUC score”.
For this analysis, I’ll walk through the following steps:
- Exploratory data analysis
- Model building (tuning and evaluation)
- Prediction on new data
Exploratory data analysis (EDA)
For this analysis, we will be looking exclusively at the training dataset provided by Kaggle until we make our final predictions on the testing data. This mimics a real-life scenario where the future “new” data a model will be used on is not available until after model training, tuning, and selection.
First, we import necessary functions, load in the data, and evaluate the structure of the training dataset:
# Import necessary libraries/functions
# Data wrangling and computation
import pandas as pd
import numpy as np
# Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn import model_selection
import sklearn.metrics as metrics
# Plotting
import matplotlib.pyplot as plt
from plotnine import ggplot, aes, facet_wrap, geom_histogram, labs
import seaborn as sns
# Load in train and test sets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Look at the first few rows of the training data
train.head()
id | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 15674932 | Okwudilichukwu | 668 | France | Male | 33.0 | 3 | 0.00 | 2 | 1.0 | 0.0 | 181449.97 | 0 |
1 | 1 | 15749177 | Okwudiliolisa | 627 | France | Male | 33.0 | 1 | 0.00 | 2 | 1.0 | 1.0 | 49503.50 | 0 |
2 | 2 | 15694510 | Hsueh | 678 | France | Male | 40.0 | 10 | 0.00 | 2 | 1.0 | 0.0 | 184866.69 | 0 |
3 | 3 | 15741417 | Kao | 581 | France | Male | 34.0 | 2 | 148882.54 | 1 | 1.0 | 1.0 | 84560.88 | 0 |
4 | 4 | 15766172 | Chiemenam | 716 | Spain | Male | 33.0 | 5 | 0.00 | 2 | 1.0 | 1.0 | 15068.83 | 0 |
# Check column types, see if any rows have null values
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165034 entries, 0 to 165033
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 165034 non-null int64
1 CustomerId 165034 non-null int64
2 Surname 165034 non-null object
3 CreditScore 165034 non-null int64
4 Geography 165034 non-null object
5 Gender 165034 non-null object
6 Age 165034 non-null float64
7 Tenure 165034 non-null int64
8 Balance 165034 non-null float64
9 NumOfProducts 165034 non-null int64
10 HasCrCard 165034 non-null float64
11 IsActiveMember 165034 non-null float64
12 EstimatedSalary 165034 non-null float64
13 Exited 165034 non-null int64
dtypes: float64(5), int64(6), object(3)
memory usage: 17.6+ MB
# Check how many missing values are in the training data
train.apply(lambda x: x.isna().sum())
id 0
CustomerId 0
Surname 0
CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64
There are no null or missing values. We also have the following predictor variables (features):
Categorical
Geography
: customer’s country of residence (France, Spain, or Germany)Gender
: customer’s gender (Male or Female)
Quantitative
CreditScore
: customer’s credit score (numerical score)Age
: customer’s age (years)Tenure
: number of years a customer has had an account with the bankBalance
: customer’s account balanceNumOfProducts
: number of bank products used by the customer (e.g., savings account, credit card)EstimatedSalary
: customer’s estimated salary
Logical
HasCrCard
: whether the customer has a credit card (1 = yes, 0 = no)IsActiveMember
: whether the customer is an active member (1 = yes, 0 = no)
Miscellaneous (not useful)
id
: Row numberCustomerID
: unique identifier for each customerSurname
: customer’s last name
And the target (response) variable we want to predict: Exited
.
# Basic summary statistics for each column in train
train.drop(columns=['id', 'CustomerId', 'Surname']).describe()
CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|
count | 165034.000000 | 165034.000000 | 165034.000000 | 165034.000000 | 165034.000000 | 165034.000000 | 165034.000000 | 165034.000000 | 165034.000000 |
mean | 656.454373 | 38.125888 | 5.020353 | 55478.086689 | 1.554455 | 0.753954 | 0.497770 | 112574.822734 | 0.211599 |
std | 80.103340 | 8.867205 | 2.806159 | 62817.663278 | 0.547154 | 0.430707 | 0.499997 | 50292.865585 | 0.408443 |
min | 350.000000 | 18.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 11.580000 | 0.000000 |
25% | 597.000000 | 32.000000 | 3.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 74637.570000 | 0.000000 |
50% | 659.000000 | 37.000000 | 5.000000 | 0.000000 | 2.000000 | 1.000000 | 0.000000 | 117948.000000 | 0.000000 |
75% | 710.000000 | 42.000000 | 7.000000 | 119939.517500 | 2.000000 | 1.000000 | 1.000000 | 155152.467500 | 0.000000 |
max | 850.000000 | 92.000000 | 10.000000 | 250898.090000 | 4.000000 | 1.000000 | 1.000000 | 199992.480000 | 1.000000 |
From the summary statistics, we see that overall churn is about 21% (about 21% of people exited).
To get an idea of the differences in values of each variable for those who closed their accounts and for those who did not, we can group by Exited
and compute the means and standard deviations across various categories:
# Look at means and standard deviations of different variables for the people that closed their accounts and for those who did not
train.drop(columns = ['id', 'CustomerId', 'Surname', 'Geography', 'Gender']).groupby('Exited').agg(['mean', 'std']).round(2)
CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | |
Exited | ||||||||||||||||
0 | 657.59 | 79.79 | 36.56 | 8.15 | 5.05 | 2.80 | 51255.81 | 62189.98 | 1.62 | 0.49 | 0.76 | 0.43 | 0.55 | 0.50 | 112084.29 | 50214.66 |
1 | 652.22 | 81.14 | 43.96 | 9.00 | 4.91 | 2.83 | 71209.98 | 62646.69 | 1.33 | 0.66 | 0.74 | 0.44 | 0.29 | 0.46 | 114402.50 | 50542.03 |
In general, most of the variables have similar values between customers with Exited
= 1 and customers with Exited
= 0. However, the average age is slightly higher for those with Exited
= 1, and IsActiveMember
is higher for customers who did not close their accounts (55% for Exited
= 0) than for those who did (29% for Exited
= 1), indicating that people may be more likely to close their account if they are not an active member.
To further explore this data, let’s create some histograms of each variable with a different color for the customers with Exited
= 1 and Exited
= 0.
# Make a df for plotting, change "Exited" to a "category" type to assist with plotting
train_plot = train.astype({'Exited': 'category'})
# Plot quantitative variables
(
# Reshape the data to facilitate plotting with ggplot
pd.melt(train_plot[['CreditScore', 'Age' ,'Balance', 'EstimatedSalary', 'Exited']], id_vars = 'Exited')
>> ggplot() +
geom_histogram(aes('value', fill = 'Exited')) +
facet_wrap('variable', scales = 'free') +
labs(x = 'Category', y = 'Count', title = 'Histograms of Quantitative Variables')
)
Let’s make a plot of the correlation matrix for the 6 continuous quantitative predictor variables:
# Plot categorical variables and quantitative variables with few categories (NumOfProducts, Tenure)
(
pd.melt(train_plot[['Gender', 'Geography', 'HasCrCard', 'IsActiveMember', 'NumOfProducts', 'Exited', 'Tenure']], id_vars = 'Exited')
>> ggplot() +
geom_histogram(aes('value', fill = 'Exited'), binwidth = .5) +
facet_wrap('variable', scales = 'free') +
labs(x = 'Value', y = 'Count', title = 'Histograms of Categorical Variables', subtitle = '(and Quantitative Variables with Few Categories)')
)
In general, the shapes of the distributions of the variables is similar. However, there is a noticeable difference in Age
, where the majority of customers with Exited
= 0 tend to be younger, while the majority of customers with Exited
= 1 tend to be older, which is similar to what was observed based on the summary statistics.
Now, let’s look at correlations between the quantitative predictors to determine if any substantial multicollinearity is present:
sns.heatmap(train.drop(columns = ['id', 'CustomerId', 'Surname', 'Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'Exited']).corr(), annot = True)
plt.title('Correlation Plot')
plt.show()
Most of the predictors are uncorrelated with one another, but there appears to be a weak linear relationship between Balance
and NumOfProducts
.
Feature engineering
A common step in many machine learning projects is to “engineer” new features. The goal is to create new predictor columns that will provide better model predictions of the target variable, Exited
. Often, new features are created using domain expertise.
A naive approach to feature engineering is to simply augment the feature (predictor) matrix with polynomials of the quantitative variables. For simplicity, I’ll only add the following:
- a squared
EstimatedSalary
column, denotedEstimatedSalary2
- a squared
CreditScore
column, denotedCreditScore2
- a squared
Balance
column, denotedBalance2
train['EstimatedSalary2'] = train['EstimatedSalary']**2
train['CreditScore2'] = train['CreditScore']**2
train['Balance2'] = train['Balance']**2
# The new variables are now part of the training data
train.head()
id | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | EstimatedSalary2 | CreditScore2 | Balance2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 15674932 | Okwudilichukwu | 668 | France | Male | 33.0 | 3 | 0.00 | 2 | 1.0 | 0.0 | 181449.97 | 0 | 3.292409e+10 | 446224 | 0.000000e+00 |
1 | 1 | 15749177 | Okwudiliolisa | 627 | France | Male | 33.0 | 1 | 0.00 | 2 | 1.0 | 1.0 | 49503.50 | 0 | 2.450597e+09 | 393129 | 0.000000e+00 |
2 | 2 | 15694510 | Hsueh | 678 | France | Male | 40.0 | 10 | 0.00 | 2 | 1.0 | 0.0 | 184866.69 | 0 | 3.417569e+10 | 459684 | 0.000000e+00 |
3 | 3 | 15741417 | Kao | 581 | France | Male | 34.0 | 2 | 148882.54 | 1 | 1.0 | 1.0 | 84560.88 | 0 | 7.150542e+09 | 337561 | 2.216601e+10 |
4 | 4 | 15766172 | Chiemenam | 716 | Spain | Male | 33.0 | 5 | 0.00 | 2 | 1.0 | 1.0 | 15068.83 | 0 | 2.270696e+08 | 512656 | 0.000000e+00 |
Model building, tuning, and selection
We will evaluate the performance of 6 different classification methods on predicting Exited
in the training data:
- logistic regression (LR)
- k-nearest neighbors (KNN)
- support vector machine (SVM)
- random forest (RF)
- extreme gradient boosting (XGBoost)
- CatBoost (CB)
Below, each method will be tuned for optimal performance, and its performance will be evaluated in terms of ROC AUC. Then, the overall performance of each model will be compared, and the best model will be selected for final predictions on the testing data.
Set up the training data
Perform the following:
Create the predictor X_train
matrix:
- apply one hot encoding to the categorical variables
- for SVM, center and scale the numerical predictors
Create the target y_train
vector.
# Import additional functions for preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
# Creating the X_train matrix:
# Remove columns that are not predictors
X_train = train.drop(columns=['id', 'CustomerId','Surname', 'Exited'])
# Separate out the categorical columns for one hot encoding
X_train_encoding = X_train[['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']]
# Create a matrix of the one hot encoded categorical columns
encoder = OneHotEncoder(sparse_output = False)
X_train_encoded = encoder.fit_transform(X_train_encoding)
# Check the names of the columns that were created via one hot encoding
encoder.get_feature_names_out()
# The encoded matrix is currently a numpy array. Change this to a data frame
X_train_encoded = pd.DataFrame(X_train_encoded, columns = encoder.get_feature_names_out())
# Combine the numerical columns with the categorical columns to create a single X_train data frame
X_train = pd.concat([X_train.drop(columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']).reset_index(drop = True), X_train_encoded], axis = 1)
# Make sure everything looks correct
X_train.head()
CreditScore | Age | Tenure | Balance | NumOfProducts | EstimatedSalary | EstimatedSalary2 | CreditScore2 | Balance2 | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male | HasCrCard_0.0 | HasCrCard_1.0 | IsActiveMember_0.0 | IsActiveMember_1.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 668 | 33.0 | 3 | 0.00 | 2 | 181449.97 | 3.292409e+10 | 446224 | 0.000000e+00 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
1 | 627 | 33.0 | 1 | 0.00 | 2 | 49503.50 | 2.450597e+09 | 393129 | 0.000000e+00 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
2 | 678 | 40.0 | 10 | 0.00 | 2 | 184866.69 | 3.417569e+10 | 459684 | 0.000000e+00 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
3 | 581 | 34.0 | 2 | 148882.54 | 1 | 84560.88 | 7.150542e+09 | 337561 | 2.216601e+10 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
4 | 716 | 33.0 | 5 | 0.00 | 2 | 15068.83 | 2.270696e+08 | 512656 | 0.000000e+00 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
y_train = train['Exited']
Logistic Regression (LR)
X_train.head()
CreditScore | Age | Tenure | Balance | NumOfProducts | EstimatedSalary | EstimatedSalary2 | CreditScore2 | Balance2 | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male | HasCrCard_0.0 | HasCrCard_1.0 | IsActiveMember_0.0 | IsActiveMember_1.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 668 | 33.0 | 3 | 0.00 | 2 | 181449.97 | 3.292409e+10 | 446224 | 0.000000e+00 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
1 | 627 | 33.0 | 1 | 0.00 | 2 | 49503.50 | 2.450597e+09 | 393129 | 0.000000e+00 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
2 | 678 | 40.0 | 10 | 0.00 | 2 | 184866.69 | 3.417569e+10 | 459684 | 0.000000e+00 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
3 | 581 | 34.0 | 2 | 148882.54 | 1 | 84560.88 | 7.150542e+09 | 337561 | 2.216601e+10 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
4 | 716 | 33.0 | 5 | 0.00 | 2 | 15068.83 | 2.270696e+08 | 512656 | 0.000000e+00 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
# Initialize the logistic regression model (L1 regularization)
log_reg = LogisticRegression(solver = 'liblinear', penalty = 'l1')
# Set up 10-fold cross-validation so that we use the same 10 folds to evaluate all models.
kfold = model_selection.KFold(n_splits = 10, shuffle = True, random_state = 24)
# Perform 10-fold CV to evaluate logistic regression model performance
log_reg_auc = model_selection.cross_val_score(log_reg, X_train, y_train, cv = kfold, scoring = 'roc_auc')
# Logistic regression achieves a ROC AUC of about 81.8%
np.mean(log_reg_auc)
np.float64(0.8177866487039773)
K-Nearest Neighbors (KNN)
knn = KNeighborsClassifier()
# Set up a grid of hyperparameters to tune
knn_parameters = {'n_neighbors': [500, 1000, 2500, 5000],
'weights': ['uniform', 'distance']}
# Perform 10-fold cross-validation to obtain ROC AUC scores for each combination of hyperparameters
knn_cv = model_selection.GridSearchCV(knn, knn_parameters, scoring = 'roc_auc', cv = kfold, n_jobs = 4) # use 4 cores in parallel to expedite tuning
knn_cv.fit(X_train, y_train)
# Make a dataframe to display the roc auc score for each combination of hyperparameters (sorted from best to worst)
pd.concat([pd.DataFrame(knn_cv.cv_results_['params']),
pd.DataFrame(knn_cv.cv_results_['mean_test_score'],
columns = ['roc auc'])], axis = 1).sort_values('roc auc', ascending = False).head(5)
# The best KNN model uses n_neighbors = 500, weights = uniform, and achieves a ROC AUC score of 60.1%
n_neighbors | weights | roc auc | |
---|---|---|---|
0 | 500 | uniform | 0.601276 |
2 | 1000 | uniform | 0.600648 |
4 | 2500 | uniform | 0.599931 |
6 | 5000 | uniform | 0.598491 |
7 | 5000 | distance | 0.572966 |
Support Vector Machine Classifier (SVM)
svc = SVC()
svc_parameters = [
{'kernel': ['linear'], 'C': [0.1, 1, 10]},
{'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.1, 1, 10]}
]
svc_cv = model_selection.GridSearchCV(svc, svc_parameters, scoring = 'roc_auc', cv = kfold, n_jobs = 4)
# SVC is sensitive to the scale of the data, so let's first scale our continuous predictors
scaler = StandardScaler()
# Select the continuous features
X_train_continuous = X_train[['CreditScore', 'Age', 'Tenure', 'Balance',
'NumOfProducts', 'EstimatedSalary', 'EstimatedSalary2', 'CreditScore2', 'Balance2']]
# Apply scaling, turn into a dataframe
X_train_continuous_scaled = pd.DataFrame(scaler.fit_transform(X_train_continuous),
columns = ['CreditScore', 'Age', 'Tenure', 'Balance',
'NumOfProducts', 'EstimatedSalary', 'EstimatedSalary2', 'CreditScore2', 'Balance2'])
# Combine all columns together to create the final training data
X_train_svc = pd.concat([X_train_continuous_scaled,
X_train.drop(columns = ['CreditScore', 'Age', 'Tenure', 'Balance',
'NumOfProducts', 'EstimatedSalary', 'EstimatedSalary2', 'CreditScore2', 'Balance2'])], axis = 1)
# Fit the SVC
svc_cv.fit(X_train_svc, y_train)
pd.concat([pd.DataFrame(svc_cv.cv_results_['params']),
pd.DataFrame(svc_cv.cv_results_['mean_test_score'],
columns = ['roc auc'])], axis = 1).sort_values('roc auc', ascending = False).head(5)
# The best SVM model uses C = 0.1, kernel = rbf, gamma = 0.1
# and achieves a ROC AUC score of 82.5%
C | kernel | gamma | roc auc | |
---|---|---|---|---|
3 | 0.1 | rbf | 0.1 | 0.825011 |
4 | 0.1 | rbf | 1.0 | 0.822602 |
0 | 0.1 | linear | NaN | 0.814939 |
1 | 1.0 | linear | NaN | 0.814919 |
2 | 10.0 | linear | NaN | 0.814917 |
Random Forest (RF)
rf = RandomForestClassifier()
rf_parameters = {'n_estimators': [500, 1000], 'max_features': [2, 4, 6], 'max_depth': [6, 9, 12]}
rf_cv = model_selection.GridSearchCV(rf, rf_parameters, scoring = 'roc_auc', cv = kfold, n_jobs = 4)
rf_cv.fit(X_train, y_train)
pd.concat([pd.DataFrame(rf_cv.cv_results_['params']),
pd.DataFrame(rf_cv.cv_results_['mean_test_score'],
columns = ['roc auc'])], axis = 1).sort_values('roc auc', ascending = False).head(5)
# The best RF model uses max_depth = 12, max_features = 6, and n_estimators = 1000
# and achieves a ROC AUC score of 88.8%
max_depth | max_features | n_estimators | roc auc | |
---|---|---|---|---|
17 | 12 | 6 | 1000 | 0.887904 |
16 | 12 | 6 | 500 | 0.887888 |
11 | 9 | 6 | 1000 | 0.887874 |
10 | 9 | 6 | 500 | 0.887870 |
15 | 12 | 4 | 1000 | 0.887769 |
Extreme Gradient Boosting (XGBoost)
xgb_clf = xgb.XGBClassifier()
xgb_parameters = {'learning_rate': [0.05, 0.1, 0.15, 0.3], 'max_depth': [5, 6, 7], 'colsample_bytree': [0.25, 0.5, 1]}
xgb_cv = model_selection.GridSearchCV(xgb_clf, xgb_parameters, scoring = 'roc_auc', cv = kfold, n_jobs = 4)
xgb_cv.fit(X_train, y_train)
pd.concat([pd.DataFrame(xgb_cv.cv_results_['params']),
pd.DataFrame(xgb_cv.cv_results_['mean_test_score'],
columns = ['roc auc'])], axis = 1).sort_values('roc auc', ascending = False).head(5)
# The best XGBoost model uses colsample_bytree = 0.5, learning_rate = 0.15, and max_depth = 5
# and achieves a ROC AUC score of about 89.0%
colsample_bytree | learning_rate | max_depth | roc auc | |
---|---|---|---|---|
18 | 0.5 | 0.15 | 5 | 0.890029 |
15 | 0.5 | 0.10 | 5 | 0.889848 |
16 | 0.5 | 0.10 | 6 | 0.889840 |
30 | 1.0 | 0.15 | 5 | 0.889653 |
27 | 1.0 | 0.10 | 5 | 0.889615 |
CatBoost (CB)
cbc_clf = CatBoostClassifier(cat_features = [1, 2, 6, 7, 8], od_type = "Iter", od_wait = 20)
cbc_parameters = {'learning_rate': [0.1, 0.2, 0.3], 'depth': [4, 6, 8], 'iterations': [50, 100, 150]}
cbc_cv = model_selection.GridSearchCV(cbc_clf, cbc_parameters, scoring = 'roc_auc', cv = kfold, n_jobs = 4)
# CatBoost does not require one hot encoding of categorical variables, so we'll use the original training dataset here,
# changing some columns we'll treat as categories to "string" type to be processed properly by the model
catboost_train = train.drop(columns = ['id', 'CustomerId', 'Surname', 'Exited'])
catboost_train = catboost_train.astype({'NumOfProducts': 'string', 'HasCrCard': 'string', 'IsActiveMember': 'string'})
cbc_cv.fit(catboost_train, y_train)
pd.concat([pd.DataFrame(cbc_cv.cv_results_['params']),
pd.DataFrame(cbc_cv.cv_results_['mean_test_score'],
columns = ['roc auc'])], axis = 1).sort_values('roc auc', ascending = False).head(5)
# The best catboost model uses depth = 6, iterations = 150, and learning_rate = 0.2,
# and achieves a ROC AUC score of just under 89.0%
depth | iterations | learning_rate | roc auc | |
---|---|---|---|---|
16 | 6 | 150 | 0.2 | 0.889753 |
13 | 6 | 100 | 0.2 | 0.889599 |
7 | 4 | 150 | 0.2 | 0.889540 |
24 | 8 | 150 | 0.1 | 0.889519 |
8 | 4 | 150 | 0.3 | 0.889510 |
Model selection
The best performance of each method is given in the following table:
Method | ROC AUC |
---|---|
LR | 81.8% |
KNN | 60.1% |
SVM | 82.2% |
RF | 88.8% |
XGBoost | 89.0% |
CB | 89.0% |
The best methods, by far, are RF, XGBoost, and CB, which had similar performance. However, I will select XGBoost as the final model because it achieved a slightly higher CV ROC AUC score than RF and CB.
One nice feature of XGBoost is the ability to easily visualize the importance of each variable for predicting Exited
. A feature important plot for the final XGBoost model is shown below. Note that
- The most important variables are
CreditScore
,Age
,EstimatedSalary
, andBalance
. - Feature importance decreases greatly after
Balance
. - Being an active member (
IsActiveMember
= 1 or 0) is not as important as previously thought based on initial EDA.
from xgboost import plot_importance
plot_importance(xgb_cv.best_estimator_)
plt.show()
Prediction for the testing data
Now that we’ve selected our best model based on cross-validation on the training data, we can use that model to obtain predictions for the testing set.
# Add engineered features to testing data
test['EstimatedSalary2'] = test['EstimatedSalary']**2
test['CreditScore2'] = test['CreditScore']**2
test['Balance2'] = test['Balance']**2
# Next, we need to one hot encode the categorical variables in the testing data
X_test = test.drop(columns=['id', 'CustomerId','Surname'])
# Separate out the categorical columns for one hot encoding, apply one hot encoding
X_test_encoding = X_test[['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']]
X_test_encoded = encoder.transform(X_test_encoding)
# The encoded matrix is currently a numpy array. Change this to a data frame
X_test_encoded = pd.DataFrame(X_test_encoded, columns = encoder.get_feature_names_out())
# Combine the numerical columns with the categorical columns to create a single X_test data frame
X_test = pd.concat([X_test.drop(columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']).reset_index(drop = True), X_test_encoded], axis = 1)
# Make sure everything looks correct
X_test.head()
CreditScore | Age | Tenure | Balance | NumOfProducts | EstimatedSalary | EstimatedSalary2 | CreditScore2 | Balance2 | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male | HasCrCard_0.0 | HasCrCard_1.0 | IsActiveMember_0.0 | IsActiveMember_1.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 586 | 23.0 | 2 | 0.00 | 2 | 160976.75 | 2.591351e+10 | 343396 | 0.000000e+00 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
1 | 683 | 46.0 | 2 | 0.00 | 1 | 72549.27 | 5.263397e+09 | 466489 | 0.000000e+00 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
2 | 656 | 34.0 | 7 | 0.00 | 2 | 138882.09 | 1.928823e+10 | 430336 | 0.000000e+00 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
3 | 681 | 36.0 | 8 | 0.00 | 1 | 113931.57 | 1.298040e+10 | 463761 | 0.000000e+00 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
4 | 752 | 38.0 | 10 | 121263.62 | 1 | 139431.00 | 1.944100e+10 | 565504 | 1.470487e+10 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
# Obtain predictions
final_predictions = pd.DataFrame(dict(id = test['id'],
Exited = xgb_cv.best_estimator_.predict_proba(X_test)[:, 1]))
# Look at the first few predictions
final_predictions.head()
id | Exited | |
---|---|---|
0 | 165034 | 0.023930 |
1 | 165035 | 0.835616 |
2 | 165036 | 0.028150 |
3 | 165037 | 0.232614 |
4 | 165038 | 0.338327 |
We now have our final predictions stored in the final_predictions
dataframe. Now, we can save those results to a .csv file and submit them to Kaggle to obtain our final score.
# Save predictions to .csv
final_predictions.to_csv('churn_predictions.csv', index = False)
Final Kaggle competition results
After submitting the predictions to Kaggle, a ROC AUC score on the testing data is returned.
According to Kaggle, the final ROC AUC score on the testing data of our model is 0.88864, or about 88.9%, which would have placed us in the top 38.7% of submissions during the competition (1406 place out of 3633 teams).
There are many additional things we could have done to improve model performance, such as:
- performing extensive feature engineering using domain expertise
- increasing grid search ranges for tuning model parameters
- making predictions with an ensemble model that combines output from our top models to produce a single prediction
- considering more complex models, such as neural networks (NN’s)
https://www.forbes.com/councils/forbesbusinesscouncil/2022/12/12/customer-retention-versus-customer-acquisition/ ↩