Analysis Walkthrough: Supervised Classification with Bank Churn Data

This post provides a walkthrough demonstrating how to use the sklearn package in Python to tune and evaluate multiple supervised classification methods, such as logistic regression and extreme gradient boosting (XGBoost), to predict whether bank customers will close their accounts. The dataset comes from a past Kaggle competition and contains several variables, including credit score, gender, and age.

Code to produce this blog post can be found in this GitHub repository.


Data description

Data for this analysis comes from a previous Kaggle playground competition titled “Binary Classification with a Bank Churn Dataset”.

Bank churn, also known as customer attrition, occurs when customers end their relationship with the bank (close their accounts). Predicting churn is essential because it allows the bank to take action to retain customers: the cost of acquiring a new customer is almost always higher than the cost of retaining an existing one [1].

Analysis Goal: Predict the probability that a customer Exited (probability of churn).

Train and test datasets are provided by Kaggle, and we want to maximize the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, which is also known as the “ROC AUC score”.
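As a quick illustration of the metric (with made-up numbers, not from the churn data), the ROC AUC rewards ranking true positives above true negatives, with 1.0 being a perfect ranking and 0.5 no better than chance:

# Toy example of the evaluation metric (hypothetical labels and probabilities)
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

roc_auc_score(y_true, y_prob)  # 8 of the 9 positive/negative pairs are ranked correctly: ~0.889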

For this analysis, I’ll walk through the following steps:

  1. Exploratory data analysis
  2. Model building (tuning and evaluation)
  3. Prediction on new data

Exploratory data analysis (EDA)

For this analysis, we will be looking exclusively at the training dataset provided by Kaggle until we make our final predictions on the testing data. This mimics a real-life scenario where the future “new” data a model will be used on is not available until after model training, tuning, and selection.

First, we import necessary functions, load in the data, and evaluate the structure of the training dataset:

# Import necessary libraries/functions

# Data wrangling and computation
import pandas as pd
import numpy as np
# Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn import model_selection
import sklearn.metrics as metrics
# Plotting
import matplotlib.pyplot as plt
from plotnine import ggplot, aes, facet_wrap, geom_histogram, labs
import seaborn as sns

# Load in train and test sets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Look at the first few rows of the training data
train.head()
   id  CustomerId         Surname  CreditScore Geography Gender   Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0   0    15674932  Okwudilichukwu          668    France   Male  33.0       3       0.00              2        1.0             0.0        181449.97       0
1   1    15749177   Okwudiliolisa          627    France   Male  33.0       1       0.00              2        1.0             1.0         49503.50       0
2   2    15694510           Hsueh          678    France   Male  40.0      10       0.00              2        1.0             0.0        184866.69       0
3   3    15741417             Kao          581    France   Male  34.0       2  148882.54              1        1.0             1.0         84560.88       0
4   4    15766172       Chiemenam          716     Spain   Male  33.0       5       0.00              2        1.0             1.0         15068.83       0
# Check column types, see if any rows have null values
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165034 entries, 0 to 165033
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   id               165034 non-null  int64  
 1   CustomerId       165034 non-null  int64  
 2   Surname          165034 non-null  object 
 3   CreditScore      165034 non-null  int64  
 4   Geography        165034 non-null  object 
 5   Gender           165034 non-null  object 
 6   Age              165034 non-null  float64
 7   Tenure           165034 non-null  int64  
 8   Balance          165034 non-null  float64
 9   NumOfProducts    165034 non-null  int64  
 10  HasCrCard        165034 non-null  float64
 11  IsActiveMember   165034 non-null  float64
 12  EstimatedSalary  165034 non-null  float64
 13  Exited           165034 non-null  int64  
dtypes: float64(5), int64(6), object(3)
memory usage: 17.6+ MB
# Check how many missing values are in the training data
train.apply(lambda x: x.isna().sum())
id                 0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

There are no null or missing values. We also have the following predictor variables (features):

Categorical

  • Geography: customer’s country of residence (France, Spain, or Germany)
  • Gender: customer’s gender (Male or Female)

Quantitative

  • CreditScore: customer’s credit score (numerical score)
  • Age: customer’s age (years)
  • Tenure: number of years a customer has had an account with the bank
  • Balance: customer’s account balance
  • NumOfProducts: number of bank products used by the customer (e.g., savings account, credit card)
  • EstimatedSalary: customer’s estimated salary

Logical

  • HasCrCard: whether the customer has a credit card (1 = yes, 0 = no)
  • IsActiveMember: whether the customer is an active member (1 = yes, 0 = no)

Miscellaneous (not useful)

  • id: Row number
  • CustomerId: unique identifier for each customer
  • Surname: customer’s last name

And the target (response) variable we want to predict: Exited.

# Basic summary statistics for each column in train
train.drop(columns=['id', 'CustomerId', 'Surname']).describe()

         CreditScore            Age         Tenure        Balance  NumOfProducts      HasCrCard  IsActiveMember  EstimatedSalary         Exited
count  165034.000000  165034.000000  165034.000000  165034.000000  165034.000000  165034.000000   165034.000000    165034.000000  165034.000000
mean      656.454373      38.125888       5.020353   55478.086689       1.554455       0.753954        0.497770    112574.822734       0.211599
std        80.103340       8.867205       2.806159   62817.663278       0.547154       0.430707        0.499997     50292.865585       0.408443
min       350.000000      18.000000       0.000000       0.000000       1.000000       0.000000        0.000000        11.580000       0.000000
25%       597.000000      32.000000       3.000000       0.000000       1.000000       1.000000        0.000000     74637.570000       0.000000
50%       659.000000      37.000000       5.000000       0.000000       2.000000       1.000000        0.000000    117948.000000       0.000000
75%       710.000000      42.000000       7.000000  119939.517500       2.000000       1.000000        1.000000    155152.467500       0.000000
max       850.000000      92.000000      10.000000  250898.090000       4.000000       1.000000        1.000000    199992.480000       1.000000

From the summary statistics, we see that the overall churn rate is about 21% (the mean of Exited is roughly 0.212, i.e., about 21% of customers exited).
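Equivalently, since Exited is binary, the churn rate is just its mean:

# Proportion of customers who closed their account
train['Exited'].mean()  # ~0.212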

To get an idea of the differences in values of each variable for those who closed their accounts and for those who did not, we can group by Exited and compute the means and standard deviations across various categories:

# Look at means and standard deviations of different variables for the people that closed their accounts and for those who did not
train.drop(columns = ['id', 'CustomerId', 'Surname', 'Geography', 'Gender']).groupby('Exited').agg(['mean', 'std']).round(2)
       CreditScore           Age        Tenure        Balance           NumOfProducts      HasCrCard      IsActiveMember      EstimatedSalary
              mean    std   mean   std   mean   std       mean       std         mean   std    mean   std         mean   std         mean       std
Exited
0           657.59  79.79  36.56  8.15   5.05  2.80   51255.81  62189.98         1.62  0.49    0.76  0.43         0.55  0.50    112084.29  50214.66
1           652.22  81.14  43.96  9.00   4.91  2.83   71209.98  62646.69         1.33  0.66    0.74  0.44         0.29  0.46    114402.50  50542.03

In general, most of the variables have similar values between customers with Exited = 1 and customers with Exited = 0. However, the average age is slightly higher for those with Exited = 1, and IsActiveMember is higher for customers who did not close their accounts (55% for Exited = 0) than for those who did (29% for Exited = 1), indicating that people may be more likely to close their account if they are not an active member.

To further explore this data, let’s create some histograms of each variable with a different color for the customers with Exited = 1 and Exited = 0.

# Make a df for plotting, change "Exited" to a "category" type to assist with plotting
train_plot = train.astype({'Exited': 'category'})

# Plot quantitative variables

# Reshape the data to facilitate plotting with ggplot
train_plot_quant = pd.melt(train_plot[['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'Exited']], id_vars = 'Exited')

(
    ggplot(train_plot_quant) +
        geom_histogram(aes('value', fill = 'Exited')) +
        facet_wrap('variable', scales = 'free') +
        labs(x = 'Value', y = 'Count', title = 'Histograms of Quantitative Variables')
)

Similarly, let’s plot histograms of the categorical variables, along with the quantitative variables that take only a few distinct values (NumOfProducts and Tenure):

# Plot categorical variables and quantitative variables with few categories (NumOfProducts, Tenure)
train_plot_cat = pd.melt(train_plot[['Gender', 'Geography', 'HasCrCard', 'IsActiveMember', 'NumOfProducts', 'Exited', 'Tenure']], id_vars = 'Exited')

(
    ggplot(train_plot_cat) +
        geom_histogram(aes('value', fill = 'Exited'), binwidth = .5) +
        facet_wrap('variable', scales = 'free') +
        labs(x = 'Value', y = 'Count', title = 'Histograms of Categorical Variables', subtitle = '(and Quantitative Variables with Few Categories)')
)

In general, the shapes of the distributions of the variables are similar between the two groups. However, there is a noticeable difference in Age: customers with Exited = 0 tend to be younger, while customers with Exited = 1 tend to be older, which matches what we observed in the summary statistics.

Now, let’s look at correlations between the quantitative predictors to determine if any substantial multicollinearity is present:

sns.heatmap(train.drop(columns = ['id', 'CustomerId', 'Surname', 'Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'Exited']).corr(), annot = True)
plt.title('Correlation Plot')
plt.show()

Most of the predictors are uncorrelated with one another, but there appears to be a weak linear relationship between Balance and NumOfProducts.
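For a more formal multicollinearity check, we could compute variance inflation factors (VIFs). Here is a sketch using statsmodels, which is not otherwise used in this post:

# Compute VIFs for the six continuous predictors; a constant column is added
# so the VIFs are computed relative to an intercept, as is standard
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

quant = sm.add_constant(train[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']])
pd.Series([variance_inflation_factor(quant.values, i) for i in range(quant.shape[1])],
          index = quant.columns).drop('const')  # values near 1 indicate little multicollinearity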

Feature engineering

A common step in many machine learning projects is to “engineer” new features. The goal is to create new predictor columns that will provide better model predictions of the target variable, Exited. Often, new features are created using domain expertise.

A naive approach to feature engineering is to simply augment the feature (predictor) matrix with polynomials of the quantitative variables. For simplicity, I’ll only add the following:

  • a squared EstimatedSalary column, denoted EstimatedSalary2
  • a squared CreditScore column, denoted CreditScore2
  • a squared Balance column, denoted Balance2

train['EstimatedSalary2'] = train['EstimatedSalary']**2
train['CreditScore2'] = train['CreditScore']**2
train['Balance2'] = train['Balance']**2

# The new variables are now part of the training data
train.head()
   id  CustomerId         Surname  CreditScore Geography Gender   Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0   0    15674932  Okwudilichukwu          668    France   Male  33.0       3       0.00              2        1.0             0.0        181449.97       0
1   1    15749177   Okwudiliolisa          627    France   Male  33.0       1       0.00              2        1.0             1.0         49503.50       0
2   2    15694510           Hsueh          678    France   Male  40.0      10       0.00              2        1.0             0.0        184866.69       0
3   3    15741417             Kao          581    France   Male  34.0       2  148882.54              1        1.0             1.0         84560.88       0
4   4    15766172       Chiemenam          716     Spain   Male  33.0       5       0.00              2        1.0             1.0         15068.83       0

   EstimatedSalary2  CreditScore2      Balance2
0      3.292409e+10        446224  0.000000e+00
1      2.450597e+09        393129  0.000000e+00
2      3.417569e+10        459684  0.000000e+00
3      7.150542e+09        337561  2.216601e+10
4      2.270696e+08        512656  0.000000e+00

Model building, tuning, and selection

We will evaluate the performance of 6 different classification methods on predicting Exited in the training data:

  • logistic regression (LR)
  • k-nearest neighbors (KNN)
  • support vector machine (SVM)
  • random forest (RF)
  • extreme gradient boosting (XGBoost)
  • CatBoost (CB)

Below, each method will be tuned for optimal performance, and its performance will be evaluated in terms of ROC AUC. Then, the overall performance of each model will be compared, and the best model will be selected for final predictions on the testing data.
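Each grid search below displays its cross-validated scores with the same pd.concat pattern. For readability, here is that pattern wrapped in a small helper function; this is a hypothetical convenience (the code below keeps the explicit pattern):

def cv_results_table(grid_search, n = 5):
    # Top-n hyperparameter combinations from a fitted GridSearchCV object,
    # sorted by mean cross-validated ROC AUC (hypothetical helper)
    return (pd.concat([pd.DataFrame(grid_search.cv_results_['params']),
                       pd.DataFrame(grid_search.cv_results_['mean_test_score'],
                                    columns = ['roc auc'])], axis = 1)
            .sort_values('roc auc', ascending = False)
            .head(n))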

Set up the training data

Perform the following:

Create the predictor X_train matrix:

  • apply one hot encoding to the categorical variables
  • for SVM, center and scale the numerical predictors

Create the target y_train vector.

# Import additional functions for preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
# Creating the X_train matrix:

# Remove columns that are not predictors
X_train = train.drop(columns=['id', 'CustomerId','Surname', 'Exited'])

# Separate out the categorical columns for one hot encoding
X_train_encoding = X_train[['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']]

# Create a matrix of the one hot encoded categorical columns
encoder = OneHotEncoder(sparse_output = False)
X_train_encoded = encoder.fit_transform(X_train_encoding)

# Check the names of the columns that were created via one hot encoding
encoder.get_feature_names_out()

# The encoded matrix is currently a numpy array. Change this to a data frame
X_train_encoded = pd.DataFrame(X_train_encoded, columns = encoder.get_feature_names_out())

# Combine the numerical columns with the categorical columns to create a single X_train data frame
X_train = pd.concat([X_train.drop(columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']).reset_index(drop = True), X_train_encoded], axis = 1)

# Make sure everything looks correct
X_train.head()
   CreditScore   Age  Tenure    Balance  NumOfProducts  EstimatedSalary  EstimatedSalary2  CreditScore2      Balance2
0          668  33.0       3       0.00              2        181449.97      3.292409e+10        446224  0.000000e+00
1          627  33.0       1       0.00              2         49503.50      2.450597e+09        393129  0.000000e+00
2          678  40.0      10       0.00              2        184866.69      3.417569e+10        459684  0.000000e+00
3          581  34.0       2  148882.54              1         84560.88      7.150542e+09        337561  2.216601e+10
4          716  33.0       5       0.00              2         15068.83      2.270696e+08        512656  0.000000e+00

   Geography_France  Geography_Germany  Geography_Spain  Gender_Female  Gender_Male  HasCrCard_0.0  HasCrCard_1.0  IsActiveMember_0.0  IsActiveMember_1.0
0               1.0                0.0              0.0            0.0          1.0            0.0            1.0                 1.0                 0.0
1               1.0                0.0              0.0            0.0          1.0            0.0            1.0                 0.0                 1.0
2               1.0                0.0              0.0            0.0          1.0            0.0            1.0                 1.0                 0.0
3               1.0                0.0              0.0            0.0          1.0            0.0            1.0                 0.0                 1.0
4               0.0                0.0              1.0            0.0          1.0            0.0            1.0                 0.0                 1.0
y_train = train['Exited']

Logistic Regression (LR)

# Initialize the logistic regression model (L1 regularization)
log_reg = LogisticRegression(solver = 'liblinear', penalty = 'l1')

# Set up 10-fold cross-validation so that we use the same 10 folds to evaluate all models.
kfold = model_selection.KFold(n_splits = 10, shuffle = True, random_state = 24)

# Perform 10-fold CV to evaluate logistic regression model performance
log_reg_auc = model_selection.cross_val_score(log_reg, X_train, y_train, cv = kfold, scoring = 'roc_auc')

# Logistic regression achieves a ROC AUC of about 81.8%
np.mean(log_reg_auc)
np.float64(0.8177866487039773)
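Because L1 regularization can shrink uninformative coefficients to exactly zero, it can be instructive to fit the model once on the full training data and inspect which features it keeps. A quick sketch (keep in mind the features are unscaled here, so coefficient magnitudes are not directly comparable):

# Fit on all of the training data and view the L1-regularized coefficients;
# features with coefficients of exactly zero were effectively dropped
log_reg.fit(X_train, y_train)
pd.Series(log_reg.coef_[0], index = X_train.columns).sort_values()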

K-Nearest Neighbors (KNN)

knn = KNeighborsClassifier()
# Set up a grid of hyperparameters to tune
knn_parameters = {'n_neighbors': [500, 1000, 2500, 5000],
                  'weights': ['uniform', 'distance']}

# Perform 10-fold cross-validation to obtain ROC AUC scores for each combination of hyperparameters
knn_cv = model_selection.GridSearchCV(knn, knn_parameters, scoring = 'roc_auc', cv = kfold, n_jobs = 4) # use 4 cores in parallel to expedite tuning
knn_cv.fit(X_train, y_train)

# Make a dataframe to display the roc auc score for each combination of hyperparameters (sorted from best to worst)
pd.concat([pd.DataFrame(knn_cv.cv_results_['params']),
           pd.DataFrame(knn_cv.cv_results_['mean_test_score'],
                        columns = ['roc auc'])], axis = 1).sort_values('roc auc', ascending = False).head(5)

# The best KNN model uses n_neighbors = 500, weights = uniform, and achieves a ROC AUC score of 60.1%
   n_neighbors   weights   roc auc
0          500   uniform  0.601276
2         1000   uniform  0.600648
4         2500   uniform  0.599931
6         5000   uniform  0.598491
7         5000  distance  0.572966

Support Vector Machine Classifier (SVM)

svc = SVC()
svc_parameters = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.1, 1, 10]}
]

svc_cv = model_selection.GridSearchCV(svc, svc_parameters, scoring = 'roc_auc', cv = kfold, n_jobs = 4)

# SVC is sensitive to the scale of the data, so let's first scale our continuous predictors
scaler = StandardScaler()

# Select the continuous features
X_train_continuous = X_train[['CreditScore', 'Age', 'Tenure', 'Balance',
                              'NumOfProducts', 'EstimatedSalary', 'EstimatedSalary2', 'CreditScore2', 'Balance2']]

# Apply scaling, turn into a dataframe
X_train_continuous_scaled = pd.DataFrame(scaler.fit_transform(X_train_continuous),
                                         columns = ['CreditScore', 'Age', 'Tenure', 'Balance',
                                                    'NumOfProducts', 'EstimatedSalary', 'EstimatedSalary2', 'CreditScore2', 'Balance2'])

# Combine all columns together to create the final training data
X_train_svc = pd.concat([X_train_continuous_scaled,
                         X_train.drop(columns = ['CreditScore', 'Age', 'Tenure', 'Balance',
                                                 'NumOfProducts', 'EstimatedSalary', 'EstimatedSalary2', 'CreditScore2', 'Balance2'])], axis = 1)

# Fit the SVC
svc_cv.fit(X_train_svc, y_train)
pd.concat([pd.DataFrame(svc_cv.cv_results_['params']),
           pd.DataFrame(svc_cv.cv_results_['mean_test_score'],
                        columns = ['roc auc'])], axis = 1).sort_values('roc auc', ascending = False).head(5)

# The best SVM model uses C = 0.1, kernel = rbf, gamma = 0.1
# and achieves a ROC AUC score of 82.5%
      C  kernel  gamma   roc auc
3   0.1     rbf    0.1  0.825011
4   0.1     rbf    1.0  0.822602
0   0.1  linear    NaN  0.814939
1   1.0  linear    NaN  0.814919
2  10.0  linear    NaN  0.814917

Random Forest (RF)

rf = RandomForestClassifier()
rf_parameters = {'n_estimators': [500, 1000], 'max_features': [2, 4, 6], 'max_depth': [6, 9, 12]}

rf_cv = model_selection.GridSearchCV(rf, rf_parameters, scoring = 'roc_auc', cv = kfold, n_jobs = 4)

rf_cv.fit(X_train, y_train)
pd.concat([pd.DataFrame(rf_cv.cv_results_['params']),
           pd.DataFrame(rf_cv.cv_results_['mean_test_score'],
                        columns = ['roc auc'])], axis = 1).sort_values('roc auc', ascending = False).head(5)

# The best RF model uses max_depth = 12, max_features = 6, and n_estimators = 1000
# and achieves a ROC AUC score of 88.8%
    max_depth  max_features  n_estimators   roc auc
17         12             6          1000  0.887904
16         12             6           500  0.887888
11          9             6          1000  0.887874
10          9             6           500  0.887870
15         12             4          1000  0.887769

Extreme Gradient Boosting (XGBoost)

xgb_clf = xgb.XGBClassifier()
xgb_parameters = {'learning_rate': [0.05, 0.1, 0.15, 0.3], 'max_depth': [5, 6, 7], 'colsample_bytree': [0.25, 0.5, 1]}

xgb_cv = model_selection.GridSearchCV(xgb_clf, xgb_parameters, scoring = 'roc_auc', cv = kfold, n_jobs = 4)

xgb_cv.fit(X_train, y_train)
pd.concat([pd.DataFrame(xgb_cv.cv_results_['params']),
           pd.DataFrame(xgb_cv.cv_results_['mean_test_score'],
                        columns = ['roc auc'])], axis = 1).sort_values('roc auc', ascending = False).head(5)

# The best XGBoost model uses colsample_bytree = 0.5, learning_rate = 0.15, and max_depth = 5
# and achieves a ROC AUC score of about 89.0%
    colsample_bytree  learning_rate  max_depth   roc auc
18               0.5           0.15          5  0.890029
15               0.5           0.10          5  0.889848
16               0.5           0.10          6  0.889840
30               1.0           0.15          5  0.889653
27               1.0           0.10          5  0.889615

CatBoost (CB)

cbc_clf = CatBoostClassifier(cat_features = [1, 2, 6, 7, 8], od_type = "Iter", od_wait = 20)
cbc_parameters = {'learning_rate': [0.1, 0.2, 0.3], 'depth': [4, 6, 8], 'iterations': [50, 100, 150]}

cbc_cv = model_selection.GridSearchCV(cbc_clf, cbc_parameters, scoring = 'roc_auc', cv = kfold, n_jobs = 4)

# CatBoost does not require one hot encoding of categorical variables, so we'll use the original training dataset here,
# changing some columns we'll treat as categories to "string" type to be processed properly by the model
catboost_train = train.drop(columns = ['id', 'CustomerId', 'Surname', 'Exited'])
catboost_train = catboost_train.astype({'NumOfProducts': 'string', 'HasCrCard': 'string', 'IsActiveMember': 'string'})


cbc_cv.fit(catboost_train, y_train)
pd.concat([pd.DataFrame(cbc_cv.cv_results_['params']),
           pd.DataFrame(cbc_cv.cv_results_['mean_test_score'],
                        columns = ['roc auc'])], axis = 1).sort_values('roc auc', ascending = False).head(5)

# The best catboost model uses depth = 6, iterations = 150, and learning_rate = 0.2,
# and achieves a ROC AUC score of just under 89.0%
    depth  iterations  learning_rate   roc auc
16      6         150            0.2  0.889753
13      6         100            0.2  0.889599
7       4         150            0.2  0.889540
24      8         150            0.1  0.889519
8       4         150            0.3  0.889510

Model selection

The best performance of each method is given in the following table:

Method     ROC AUC
LR           81.8%
KNN          60.1%
SVM          82.5%
RF           88.8%
XGBoost      89.0%
CB           89.0%

The best methods, by far, are RF, XGBoost, and CB, which had similar performance. However, I will select XGBoost as the final model because it achieved a slightly higher CV ROC AUC score than RF and CB.

One nice feature of XGBoost is the ability to easily visualize the importance of each variable for predicting Exited. A feature importance plot for the final XGBoost model is shown below. Note that:

  • The most important variables are CreditScore, Age, EstimatedSalary, and Balance.
  • Feature importance decreases greatly after Balance.
  • Being an active member (IsActiveMember = 1 or 0) is not as important as previously thought based on initial EDA.

from xgboost import plot_importance
plot_importance(xgb_cv.best_estimator_)
plt.show()
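If you prefer numbers to a plot, the same information can be extracted as a sorted table. Note that feature_importances_ typically uses gain-based importance while plot_importance defaults to split counts ('weight'), so the two orderings can differ:

# Feature importances of the tuned model as a sorted series
pd.Series(xgb_cv.best_estimator_.feature_importances_,
          index = X_train.columns).sort_values(ascending = False)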

Prediction for the testing data

Now that we’ve selected our best model based on cross-validation on the training data, we can use that model to obtain predictions for the testing set.

# Add engineered features to testing data
test['EstimatedSalary2'] = test['EstimatedSalary']**2
test['CreditScore2'] = test['CreditScore']**2
test['Balance2'] = test['Balance']**2

# Next, we need to one hot encode the categorical variables in the testing data
X_test = test.drop(columns=['id', 'CustomerId','Surname'])

# Separate out the categorical columns for one hot encoding, apply one hot encoding
X_test_encoding = X_test[['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']]
X_test_encoded = encoder.transform(X_test_encoding)

# The encoded matrix is currently a numpy array. Change this to a data frame
X_test_encoded = pd.DataFrame(X_test_encoded, columns = encoder.get_feature_names_out())

# Combine the numerical columns with the categorical columns to create a single X_test data frame
X_test = pd.concat([X_test.drop(columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']).reset_index(drop = True), X_test_encoded], axis = 1)

# Make sure everything looks correct
X_test.head()
   CreditScore   Age  Tenure    Balance  NumOfProducts  EstimatedSalary  EstimatedSalary2  CreditScore2      Balance2
0          586  23.0       2       0.00              2        160976.75      2.591351e+10        343396  0.000000e+00
1          683  46.0       2       0.00              1         72549.27      5.263397e+09        466489  0.000000e+00
2          656  34.0       7       0.00              2        138882.09      1.928823e+10        430336  0.000000e+00
3          681  36.0       8       0.00              1        113931.57      1.298040e+10        463761  0.000000e+00
4          752  38.0      10  121263.62              1        139431.00      1.944100e+10        565504  1.470487e+10

   Geography_France  Geography_Germany  Geography_Spain  Gender_Female  Gender_Male  HasCrCard_0.0  HasCrCard_1.0  IsActiveMember_0.0  IsActiveMember_1.0
0               1.0                0.0              0.0            1.0          0.0            1.0            0.0                 0.0                 1.0
1               1.0                0.0              0.0            1.0          0.0            0.0            1.0                 1.0                 0.0
2               1.0                0.0              0.0            1.0          0.0            0.0            1.0                 1.0                 0.0
3               1.0                0.0              0.0            0.0          1.0            0.0            1.0                 1.0                 0.0
4               0.0                1.0              0.0            0.0          1.0            0.0            1.0                 1.0                 0.0
# Obtain predictions
final_predictions = pd.DataFrame(dict(id = test['id'],
                  Exited = xgb_cv.best_estimator_.predict_proba(X_test)[:, 1]))

# Look at the first few predictions
final_predictions.head()
       id    Exited
0  165034  0.023930
1  165035  0.835616
2  165036  0.028150
3  165037  0.232614
4  165038  0.338327

We now have our final predictions stored in the final_predictions dataframe. We can save those results to a .csv file and submit them to Kaggle to obtain our final score.

# Save predictions to .csv
final_predictions.to_csv('churn_predictions.csv', index = False)

Final Kaggle competition results

After we submit the predictions, Kaggle returns a ROC AUC score computed on the testing data.

According to Kaggle, our model’s final ROC AUC score on the testing data is 0.88864, or about 88.9%, which would have placed us in the top 38.7% of submissions during the competition (1,406th place out of 3,633 teams).

There are many additional things we could have done to improve model performance, such as:

  • performing extensive feature engineering using domain expertise
  • increasing grid search ranges for tuning model parameters
  • making predictions with an ensemble model that combines output from our top models to produce a single prediction (a minimal sketch follows this list)
  • considering more complex models, such as neural networks (NN’s)
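As a minimal sketch of the ensembling idea (hypothetical, and only for the two top models that share the one-hot-encoded design matrix; the CatBoost model was fit on differently encoded data and would need its own test matrix), we could simply average predicted churn probabilities:

# Hypothetical soft-voting ensemble: average the predicted churn probabilities
rf_prob = rf_cv.best_estimator_.predict_proba(X_test)[:, 1]
xgb_prob = xgb_cv.best_estimator_.predict_proba(X_test)[:, 1]
ensemble_prob = (rf_prob + xgb_prob) / 2  # weighted averages are another option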

[1] https://www.forbes.com/councils/forbesbusinesscouncil/2022/12/12/customer-retention-versus-customer-acquisition/