To develop a model capable of reasonably accurately predicting which customers are likely to churn in the near future.
A customer who has closed all their active accounts with the bank is said to have churned. Churn can also be defined in other ways, depending on the context of the problem: for example, a customer who has not transacted for 6 months or 1 year can be considered to have churned, based on the business requirements.
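For instance, if raw transaction logs were available, an inactivity-based churn flag could be derived roughly as follows (a minimal sketch; the transactions file and the customer_id/txn_date column names are hypothetical and not part of this project's dataset):
import pandas as pd
## Hypothetical transactions table: one row per transaction, with customer_id and txn_date
txns = pd.read_csv("transactions.csv", parse_dates=["txn_date"])
snapshot_date = txns["txn_date"].max()
## Days since each customer's last transaction
days_inactive = (snapshot_date - txns.groupby("customer_id")["txn_date"].max()).dt.days
## Flag customers inactive for more than 180 days (~6 months) as churned
churn_flag = (days_inactive > 180).astype(int)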
(1) Business goal : Arrest slide in revenues or loss of active bank customers
(2) Identify data source : Transactional systems, event-based logs, Data warehouse (MySQL DBs etc.), Data Lakes, NoSQL DBs. (In this project the publicly available data has been considered from Kaggle website. Reference: https://www.kaggle.com/datasets/mervetorkan/churndataset)
(3) Audit for data quality : De-duplication of events/transactions, Complete or partial absence of data for chunks of time in between, Obscuring PII (personal identifiable information) data
(4) Define business and data-related metrics : Tracking of these metrics over time, probably through some intuitive visualizations
(i) Business metrics : Churn rate (weekly/monthly/quarterly), trend of avg. number of products per customer, percentage of dormant customers, and other such descriptive metrics
(ii) Data-related metrics : F1-score, Recall, Precision
Recall = TP/(TP + FN)
Precision = TP/(TP + FP)
F1-score = 2 x (Precision x Recall) / (Precision + Recall), i.e. the harmonic mean of Recall and Precision
where, TP = True Positive, FP = False Positive and FN = False Negative
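As a quick illustration with hypothetical labels (not from this dataset), these metrics can be computed with scikit-learn as follows:
## Hypothetical example of computing the data-related metrics
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   ## actual churn labels (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   ## model predictions (hypothetical)
precision_score(y_true, y_pred)   ## TP/(TP + FP) = 3/4 = 0.75
recall_score(y_true, y_pred)      ## TP/(TP + FN) = 3/4 = 0.75
f1_score(y_true, y_pred)          ## 2*(0.75*0.75)/(0.75+0.75) = 0.75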
(5) Prediction model output format : Since this is not going to be an online model, it doesn't require deployment. Instead, periodic (monthly/quarterly) model runs can be made, and the list of customers, along with their propensity to churn, shared with the business (Sales/Marketing) or Product team
(6) Action to be taken based on model's output/insights : Based on the output obtained from the Data Science team as above, various business interventions can be made to prevent customers from churning, e.g. customer-centric bank offers, or getting in touch with customers to address grievances. The Data Science team can also help with basic EDA to highlight different customer groups/segments and the appropriate intervention to apply to each of them.
(1) Application deployment on production servers (In the context of this problem statement, not required)
(2) [DevOps] Monitoring the scale aspects of model performance over time (Again, not required, in this case)
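As a sketch of the output format described in point (5) above (churn_model, X_scoring and customer_ids are hypothetical placeholders for a trained model, the scoring feature matrix and the corresponding customer IDs; this is not the modelling code used later in the notebook):
import pandas as pd
## Hypothetical periodic (monthly/quarterly) scoring run
churn_scores = pd.DataFrame({
    'CustomerId': customer_ids,
    'churn_propensity': churn_model.predict_proba(X_scoring)[:, 1]
}).sort_values('churn_propensity', ascending=False)
## Ranked list shared with the Sales/Marketing or Product team
churn_scores.to_csv('churn_propensity_scores.csv', index=False)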
"""!pip install ipython==7.22.0
!pip install joblib==1.0.1
!pip install lightgbm==3.3.1
!pip install matplotlib==3.3.4
!pip install numpy==1.20.1
!pip install pandas==1.3.5
!pip install scikit_learn==0.24.1
!pip install seaborn==0.11.1
!pip install shap==0.40.0
!pip install xgboost==1.5.1"""
## Display plots and charts within the notebook itself, rather than in a separate window or external viewer
%matplotlib inline
## Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
## Get multiple outputs in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
## Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
## Display all rows and columns of a dataframe instead of a truncated version
from IPython.display import display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
## Reading the downloaded Kaggle dataset (https://www.kaggle.com/datasets/mervetorkan/churndataset)
df = pd.read_csv("Churn_Modelling.csv")
df.shape
(10000, 14)
df.head(10).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
RowNumber | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
CustomerId | 15634602 | 15647311 | 15619304 | 15701354 | 15737888 | 15574012 | 15592531 | 15656148 | 15792365 | 15592389 |
Surname | Hargrave | Hill | Onio | Boni | Mitchell | Chu | Bartlett | Obinna | He | H? |
CreditScore | 619 | 608 | 502 | 699 | 850 | 645 | 822 | 376 | 501 | 684 |
Geography | France | Spain | France | France | Spain | Spain | France | Germany | France | France |
Gender | Female | Female | Female | Female | Female | Male | Male | Female | Male | Male |
Age | 42 | 41 | 42 | 39 | 43 | 44 | 50 | 29 | 44 | 27 |
Tenure | 2 | 1 | 8 | 1 | 2 | 8 | 7 | 4 | 4 | 2 |
Balance | 0.0 | 83807.86 | 159660.8 | 0.0 | 125510.82 | 113755.78 | 0.0 | 115046.74 | 142051.07 | 134603.88 |
NumOfProducts | 1 | 1 | 3 | 2 | 1 | 2 | 2 | 4 | 2 | 1 |
HasCrCard | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
IsActiveMember | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |
EstimatedSalary | 101348.88 | 112542.58 | 113931.57 | 93826.63 | 79084.1 | 149756.71 | 10062.8 | 119346.88 | 74940.5 | 71725.73 |
Exited | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
df.describe() # Describe all numerical columns
df.describe(include = ['O']) # Describe all non-numerical/categorical columns
RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 10000.00000 | 1.000000e+04 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 5000.50000 | 1.569094e+07 | 650.528800 | 38.921800 | 5.012800 | 76485.889288 | 1.530200 | 0.70550 | 0.515100 | 100090.239881 | 0.203700 |
std | 2886.89568 | 7.193619e+04 | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
min | 1.00000 | 1.556570e+07 | 350.000000 | 18.000000 | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 0.000000 | 11.580000 | 0.000000 |
25% | 2500.75000 | 1.562853e+07 | 584.000000 | 32.000000 | 3.000000 | 0.000000 | 1.000000 | 0.00000 | 0.000000 | 51002.110000 | 0.000000 |
50% | 5000.50000 | 1.569074e+07 | 652.000000 | 37.000000 | 5.000000 | 97198.540000 | 1.000000 | 1.00000 | 1.000000 | 100193.915000 | 0.000000 |
75% | 7500.25000 | 1.575323e+07 | 718.000000 | 44.000000 | 7.000000 | 127644.240000 | 2.000000 | 1.00000 | 1.000000 | 149388.247500 | 0.000000 |
max | 10000.00000 | 1.581569e+07 | 850.000000 | 92.000000 | 10.000000 | 250898.090000 | 4.000000 | 1.00000 | 1.000000 | 199992.480000 | 1.000000 |
Surname | Geography | Gender | |
---|---|---|---|
count | 10000 | 10000 | 10000 |
unique | 2932 | 3 | 2 |
top | Smith | France | Male |
freq | 32 | 5014 | 5457 |
df.head(3)
RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
## Checking number of customers in the dataset
df.shape[0]
10000
df.Geography.value_counts(normalize=True)
France     0.5014
Germany    0.2509
Spain      0.2477
Name: Geography, dtype: float64
## Separating out different columns into various categories as defined above
target_var = ['Exited']
num_feats = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
cat_feats = ['Surname', 'Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
Among these, Tenure and NumOfProducts are ordinal variables. HasCrCard and IsActiveMember are actually binary categorical variables.
## Separating out target variable and removing the non-essential columns
y = df[target_var].values
Customer transaction patterns can also help us ascertain whether the customer has actually churned or not. For example, a customer might transact daily/weekly vs a customer who transacts annually.
The objective of these questions is to understand the data and to distil the problem statement and the stated goal further. If more data or context can be obtained in the process, it adds to the final model performance.
Since this is the only data available, we will keep aside a test set to evaluate our model at the very end in order to estimate our chosen model's performance on unseen data / new data.
A validation set is also created, which we will use to evaluate and tune the baseline models.
from sklearn.model_selection import train_test_split
## Keeping aside a 10% test/holdout set
df_train_val, df_test, y_train_val, y_test = train_test_split(df, y.ravel(), test_size = 0.1, random_state = 42)
## Splitting into train and validation set
df_train, df_val, y_train, y_val = train_test_split(df_train_val, y_train_val, test_size = 0.12, random_state = 42)
df_train.shape, df_val.shape, df_test.shape, y_train.shape, y_val.shape, y_test.shape
np.mean(y_train), np.mean(y_val), np.mean(y_test)
((7920, 14), (1080, 14), (1000, 14), (7920,), (1080,), (1000,))
(0.20303030303030303, 0.22037037037037038, 0.191)
## CreditScore
sns.set(style="whitegrid")
sns.boxplot(y = df_train['CreditScore'])
<Axes: ylabel='CreditScore'>
## Age
sns.boxplot(y = df_train['Age'])
<Axes: ylabel='Age'>
## Tenure
sns.violinplot(y = df_train.Tenure)
<Axes: ylabel='Tenure'>
## Balance
sns.violinplot(y = df_train['Balance'])
<Axes: ylabel='Balance'>
## NumOfProducts
sns.set(style = 'ticks')
sns.distplot(df_train.NumOfProducts, hist=True, kde=False)
<Axes: xlabel='NumOfProducts'>
## EstimatedSalary
sns.kdeplot(df_train.EstimatedSalary)
<Axes: xlabel='EstimatedSalary', ylabel='Density'>
Outliers can be observed from the univariate plots of the different features.
Outliers can either be logically improbable values (as per the feature definition) or just extreme values compared to the rest of the feature distribution.
Outliers in numerical features are typically very high/low values, lying in the top or bottom 1% of the distribution, or values which are not possible as per the feature definition.
Outliers in categorical features are usually levels with a very low frequency/number of samples compared to the other levels.
As part of outlier treatment, the rows containing outliers can be removed from the training set, provided they do not form a significant chunk of the dataset (< 0.5-1%).
In cases where the outlier value is logically faulty, e.g. a negative Age or CreditScore > 900, the record can instead be replaced with the mean of the feature or the nearest logically valid min/max value.
No outliers are observed in any feature of this dataset.
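Had such outliers been present, a simple IQR-based treatment could have been applied to the training set, e.g. capping extreme values (a sketch only; whether to cap, replace or drop would depend on the feature and the business logic):
## Sketch of IQR-based capping for a numerical feature, e.g. Age
q1, q3 = df_train['Age'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_train['Age'] = df_train['Age'].clip(lower=lower, upper=upper)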
## No missing values!
df_train.isnull().sum()
RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
No missing values present in this dataset.
As a rule of thumb, we can consider using : label encoding for binary categorical columns, one-hot encoding for low-cardinality columns (roughly <= 10 distinct levels), and target encoding for high-cardinality columns.
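A quick cardinality check on the categorical features helps apply this rule of thumb (Gender, HasCrCard and IsActiveMember are binary, Geography has 3 levels, and Surname has very high cardinality):
## Number of distinct levels per categorical feature
df_train[cat_feats].nunique()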
## The non-sklearn method
df_train['Gender_cat'] = df_train.Gender.astype('category').cat.codes
df_train.sample(10)
RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Gender_cat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8088 | 8089 | 15815656 | Hopkins | 541 | Germany | Female | 39 | 9 | 100116.67 | 1 | 1 | 1 | 199808.10 | 1 | 0 |
9955 | 9956 | 15611338 | Kashiwagi | 714 | Spain | Male | 29 | 4 | 0.00 | 2 | 1 | 1 | 37605.90 | 0 | 1 |
7608 | 7609 | 15598574 | Uwakwe | 695 | Spain | Female | 31 | 5 | 0.00 | 2 | 0 | 1 | 13998.88 | 0 | 0 |
4025 | 4026 | 15640769 | Hobbs | 660 | France | Male | 63 | 8 | 137841.53 | 1 | 1 | 1 | 42790.29 | 0 | 1 |
6115 | 6116 | 15604813 | Zaytseva | 494 | France | Male | 40 | 7 | 0.00 | 2 | 0 | 1 | 158071.69 | 0 | 1 |
3224 | 3225 | 15713463 | Tate | 645 | Germany | Female | 41 | 2 | 138881.04 | 1 | 1 | 0 | 129936.53 | 1 | 0 |
468 | 469 | 15633283 | Padovano | 536 | France | Male | 35 | 8 | 0.00 | 2 | 1 | 0 | 64833.28 | 0 | 1 |
6348 | 6349 | 15707505 | Taylor | 699 | Spain | Male | 31 | 8 | 125927.51 | 2 | 1 | 0 | 147661.47 | 0 | 1 |
5615 | 5616 | 15775339 | Lori | 520 | France | Female | 29 | 8 | 95947.76 | 1 | 1 | 0 | 4696.44 | 0 | 0 |
5304 | 5305 | 15671345 | Piccio | 531 | Spain | Female | 42 | 6 | 75302.85 | 2 | 0 | 0 | 57034.35 | 0 | 0 |
df_train.drop('Gender_cat', axis=1, inplace = True)
## The sklearn method
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
We fit only on train dataset as that's the only data we'll assume we have. We'll treat validation and test sets as unseen data. Hence, they can't be used for fitting the encoders.
## Label encoding of Gender variable
df_train['Gender'] = le.fit_transform(df_train['Gender'])
le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
le_name_mapping
{'Female': 0, 'Male': 1}
## Encoding Gender feature for validation and test set
df_val['Gender'] = df_val.Gender.map(le_name_mapping)
df_test['Gender'] = df_test.Gender.map(le_name_mapping)
## Filling missing/NaN values created due to new categorical levels
df_val['Gender'].fillna(-1, inplace=True)
df_test['Gender'].fillna(-1, inplace=True)
df_train.Gender.unique(), df_val.Gender.unique(), df_test.Gender.unique()
(array([1, 0]), array([1, 0]), array([1, 0]))
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le_ohe = LabelEncoder()
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse=False)
enc_train = le_ohe.fit_transform(df_train.Geography).reshape(df_train.shape[0],1)
enc_train.shape
np.unique(enc_train)
(7920, 1)
array([0, 1, 2])
ohe_train = ohe.fit_transform(enc_train)
ohe_train
array([[0., 1., 0.], [1., 0., 0.], [1., 0., 0.], ..., [1., 0., 0.], [0., 1., 0.], [0., 1., 0.]])
le_ohe_name_mapping = dict(zip(le_ohe.classes_, le_ohe.transform(le_ohe.classes_)))
le_ohe_name_mapping
{'France': 0, 'Germany': 1, 'Spain': 2}
## Encoding Geography feature for validation and test set
enc_val = df_val.Geography.map(le_ohe_name_mapping).ravel().reshape(-1,1)
enc_test = df_test.Geography.map(le_ohe_name_mapping).ravel().reshape(-1,1)
## Filling missing/NaN values created due to new categorical levels
enc_val[np.isnan(enc_val)] = 9999
enc_test[np.isnan(enc_test)] = 9999
np.unique(enc_val)
np.unique(enc_test)
array([0, 1, 2])
array([0, 1, 2])
ohe_val = ohe.transform(enc_val)
ohe_test = ohe.transform(enc_test)
### Show what happens when a new value is inputted into the OHE
ohe.transform(np.array([[9999]]))
array([[0., 0., 0.]])
cols = ['country_' + str(x) for x in le_ohe_name_mapping.keys()]
cols
['country_France', 'country_Germany', 'country_Spain']
## Adding to the respective dataframes
df_train = pd.concat([df_train.reset_index(), pd.DataFrame(ohe_train, columns = cols)], axis = 1).drop(['index'], axis=1)
df_val = pd.concat([df_val.reset_index(), pd.DataFrame(ohe_val, columns = cols)], axis = 1).drop(['index'], axis=1)
df_test = pd.concat([df_test.reset_index(), pd.DataFrame(ohe_test, columns = cols)], axis = 1).drop(['index'], axis=1)
print("Training set")
df_train.head()
print("\n\nValidation set")
df_val.head()
print("\n\nTest set")
df_test.head()
Training set
RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4563 | 15795895 | Yermakova | 678 | Germany | 1 | 36 | 1 | 117864.85 | 2 | 1 | 0 | 27619.06 | 0 | 0.0 | 1.0 | 0.0 |
1 | 6499 | 15770405 | Warlow-Davies | 613 | France | 0 | 27 | 5 | 125167.74 | 1 | 1 | 0 | 199104.52 | 0 | 1.0 | 0.0 | 0.0 |
2 | 6073 | 15803908 | Fu | 628 | France | 1 | 45 | 9 | 0.00 | 2 | 1 | 1 | 96862.56 | 0 | 1.0 | 0.0 | 0.0 |
3 | 5814 | 15763515 | Shih | 513 | France | 1 | 30 | 5 | 0.00 | 2 | 1 | 0 | 162523.66 | 0 | 1.0 | 0.0 | 0.0 |
4 | 7408 | 15766663 | Mahmood | 639 | France | 1 | 22 | 4 | 0.00 | 2 | 1 | 0 | 28188.96 | 0 | 1.0 | 0.0 | 0.0 |
Validation set
RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7219 | 15767231 | Sun | 757 | France | 1 | 36 | 7 | 144852.06 | 1 | 0 | 0 | 130861.95 | 0 | 1.0 | 0.0 | 0.0 |
1 | 8768 | 15585466 | Russo | 552 | France | 1 | 29 | 10 | 0.00 | 2 | 1 | 0 | 12186.83 | 0 | 1.0 | 0.0 | 0.0 |
2 | 2289 | 15579166 | Munro | 619 | France | 0 | 30 | 7 | 70729.17 | 1 | 1 | 1 | 160948.87 | 0 | 1.0 | 0.0 | 0.0 |
3 | 5361 | 15661349 | Perkins | 633 | France | 1 | 35 | 10 | 0.00 | 2 | 1 | 0 | 65675.47 | 0 | 1.0 | 0.0 | 0.0 |
4 | 1853 | 15573741 | Aliyeva | 698 | Spain | 1 | 38 | 10 | 95010.92 | 1 | 1 | 1 | 105227.86 | 0 | 0.0 | 0.0 | 1.0 |
Test set
RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6253 | 15687492 | Anderson | 596 | Germany | 1 | 32 | 3 | 96709.07 | 2 | 0 | 0 | 41788.37 | 0 | 0.0 | 1.0 | 0.0 |
1 | 4685 | 15736963 | Herring | 623 | France | 1 | 43 | 1 | 0.00 | 2 | 1 | 1 | 146379.30 | 0 | 1.0 | 0.0 | 0.0 |
2 | 1732 | 15721730 | Amechi | 601 | Spain | 0 | 44 | 4 | 0.00 | 2 | 1 | 0 | 58561.31 | 0 | 0.0 | 0.0 | 1.0 |
3 | 4743 | 15762134 | Liang | 506 | Germany | 1 | 59 | 8 | 119152.10 | 2 | 1 | 1 | 170679.74 | 0 | 0.0 | 1.0 | 0.0 |
4 | 4522 | 15648898 | Chuang | 560 | Spain | 0 | 27 | 7 | 124995.98 | 1 | 1 | 1 | 114669.79 | 0 | 0.0 | 0.0 | 1.0 |
## Drop the Geography column
df_train.drop(['Geography'], axis = 1, inplace=True)
df_val.drop(['Geography'], axis = 1, inplace=True)
df_test.drop(['Geography'], axis = 1, inplace=True)
Target encoding is generally useful when dealing with categorical variables of high cardinality (high number of levels).
Here, we'll encode the column 'Surname' (which has 2932 different values!) with the mean of target variable for that level
df_train.head()
RowNumber | CustomerId | Surname | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4563 | 15795895 | Yermakova | 678 | 1 | 36 | 1 | 117864.85 | 2 | 1 | 0 | 27619.06 | 0 | 0.0 | 1.0 | 0.0 |
1 | 6499 | 15770405 | Warlow-Davies | 613 | 0 | 27 | 5 | 125167.74 | 1 | 1 | 0 | 199104.52 | 0 | 1.0 | 0.0 | 0.0 |
2 | 6073 | 15803908 | Fu | 628 | 1 | 45 | 9 | 0.00 | 2 | 1 | 1 | 96862.56 | 0 | 1.0 | 0.0 | 0.0 |
3 | 5814 | 15763515 | Shih | 513 | 1 | 30 | 5 | 0.00 | 2 | 1 | 0 | 162523.66 | 0 | 1.0 | 0.0 | 0.0 |
4 | 7408 | 15766663 | Mahmood | 639 | 1 | 22 | 4 | 0.00 | 2 | 1 | 0 | 28188.96 | 0 | 1.0 | 0.0 | 0.0 |
means = df_train.groupby(['Surname']).Exited.mean()
means.head()
Surname
Abazu       0.00
Abbie       0.00
Abbott      0.25
Abdullah    1.00
Abdulov     0.00
Name: Exited, dtype: float64
global_mean = y_train.mean()
global_mean
0.20303030303030303
## Creating new encoded features for surname - Target (mean) encoding
df_train['Surname_mean_churn'] = df_train.Surname.map(means)
df_train['Surname_mean_churn'].fillna(global_mean, inplace=True)
But, the problem with Target encoding is that it might cause data leakage, as we are considering feedback from the target variable while computing any summary statistic.
A solution is to use a modified version : Leave-one-out Target encoding.
In this, for a particular data point or row, the mean of the target is calculated by considering all rows in the same categorical level except itself. This mitigates data leakage and overfitting to some extent.
Mean for a category c : m_c = S_c / n_c ..... (1)
What we need is the mean excluding a single sample. This can be expressed as : m_i = (S_c - t_i) / (n_c - 1) ..... (2)
Using (1) and (2), we get : m_i = (n_c * m_c - t_i) / (n_c - 1)
where, S_c = sum of the target variable for category c, n_c = number of rows in category c, and t_i = target value of the row whose encoding is being calculated.
For example, if a surname occurs n_c = 3 times with S_c = 2 churners among them, a churned row (t_i = 1) gets the encoding (2 - 1)/(3 - 1) = 0.5, while a non-churned row (t_i = 0) gets 2/2 = 1.0.
## Calculate frequency of each category
freqs = df_train.groupby(['Surname']).size()
freqs.head()
Surname
Abazu       2
Abbie       1
Abbott      4
Abdullah    1
Abdulov     1
dtype: int64
## Create frequency encoding - Number of instances of each category in the data
df_train['Surname_freq'] = df_train.Surname.map(freqs)
df_train['Surname_freq'].fillna(0, inplace=True)
## Create Leave-one-out target encoding for Surname
df_train['Surname_enc'] = ((df_train.Surname_freq * df_train.Surname_mean_churn) - df_train.Exited)/(df_train.Surname_freq - 1)
df_train.head(10)
RowNumber | CustomerId | Surname | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | Surname_mean_churn | Surname_freq | Surname_enc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4563 | 15795895 | Yermakova | 678 | 1 | 36 | 1 | 117864.85 | 2 | 1 | 0 | 27619.06 | 0 | 0.0 | 1.0 | 0.0 | 0.000000 | 4 | 0.000000 |
1 | 6499 | 15770405 | Warlow-Davies | 613 | 0 | 27 | 5 | 125167.74 | 1 | 1 | 0 | 199104.52 | 0 | 1.0 | 0.0 | 0.0 | 0.000000 | 2 | 0.000000 |
2 | 6073 | 15803908 | Fu | 628 | 1 | 45 | 9 | 0.00 | 2 | 1 | 1 | 96862.56 | 0 | 1.0 | 0.0 | 0.0 | 0.200000 | 10 | 0.222222 |
3 | 5814 | 15763515 | Shih | 513 | 1 | 30 | 5 | 0.00 | 2 | 1 | 0 | 162523.66 | 0 | 1.0 | 0.0 | 0.0 | 0.285714 | 21 | 0.300000 |
4 | 7408 | 15766663 | Mahmood | 639 | 1 | 22 | 4 | 0.00 | 2 | 1 | 0 | 28188.96 | 0 | 1.0 | 0.0 | 0.0 | 0.333333 | 3 | 0.500000 |
5 | 5045 | 15789498 | Miller | 562 | 1 | 30 | 3 | 111099.79 | 2 | 0 | 0 | 140650.19 | 0 | 1.0 | 0.0 | 0.0 | 0.285714 | 14 | 0.307692 |
6 | 973 | 15605918 | Padovesi | 635 | 1 | 43 | 5 | 78992.75 | 2 | 0 | 0 | 153265.31 | 0 | 0.0 | 1.0 | 0.0 | 0.200000 | 10 | 0.222222 |
7 | 5986 | 15702145 | Edments | 705 | 1 | 33 | 7 | 68423.89 | 1 | 1 | 1 | 64872.55 | 0 | 0.0 | 0.0 | 1.0 | 0.000000 | 1 | NaN |
8 | 9316 | 15653110 | Chan | 694 | 1 | 42 | 8 | 133767.19 | 1 | 1 | 0 | 36405.21 | 0 | 1.0 | 0.0 | 0.0 | 0.000000 | 3 | 0.000000 |
9 | 9825 | 15658980 | Matthews | 711 | 1 | 26 | 9 | 128793.63 | 1 | 1 | 0 | 19262.05 | 0 | 0.0 | 1.0 | 0.0 | 0.000000 | 4 | 0.000000 |
## Fill NaNs occurring due to the category frequency being 1 (division by zero in the LOO formula)
df_train['Surname_enc'].fillna((((df_train.shape[0] * global_mean) - df_train.Exited) / (df_train.shape[0] - 1)), inplace=True)
df_train.head(10)
RowNumber | CustomerId | Surname | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | Surname_mean_churn | Surname_freq | Surname_enc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4563 | 15795895 | Yermakova | 678 | 1 | 36 | 1 | 117864.85 | 2 | 1 | 0 | 27619.06 | 0 | 0.0 | 1.0 | 0.0 | 0.000000 | 4 | 0.000000 |
1 | 6499 | 15770405 | Warlow-Davies | 613 | 0 | 27 | 5 | 125167.74 | 1 | 1 | 0 | 199104.52 | 0 | 1.0 | 0.0 | 0.0 | 0.000000 | 2 | 0.000000 |
2 | 6073 | 15803908 | Fu | 628 | 1 | 45 | 9 | 0.00 | 2 | 1 | 1 | 96862.56 | 0 | 1.0 | 0.0 | 0.0 | 0.200000 | 10 | 0.222222 |
3 | 5814 | 15763515 | Shih | 513 | 1 | 30 | 5 | 0.00 | 2 | 1 | 0 | 162523.66 | 0 | 1.0 | 0.0 | 0.0 | 0.285714 | 21 | 0.300000 |
4 | 7408 | 15766663 | Mahmood | 639 | 1 | 22 | 4 | 0.00 | 2 | 1 | 0 | 28188.96 | 0 | 1.0 | 0.0 | 0.0 | 0.333333 | 3 | 0.500000 |
5 | 5045 | 15789498 | Miller | 562 | 1 | 30 | 3 | 111099.79 | 2 | 0 | 0 | 140650.19 | 0 | 1.0 | 0.0 | 0.0 | 0.285714 | 14 | 0.307692 |
6 | 973 | 15605918 | Padovesi | 635 | 1 | 43 | 5 | 78992.75 | 2 | 0 | 0 | 153265.31 | 0 | 0.0 | 1.0 | 0.0 | 0.200000 | 10 | 0.222222 |
7 | 5986 | 15702145 | Edments | 705 | 1 | 33 | 7 | 68423.89 | 1 | 1 | 1 | 64872.55 | 0 | 0.0 | 0.0 | 1.0 | 0.000000 | 1 | 0.203056 |
8 | 9316 | 15653110 | Chan | 694 | 1 | 42 | 8 | 133767.19 | 1 | 1 | 0 | 36405.21 | 0 | 1.0 | 0.0 | 0.0 | 0.000000 | 3 | 0.000000 |
9 | 9825 | 15658980 | Matthews | 711 | 1 | 26 | 9 | 128793.63 | 1 | 1 | 0 | 19262.05 | 0 | 0.0 | 1.0 | 0.0 | 0.000000 | 4 | 0.000000 |
On validation and test set, we'll apply the normal Target encoding mapping as obtained from the training set
## Replacing by category means and new category levels by global mean
df_val['Surname_enc'] = df_val.Surname.map(means)
df_val['Surname_enc'].fillna(global_mean, inplace=True)
df_test['Surname_enc'] = df_test.Surname.map(means)
df_test['Surname_enc'].fillna(global_mean, inplace=True)
## Show that LOO target encoding removes the direct correlation (leakage) between the encoded feature and the target
df_train[['Surname_mean_churn', 'Surname_enc', 'Exited']].corr()
Surname_mean_churn | Surname_enc | Exited | |
---|---|---|---|
Surname_mean_churn | 1.000000 | 0.54823 | 0.562677 |
Surname_enc | 0.548230 | 1.00000 | -0.026440 |
Exited | 0.562677 | -0.02644 | 1.000000 |
### Deleting the 'Surname' and other redundant column across the three datasets
df_train.drop(['Surname_mean_churn'], axis=1, inplace=True)
df_train.drop(['Surname_freq'], axis=1, inplace=True)
df_train.drop(['Surname'], axis=1, inplace=True)
df_val.drop(['Surname'], axis=1, inplace=True)
df_test.drop(['Surname'], axis=1, inplace=True)
df_train.head()
df_val.head()
df_test.head()
RowNumber | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | Surname_enc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4563 | 15795895 | 678 | 1 | 36 | 1 | 117864.85 | 2 | 1 | 0 | 27619.06 | 0 | 0.0 | 1.0 | 0.0 | 0.000000 |
1 | 6499 | 15770405 | 613 | 0 | 27 | 5 | 125167.74 | 1 | 1 | 0 | 199104.52 | 0 | 1.0 | 0.0 | 0.0 | 0.000000 |
2 | 6073 | 15803908 | 628 | 1 | 45 | 9 | 0.00 | 2 | 1 | 1 | 96862.56 | 0 | 1.0 | 0.0 | 0.0 | 0.222222 |
3 | 5814 | 15763515 | 513 | 1 | 30 | 5 | 0.00 | 2 | 1 | 0 | 162523.66 | 0 | 1.0 | 0.0 | 0.0 | 0.300000 |
4 | 7408 | 15766663 | 639 | 1 | 22 | 4 | 0.00 | 2 | 1 | 0 | 28188.96 | 0 | 1.0 | 0.0 | 0.0 | 0.500000 |
RowNumber | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | Surname_enc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7219 | 15767231 | 757 | 1 | 36 | 7 | 144852.06 | 1 | 0 | 0 | 130861.95 | 0 | 1.0 | 0.0 | 0.0 | 0.111111 |
1 | 8768 | 15585466 | 552 | 1 | 29 | 10 | 0.00 | 2 | 1 | 0 | 12186.83 | 0 | 1.0 | 0.0 | 0.0 | 0.200000 |
2 | 2289 | 15579166 | 619 | 0 | 30 | 7 | 70729.17 | 1 | 1 | 1 | 160948.87 | 0 | 1.0 | 0.0 | 0.0 | 0.500000 |
3 | 5361 | 15661349 | 633 | 1 | 35 | 10 | 0.00 | 2 | 1 | 0 | 65675.47 | 0 | 1.0 | 0.0 | 0.0 | 0.000000 |
4 | 1853 | 15573741 | 698 | 1 | 38 | 10 | 95010.92 | 1 | 1 | 1 | 105227.86 | 0 | 0.0 | 0.0 | 1.0 | 1.000000 |
RowNumber | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | Surname_enc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6253 | 15687492 | 596 | 1 | 32 | 3 | 96709.07 | 2 | 0 | 0 | 41788.37 | 0 | 0.0 | 1.0 | 0.0 | 0.083333 |
1 | 4685 | 15736963 | 623 | 1 | 43 | 1 | 0.00 | 2 | 1 | 1 | 146379.30 | 0 | 1.0 | 0.0 | 0.0 | 0.203030 |
2 | 1732 | 15721730 | 601 | 0 | 44 | 4 | 0.00 | 2 | 1 | 0 | 58561.31 | 0 | 0.0 | 0.0 | 1.0 | 0.333333 |
3 | 4743 | 15762134 | 506 | 1 | 59 | 8 | 119152.10 | 2 | 1 | 1 | 170679.74 | 0 | 0.0 | 1.0 | 0.0 | 0.153846 |
4 | 4522 | 15648898 | 560 | 0 | 27 | 7 | 124995.98 | 1 | 1 | 1 | 114669.79 | 0 | 0.0 | 0.0 | 1.0 | 0.230769 |
## Check linear correlation (rho) between individual features and the target variable
corr = df_train.corr()
corr
RowNumber | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | Surname_enc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | 1.000000 | 0.003129 | 0.002474 | 0.025877 | -0.001585 | -0.012975 | -0.009557 | 0.001490 | -0.003153 | 0.018006 | -0.009468 | -0.017673 | 0.004881 | 0.003135 | -0.008747 | 0.010802 |
CustomerId | 0.003129 | 1.000000 | 0.005260 | 0.000753 | 0.007537 | -0.007726 | -0.017081 | 0.015325 | 0.002016 | -0.001213 | 0.017407 | -0.010944 | -0.007524 | -0.006682 | 0.015321 | -0.005145 |
CreditScore | 0.002474 | 0.005260 | 1.000000 | 0.000354 | 0.002099 | 0.005994 | -0.001507 | 0.014110 | -0.011868 | 0.035057 | 0.000358 | -0.028117 | -0.009481 | 0.003393 | 0.007561 | -0.000739 |
Gender | 0.025877 | 0.000753 | 0.000354 | 1.000000 | -0.024446 | 0.010749 | 0.009380 | -0.026795 | 0.007550 | 0.028094 | -0.011007 | -0.102331 | 0.000823 | -0.018412 | 0.017361 | 0.008002 |
Age | -0.001585 | 0.007537 | 0.002099 | -0.024446 | 1.000000 | -0.011384 | 0.027721 | -0.033305 | -0.019633 | 0.093573 | -0.006827 | 0.288221 | -0.038881 | 0.048764 | -0.003648 | -0.010844 |
Tenure | -0.012975 | -0.007726 | 0.005994 | 0.010749 | -0.011384 | 1.000000 | -0.013081 | 0.018231 | 0.026148 | -0.021263 | 0.010145 | -0.010660 | 0.000021 | -0.003131 | 0.003090 | -0.006753 |
Balance | -0.009557 | -0.017081 | -0.001507 | 0.009380 | 0.027721 | -0.013081 | 1.000000 | -0.304318 | -0.021464 | -0.008085 | 0.027247 | 0.113377 | -0.231770 | 0.405616 | -0.136044 | 0.006925 |
NumOfProducts | 0.001490 | 0.015325 | 0.014110 | -0.026795 | -0.033305 | 0.018231 | -0.304318 | 1.000000 | 0.007202 | 0.014809 | 0.009769 | -0.039200 | 0.002991 | -0.015926 | 0.012388 | -0.002020 |
HasCrCard | -0.003153 | 0.002016 | -0.011868 | 0.007550 | -0.019633 | 0.026148 | -0.021464 | 0.007202 | 1.000000 | -0.006526 | -0.008413 | -0.013659 | 0.005881 | 0.008197 | -0.014934 | -0.000551 |
IsActiveMember | 0.018006 | -0.001213 | 0.035057 | 0.028094 | 0.093573 | -0.021263 | -0.008085 | 0.014809 | -0.006526 | 1.000000 | -0.016446 | -0.152477 | 0.002126 | -0.020570 | 0.018003 | 0.004902 |
EstimatedSalary | -0.009468 | 0.017407 | 0.000358 | -0.011007 | -0.006827 | 0.010145 | 0.027247 | 0.009769 | -0.008413 | -0.016446 | 1.000000 | 0.015881 | -0.004512 | 0.010583 | -0.005320 | -0.009899 |
Exited | -0.017673 | -0.010944 | -0.028117 | -0.102331 | 0.288221 | -0.010660 | 0.113377 | -0.039200 | -0.013659 | -0.152477 | 0.015881 | 1.000000 | -0.106006 | 0.173492 | -0.050264 | -0.026440 |
country_France | 0.004881 | -0.007524 | -0.009481 | 0.000823 | -0.038881 | 0.000021 | -0.231770 | 0.002991 | 0.005881 | 0.002126 | -0.004512 | -0.106006 | 1.000000 | -0.575048 | -0.581494 | -0.007467 |
country_Germany | 0.003135 | -0.006682 | 0.003393 | -0.018412 | 0.048764 | -0.003131 | 0.405616 | -0.015926 | 0.008197 | -0.020570 | 0.010583 | 0.173492 | -0.575048 | 1.000000 | -0.331194 | -0.006132 |
country_Spain | -0.008747 | 0.015321 | 0.007561 | 0.017361 | -0.003648 | 0.003090 | -0.136044 | 0.012388 | -0.014934 | 0.018003 | -0.005320 | -0.050264 | -0.581494 | -0.331194 | 1.000000 | 0.014710 |
Surname_enc | 0.010802 | -0.005145 | -0.000739 | 0.008002 | -0.010844 | -0.006753 | 0.006925 | -0.002020 | -0.000551 | 0.004902 | -0.009899 | -0.026440 | -0.007467 | -0.006132 | 0.014710 | 1.000000 |
sns.heatmap(corr, cmap = 'coolwarm')
<Axes: >
None of the features are highly correlated with the target variable. But some of them have slight linear associations with the target variable.
Continuous features - Age, Balance
Categorical variables - Gender, IsActiveMember, country_Germany, country_France
sns.boxplot(x = "Exited", y = "Age", data = df_train, palette="Set3")
<Axes: xlabel='Exited', ylabel='Age'>
sns.violinplot(x = "Exited", y = "Balance", data = df_train, palette="Set3")
<Axes: xlabel='Exited', ylabel='Balance'>
# Check association of categorical features with target variable
cat_vars_bv = ['Gender', 'IsActiveMember', 'country_Germany', 'country_France']
for col in cat_vars_bv:
df_train.groupby([col]).Exited.mean()
Gender
0    0.248191
1    0.165511
Name: Exited, dtype: float64

IsActiveMember
0    0.266285
1    0.143557
Name: Exited, dtype: float64

country_Germany
0.0    0.163091
1.0    0.324974
Name: Exited, dtype: float64

country_France
0.0    0.245877
1.0    0.160593
Name: Exited, dtype: float64
col = 'NumOfProducts'
df_train.groupby([col]).Exited.mean()
df_train[col].value_counts()
NumOfProducts
1    0.273428
2    0.076881
3    0.825112
4    1.000000
Name: Exited, dtype: float64

1    4023
2    3629
3     223
4      45
Name: NumOfProducts, dtype: int64
df_train.columns
Index(['RowNumber', 'CustomerId', 'CreditScore', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited', 'country_France', 'country_Germany', 'country_Spain', 'Surname_enc'], dtype='object')
Creating some new features based on simple interactions between the existing features.
eps = 1e-6
df_train['bal_per_product'] = df_train.Balance/(df_train.NumOfProducts + eps)
df_train['bal_by_est_salary'] = df_train.Balance/(df_train.EstimatedSalary + eps)
df_train['tenure_age_ratio'] = df_train.Tenure/(df_train.Age + eps)
df_train['age_surname_mean_churn'] = np.sqrt(df_train.Age) * df_train.Surname_enc
df_train.head()
RowNumber | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | Surname_enc | bal_per_product | bal_by_est_salary | tenure_age_ratio | age_surname_mean_churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4563 | 15795895 | 678 | 1 | 36 | 1 | 117864.85 | 2 | 1 | 0 | 27619.06 | 0 | 0.0 | 1.0 | 0.0 | 0.000000 | 58932.395534 | 4.267519 | 0.027778 | 0.000000 |
1 | 6499 | 15770405 | 613 | 0 | 27 | 5 | 125167.74 | 1 | 1 | 0 | 199104.52 | 0 | 1.0 | 0.0 | 0.0 | 0.000000 | 125167.614832 | 0.628653 | 0.185185 | 0.000000 |
2 | 6073 | 15803908 | 628 | 1 | 45 | 9 | 0.00 | 2 | 1 | 1 | 96862.56 | 0 | 1.0 | 0.0 | 0.0 | 0.222222 | 0.000000 | 0.000000 | 0.200000 | 1.490712 |
3 | 5814 | 15763515 | 513 | 1 | 30 | 5 | 0.00 | 2 | 1 | 0 | 162523.66 | 0 | 1.0 | 0.0 | 0.0 | 0.300000 | 0.000000 | 0.000000 | 0.166667 | 1.643168 |
4 | 7408 | 15766663 | 639 | 1 | 22 | 4 | 0.00 | 2 | 1 | 0 | 28188.96 | 0 | 1.0 | 0.0 | 0.0 | 0.500000 | 0.000000 | 0.000000 | 0.181818 | 2.345208 |
new_cols = ['bal_per_product','bal_by_est_salary','tenure_age_ratio','age_surname_mean_churn']
## Ensuring that the new columns don't have any missing values
df_train[new_cols].isnull().sum()
bal_per_product 0 bal_by_est_salary 0 tenure_age_ratio 0 age_surname_mean_churn 0 dtype: int64
## Linear association of new columns with target variables to judge importance
sns.heatmap(df_train[new_cols + ['Exited']].corr(), annot=True)
<Axes: >
Out of the new features, ones with slight linear association/correlation are : bal_per_product and tenure_age_ratio
## Creating new interaction feature terms for validation set
eps = 1e-6
df_val['bal_per_product'] = df_val.Balance/(df_val.NumOfProducts + eps)
df_val['bal_by_est_salary'] = df_val.Balance/(df_val.EstimatedSalary + eps)
df_val['tenure_age_ratio'] = df_val.Tenure/(df_val.Age + eps)
df_val['age_surname_mean_churn'] = np.sqrt(df_val.Age) * df_val.Surname_enc
## Creating new interaction feature terms for test set
eps = 1e-6
df_test['bal_per_product'] = df_test.Balance/(df_test.NumOfProducts + eps)
df_test['bal_by_est_salary'] = df_test.Balance/(df_test.EstimatedSalary + eps)
df_test['tenure_age_ratio'] = df_test.Tenure/(df_test.Age + eps)
df_test['age_surname_mean_churn'] = np.sqrt(df_test.Age) * df_test.Surname_enc
Different methods : log transform, square-root transform, etc. (demonstrated below on EstimatedSalary)
### Demo-ing feature transformations
sns.distplot(df_train.EstimatedSalary, hist=False)
<Axes: xlabel='EstimatedSalary', ylabel='Density'>
sns.distplot(np.sqrt(df_train.EstimatedSalary), hist=False)
#sns.distplot(np.log10(1+df_train.EstimatedSalary), hist=False)
<Axes: xlabel='EstimatedSalary', ylabel='Density'>
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
df_train.columns
Index(['RowNumber', 'CustomerId', 'CreditScore', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited', 'country_France', 'country_Germany', 'country_Spain', 'Surname_enc', 'bal_per_product', 'bal_by_est_salary', 'tenure_age_ratio', 'age_surname_mean_churn'], dtype='object')
Scaling only continuous variables
cont_vars = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary', 'Surname_enc', 'bal_per_product'
, 'bal_by_est_salary', 'tenure_age_ratio', 'age_surname_mean_churn']
cat_vars = ['Gender', 'HasCrCard', 'IsActiveMember', 'country_France', 'country_Germany', 'country_Spain']
## Scaling only continuous columns
cols_to_scale = cont_vars
sc_X_train = sc.fit_transform(df_train[cols_to_scale])
## Converting from array to dataframe and naming the respective features/columns
sc_X_train = pd.DataFrame(data = sc_X_train, columns = cols_to_scale)
sc_X_train.shape
sc_X_train.head()
(7920, 11)
CreditScore | Age | Tenure | Balance | NumOfProducts | EstimatedSalary | Surname_enc | bal_per_product | bal_by_est_salary | tenure_age_ratio | age_surname_mean_churn | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.284761 | -0.274383 | -1.389130 | 0.670778 | 0.804059 | -1.254732 | -1.079210 | -0.062389 | 0.095448 | -1.232035 | -1.062507 |
1 | -0.389351 | -1.128482 | -0.004763 | 0.787860 | -0.912423 | 1.731950 | -1.079210 | 1.104840 | -0.118834 | 0.525547 | -1.062507 |
2 | -0.233786 | 0.579716 | 1.379604 | -1.218873 | 0.804059 | -0.048751 | 0.094549 | -1.100925 | -0.155854 | 0.690966 | 0.193191 |
3 | -1.426446 | -0.843782 | -0.004763 | -1.218873 | 0.804059 | 1.094838 | 0.505364 | -1.100925 | -0.155854 | 0.318773 | 0.321611 |
4 | -0.119706 | -1.602981 | -0.350855 | -1.218873 | 0.804059 | -1.244806 | 1.561746 | -1.100925 | -0.155854 | 0.487952 | 0.912973 |
## Mapping learnt on the continuous features
sc_map = {'mean':sc.mean_, 'std':np.sqrt(sc.var_)}
sc_map
{'mean': array([6.50542424e+02, 3.88912879e+01, 5.01376263e+00, 7.60258447e+04, 1.53156566e+00, 9.96616540e+04, 2.04321788e-01, 6.24727199e+04, 2.64665647e+00, 1.38117689e-01, 1.26136416e+00]), 'std': array([9.64231806e+01, 1.05374237e+01, 2.88940724e+00, 6.23738902e+04, 5.82587032e-01, 5.74167173e+04, 1.89325378e-01, 5.67456646e+04, 1.69816787e+01, 8.95590667e-02, 1.18715858e+00])}
## Scaling validation and test sets by applying the mapping learnt from the training set
sc_X_val = sc.transform(df_val[cols_to_scale])
sc_X_test = sc.transform(df_test[cols_to_scale])
## Converting val and test arrays to dataframes for re-usability
sc_X_val = pd.DataFrame(data = sc_X_val, columns = cols_to_scale)
sc_X_test = pd.DataFrame(data = sc_X_test, columns = cols_to_scale)
Feature scaling is important for algorithms like Logistic Regression and SVM. It is not necessary for tree-based models.
Features shortlisted through EDA/manual inspection and bivariate analysis :
Age, Gender, Balance, NumOfProducts, IsActiveMember, the 3 country/Geography variables, bal_per_product, tenure_age_ratio
Now, let's see whether feature selection/elimination through RFE (Recursive Feature Elimination) gives us the same list of features, additional features, or a smaller set of features.
To begin with, we'll feed all features to RFE + LogReg model.
cont_vars
cat_vars
['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary', 'Surname_enc', 'bal_per_product', 'bal_by_est_salary', 'tenure_age_ratio', 'age_surname_mean_churn']
['Gender', 'HasCrCard', 'IsActiveMember', 'country_France', 'country_Germany', 'country_Spain']
## Creating feature-set and target for RFE model
y = df_train['Exited'].values
#X = pd.concat([df_train[cat_vars], sc_X_train[cont_vars]], ignore_index=True, axis = 1)
X = df_train[cat_vars + cont_vars]
X.columns = cat_vars + cont_vars
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# for logistic regression
est = LogisticRegression()
num_features_to_select = 10
# for logistic regression
rfe = RFE(estimator=est, n_features_to_select=num_features_to_select)
rfe = rfe.fit(X.values, y)
print(rfe.support_)
print(rfe.ranking_)
[ True True True True True True False True False False True False True False False True False] [1 1 1 1 1 1 4 1 3 6 1 8 1 7 5 1 2]
# for Decision Tree
est_dt = DecisionTreeClassifier()
num_features_to_select = 10
# for decision trees
rfe_dt = RFE(estimator=est_dt, n_features_to_select=num_features_to_select)
rfe_dt = rfe_dt.fit(X.values, y)
print(rfe_dt.support_)
print(rfe_dt.ranking_)
[False False True False False False True True False False True True True True True True True] [5 7 1 6 3 8 1 1 4 2 1 1 1 1 1 1 1]
## Logistic Regression (Linear model)
mask = rfe.support_.tolist()
selected_feats = [b for a,b in zip(mask, X.columns) if a]
selected_feats
['Gender', 'HasCrCard', 'IsActiveMember', 'country_France', 'country_Germany', 'country_Spain', 'Age', 'NumOfProducts', 'Surname_enc', 'tenure_age_ratio']
## Decision Tree (Non-linear model)
mask = rfe_dt.support_.tolist()
selected_feats_dt = [b for a,b in zip(mask, X.columns) if a]
selected_feats_dt
['IsActiveMember', 'CreditScore', 'Age', 'NumOfProducts', 'EstimatedSalary', 'Surname_enc', 'bal_per_product', 'bal_by_est_salary', 'tenure_age_ratio', 'age_surname_mean_churn']
We'll train the linear models on the features selected through RFE
from sklearn.linear_model import LogisticRegression
## Importing relevant metrics
from sklearn.metrics import roc_auc_score, f1_score, recall_score, confusion_matrix, classification_report
selected_cat_vars = [x for x in selected_feats if x in cat_vars]
selected_cont_vars = [x for x in selected_feats if x in cont_vars]
## Using categorical features and scaled numerical features
X_train = np.concatenate((df_train[selected_cat_vars].values, sc_X_train[selected_cont_vars].values), axis = 1)
X_val = np.concatenate((df_val[selected_cat_vars].values, sc_X_val[selected_cont_vars].values), axis = 1)
X_test = np.concatenate((df_test[selected_cat_vars].values, sc_X_test[selected_cont_vars].values), axis = 1)
X_train.shape, X_val.shape, X_test.shape
((7920, 10), (1080, 10), (1000, 10))
# Obtaining class weights based on the class samples imbalance ratio
_, num_samples = np.unique(y_train, return_counts = True)
weights = np.max(num_samples)/num_samples
weights
num_samples
array([1. , 3.92537313])
array([6312, 1608], dtype=int64)
weights_dict = dict()
class_labels = [0,1]
for a,b in zip(class_labels,weights):
weights_dict[a] = b
weights_dict
{0: 1.0, 1: 3.925373134328358}
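Equivalently, sklearn's 'balanced' heuristic can derive class weights directly (a sketch; it uses n_samples / (n_classes * class_count), so the absolute values differ from the max-ratio weights above, but the ratio between the two classes is the same):
## Alternative: 'balanced' class weights from sklearn
from sklearn.utils.class_weight import compute_class_weight
balanced_weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_train)
dict(zip([0, 1], balanced_weights))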
## Defining model
lr = LogisticRegression(C = 1.0, penalty = 'l2', class_weight = weights_dict, n_jobs = -1)
## Fitting model
lr.fit(X_train, y_train)
LogisticRegression(class_weight={0: 1.0, 1: 3.925373134328358}, n_jobs=-1)
## Fitted model parameters
selected_cat_vars + selected_cont_vars
lr.coef_
lr.intercept_
['Gender', 'HasCrCard', 'IsActiveMember', 'country_France', 'country_Germany', 'country_Spain', 'Age', 'NumOfProducts', 'Surname_enc', 'tenure_age_ratio']
array([[-0.5190172 , -0.06938782, -0.90843476, -0.33748839, 0.58664742, -0.24918718, 0.80999582, -0.05061525, -0.0659637 , -0.05143544]])
array([0.60235927])
## Training metrics
roc_auc_score(y_train, lr.predict(X_train))
recall_score(y_train, lr.predict(X_train))
confusion_matrix(y_train, lr.predict(X_train))
print(classification_report(y_train, lr.predict(X_train)))
0.70684363354331
0.6983830845771144
array([[4515, 1797], [ 485, 1123]], dtype=int64)
              precision    recall  f1-score   support

           0       0.90      0.72      0.80      6312
           1       0.38      0.70      0.50      1608

    accuracy                           0.71      7920
   macro avg       0.64      0.71      0.65      7920
weighted avg       0.80      0.71      0.74      7920
## Validation metrics
roc_auc_score(y_val, lr.predict(X_val))
recall_score(y_val, lr.predict(X_val))
confusion_matrix(y_val, lr.predict(X_val))
print(classification_report(y_val, lr.predict(X_val)))
0.7011966306712709
0.7016806722689075
array([[590, 252], [ 71, 167]], dtype=int64)
              precision    recall  f1-score   support

           0       0.89      0.70      0.79       842
           1       0.40      0.70      0.51       238

    accuracy                           0.70      1080
   macro avg       0.65      0.70      0.65      1080
weighted avg       0.78      0.70      0.72      1080
from sklearn.svm import SVC
## Importing relevant metrics
from sklearn.metrics import roc_auc_score, f1_score, recall_score, confusion_matrix, classification_report
## Using categorical features and scaled numerical features
X_train = np.concatenate((df_train[selected_cat_vars].values, sc_X_train[selected_cont_vars].values), axis = 1)
X_val = np.concatenate((df_val[selected_cat_vars].values, sc_X_val[selected_cont_vars].values), axis = 1)
X_test = np.concatenate((df_test[selected_cat_vars].values, sc_X_test[selected_cont_vars].values), axis = 1)
X_train.shape, X_val.shape, X_test.shape
((7920, 10), (1080, 10), (1000, 10))
weights_dict = {0: 1.0, 1: 3.92}
weights_dict
{0: 1.0, 1: 3.92}
svm = SVC(C = 1.0, kernel = "linear", class_weight = weights_dict)
svm.fit(X_train, y_train)
SVC(class_weight={0: 1.0, 1: 3.92}, kernel='linear')
## Fitted model parameters
selected_cat_vars + selected_cont_vars
svm.coef_
svm.intercept_
['Gender', 'HasCrCard', 'IsActiveMember', 'country_France', 'country_Germany', 'country_Spain', 'Age', 'NumOfProducts', 'Surname_enc', 'tenure_age_ratio']
array([[-0.47122725, -0.05268599, -0.73099431, -0.3081861 , 0.55349692, -0.24531083, 0.87482056, -0.04784617, -0.05560977, -0.03824918]])
array([0.45465745])
## Training metrics
roc_auc_score(y_train, svm.predict(X_train))
recall_score(y_train, svm.predict(X_train))
confusion_matrix(y_train, svm.predict(X_train))
print(classification_report(y_train, svm.predict(X_train)))
0.7122715793655297
0.6940298507462687
array([[4611, 1701], [ 492, 1116]], dtype=int64)
              precision    recall  f1-score   support

           0       0.90      0.73      0.81      6312
           1       0.40      0.69      0.50      1608

    accuracy                           0.72      7920
   macro avg       0.65      0.71      0.66      7920
weighted avg       0.80      0.72      0.75      7920
## Validation metrics
roc_auc_score(y_val, svm.predict(X_val))
recall_score(y_val, svm.predict(X_val))
confusion_matrix(y_val, svm.predict(X_val))
print(classification_report(y_val, svm.predict(X_val)))
0.6990508792590671
0.6890756302521008
array([[597, 245], [ 74, 164]], dtype=int64)
              precision    recall  f1-score   support

           0       0.89      0.71      0.79       842
           1       0.40      0.69      0.51       238

    accuracy                           0.70      1080
   macro avg       0.65      0.70      0.65      1080
weighted avg       0.78      0.70      0.73      1080
To plot decision boundaries of classification models in a 2-D space, we first need to train our models on a 2-D feature space. The best option is to take our existing data (with > 2 features), apply a dimensionality reduction technique (like PCA) to it, and then train our models on the resulting 2-D representation.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
## Transforming the dataset using PCA
X = pca.fit_transform(X_train)
y = y_train
X_train.shape
X.shape
y.shape
(7920, 10)
(7920, 2)
(7920,)
## Checking the variance explained by the reduced features
pca.explained_variance_ratio_
array([0.2602733 , 0.18789887])
# Creating a mesh region where the boundary will be plotted
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
np.arange(y_min, y_max, 0.1))
## Fitting LR model on 2 features
lr.fit(X, y)
LogisticRegression(class_weight={0: 1.0, 1: 3.925373134328358}, n_jobs=-1)
## Fitting SVM model on 2 features
svm.fit(X,y)
SVC(class_weight={0: 1.0, 1: 3.92}, kernel='linear')
## Plotting decision boundary for LR
z1 = lr.predict(np.c_[xx.ravel(), yy.ravel()])
z1 = z1.reshape(xx.shape)
## Plotting decision boundary for SVM
z2 = svm.predict(np.c_[xx.ravel(), yy.ravel()])
z2 = z2.reshape(xx.shape)
# Displaying the result
plt.contourf(xx, yy, z1, alpha=0.4) # LR
plt.contour(xx, yy, z2, alpha=0.4, colors = 'blue') # SVM
sns.scatterplot(x=X[:,0], y=X[:,1], hue = y_train, s = 50, alpha = 0.8)
plt.title('Linear models - LogReg and SVM')
<matplotlib.contour.QuadContourSet at 0x10d651f7670>
<matplotlib.contour.QuadContourSet at 0x10d65244040>
<Axes: >
Text(0.5, 1.0, 'Linear models - LogReg and SVM')
from sklearn.tree import DecisionTreeClassifier
## Importing relevant metrics
from sklearn.metrics import roc_auc_score, f1_score, recall_score, confusion_matrix, classification_report
weights_dict = {0: 1.0, 1: 3.92}
weights_dict
{0: 1.0, 1: 3.92}
## Features selected from the RFE process
selected_feats_dt
['IsActiveMember', 'CreditScore', 'Age', 'NumOfProducts', 'EstimatedSalary', 'Surname_enc', 'bal_per_product', 'bal_by_est_salary', 'tenure_age_ratio', 'age_surname_mean_churn']
## Re-defining X_train and X_val to consider original unscaled continuous features. y_train and y_val remain unaffected
X_train = df_train[selected_feats_dt].values
X_val = df_val[selected_feats_dt].values
X_train.shape, y_train.shape
X_val.shape, y_val.shape
((7920, 10), (7920,))
((1080, 10), (1080,))
clf = DecisionTreeClassifier(criterion = 'entropy', class_weight = weights_dict, max_depth = 4, max_features = None
, min_samples_split = 25, min_samples_leaf = 15)
clf.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 1.0, 1: 3.92}, criterion='entropy', max_depth=4, min_samples_leaf=15, min_samples_split=25)
## Checking the importance of different features of the model
pd.DataFrame({'features': selected_feats_dt,
'importance': clf.feature_importances_
}).sort_values(by = 'importance', ascending=False)
features | importance | |
---|---|---|
2 | Age | 0.502526 |
3 | NumOfProducts | 0.353748 |
0 | IsActiveMember | 0.096951 |
7 | bal_by_est_salary | 0.043860 |
4 | EstimatedSalary | 0.002915 |
1 | CreditScore | 0.000000 |
5 | Surname_enc | 0.000000 |
6 | bal_per_product | 0.000000 |
8 | tenure_age_ratio | 0.000000 |
## Training metrics
roc_auc_score(y_train, clf.predict(X_train))
recall_score(y_train, clf.predict(X_train))
confusion_matrix(y_train, clf.predict(X_train))
print(classification_report(y_train, clf.predict(X_train)))
0.7502896638480601
0.7195273631840796
array([[4930, 1382], [ 451, 1157]], dtype=int64)
              precision    recall  f1-score   support

           0       0.92      0.78      0.84      6312
           1       0.46      0.72      0.56      1608

    accuracy                           0.77      7920
   macro avg       0.69      0.75      0.70      7920
weighted avg       0.82      0.77      0.79      7920
## Validation metrics
roc_auc_score(y_val, clf.predict(X_val))
recall_score(y_val, clf.predict(X_val))
confusion_matrix(y_val, clf.predict(X_val))
print(classification_report(y_val, clf.predict(X_val)))
0.7443162538174415
0.7142857142857143
array([[652, 190], [ 68, 170]], dtype=int64)
              precision    recall  f1-score   support

           0       0.91      0.77      0.83       842
           1       0.47      0.71      0.57       238

    accuracy                           0.76      1080
   macro avg       0.69      0.74      0.70      1080
weighted avg       0.81      0.76      0.78      1080
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
## Transforming the dataset using PCA
X = pca.fit_transform(X_train)
y = y_train
X_train.shape
X.shape
y.shape
(7920, 10)
(7920, 2)
(7920,)
## Checking the variance explained by the reduced features
pca.explained_variance_ratio_
array([0.51069843, 0.48930008])
# Creating a mesh region where the boundary will be plotted
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 100),
np.arange(y_min, y_max, 100))
## Fitting tree model on 2 features
clf.fit(X, y)
DecisionTreeClassifier(class_weight={0: 1.0, 1: 3.92}, criterion='entropy', max_depth=4, min_samples_leaf=15, min_samples_split=25)
## Plotting decision boundary for Decision Tree (DT)
z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
# Displaying the result
plt.contourf(xx, yy, z, alpha=0.4) # DT
sns.scatterplot(x=X[:,0], y=X[:,1], hue = y_train, s = 50, alpha = 0.8)
plt.title('Decision Tree')
<matplotlib.contour.QuadContourSet at 0x10d652ec940>
<Axes: >
Text(0.5, 1.0, 'Decision Tree')
Steps :
Automate data preparation and model run through Pipelines
Model Zoo : List of all models to compare/spot-check
Evaluate using k-fold Cross validation framework
Note : Restart the kernel, read the original dataset again, perform the train-test split, and then come directly to this section of the notebook
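Before defining the custom transformer classes, here is a minimal sketch of the pipeline + k-fold cross-validation framework using only built-in sklearn components (it assumes df_train_val and y_train_val from the earlier train-test split are in memory; the custom classes defined below are meant to replace these generic preprocessing steps with the project-specific encoding and feature engineering):
## Minimal pipeline + k-fold CV sketch with generic sklearn components
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
num_cols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
cat_cols = ['Geography', 'Gender']
prep = ColumnTransformer([('num', StandardScaler(), num_cols),
                          ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)])
pipe = Pipeline([('prep', prep),
                 ('clf', LogisticRegression(class_weight='balanced', max_iter=1000))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, df_train_val.drop(columns=['Exited']), y_train_val, scoring='f1', cv=cv)
scores.mean(), scores.std()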
from sklearn.base import BaseEstimator, TransformerMixin
class CategoricalEncoder(BaseEstimator, TransformerMixin):
"""
Encodes categorical columns using LabelEncoding, OneHotEncoding and TargetEncoding.
LabelEncoding is used for binary categorical columns
OneHotEncoding is used for columns with <= 10 distinct values
TargetEncoding is used for columns with higher cardinality (>10 distinct values)
"""
def __init__(self, cols = None, lcols = None, ohecols = None, tcols = None, reduce_df = False):
"""
Parameters
----------
cols : list of str
Columns to encode. Default is to one-hot/target/label encode all categorical columns in the DataFrame.
reduce_df : bool
Whether to use reduced degrees of freedom for encoding
(that is, add N-1 one-hot columns for a column with N
categories). E.g. for a column with categories A, B,
and C: When reduce_df is True, A=[1, 0], B=[0, 1],
and C=[0, 0]. When reduce_df is False, A=[1, 0, 0],
B=[0, 1, 0], and C=[0, 0, 1]
Default = False
"""
if isinstance(cols,str):
self.cols = [cols]
else :
self.cols = cols
if isinstance(lcols,str):
self.lcols = [lcols]
else :
self.lcols = lcols
if isinstance(ohecols,str):
self.ohecols = [ohecols]
else :
self.ohecols = ohecols
if isinstance(tcols,str):
self.tcols = [tcols]
else :
self.tcols = tcols
self.reduce_df = reduce_df
def fit(self, X, y):
"""Fit label/one-hot/target encoder to X and y
Parameters
----------
X : pandas DataFrame, shape [n_samples, n_columns]
DataFrame containing columns to encode
y : pandas Series, shape = [n_samples]
Target values.
Returns
-------
self : encoder
Returns self.
"""
# Encode all categorical cols by default
if self.cols is None:
self.cols = [c for c in X if str(X[c].dtype)=='object']
# Check columns are in X
for col in self.cols:
if col not in X:
raise ValueError('Column \''+col+'\' not in X')
# Separating out lcols, ohecols and tcols
if self.lcols is None:
self.lcols = [c for c in self.cols if X[c].nunique() <= 2]
if self.ohecols is None:
self.ohecols = [c for c in self.cols if ((X[c].nunique() > 2) & (X[c].nunique() <= 10))]
if self.tcols is None:
self.tcols = [c for c in self.cols if X[c].nunique() > 10]
## Create Label Encoding mapping
self.lmaps = dict()
for col in self.lcols:
self.lmaps[col] = dict(zip(X[col].values, X[col].astype('category').cat.codes.values))
## Create OneHot Encoding mapping
self.ohemaps = dict() #dict to store map for each column
for col in self.ohecols:
self.ohemaps[col] = []
uniques = X[col].unique()
for unique in uniques:
self.ohemaps[col].append(unique)
if self.reduce_df:
del self.ohemaps[col][-1]
## Create Target Encoding mapping
self.global_target_mean = y.mean().round(2)
self.sum_count = dict()
for col in self.tcols:
self.sum_count[col] = dict()
uniques = X[col].unique()
for unique in uniques:
ix = X[col]==unique
self.sum_count[col][unique] = (y[ix].sum(),ix.sum())
## Return the fit object
return self
def transform(self, X, y=None):
"""Perform label/one-hot/target encoding transformation.
Parameters
----------
X : pandas DataFrame, shape [n_samples, n_columns]
DataFrame containing columns to label encode
Returns
-------
pandas DataFrame
Input DataFrame with transformed columns
"""
Xo = X.copy()
## Perform label encoding transformation
for col, lmap in self.lmaps.items():
# Map the column
Xo[col] = Xo[col].map(lmap)
Xo[col].fillna(-1, inplace=True) ## Filling new values with -1
## Perform one-hot encoding transformation
for col, vals in self.ohemaps.items():
for val in vals:
new_col = col+'_'+str(val)
Xo[new_col] = (Xo[col]==val).astype('uint8')
del Xo[col]
## Perform LOO target encoding transformation
# Use normal target encoding if this is test data
if y is None:
for col in self.sum_count:
vals = np.full(X.shape[0], np.nan)
for cat, sum_count in self.sum_count[col].items():
vals[X[col]==cat] = (sum_count[0]/sum_count[1]).round(2)
Xo[col] = vals
Xo[col].fillna(self.global_target_mean, inplace=True) # Filling new values by global target mean
# LOO target encode each column
else:
for col in self.sum_count:
vals = np.full(X.shape[0], np.nan)
for cat, sum_count in self.sum_count[col].items():
ix = X[col]==cat
if sum_count[1] > 1:
vals[ix] = ((sum_count[0]-y[ix].reshape(-1,))/(sum_count[1]-1)).round(2)
else :
vals[ix] = ((y.sum() - y[ix])/(X.shape[0] - 1)).round(2) # Catering to the case where a particular
# category level occurs only once in the dataset
Xo[col] = vals
Xo[col].fillna(self.global_target_mean, inplace=True) # Filling new values by global target mean
## Return encoded DataFrame
return Xo
def fit_transform(self, X, y=None):
"""Fit and transform the data via label/one-hot/target encoding.
Parameters
----------
X : pandas DataFrame, shape [n_samples, n_columns]
DataFrame containing columns to encode
y : pandas Series, shape = [n_samples]
Target values (required!).
Returns
-------
pandas DataFrame
Input DataFrame with transformed columns
"""
return self.fit(X, y).transform(X, y)
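To make the encoder's column-routing concrete, below is a minimal, illustrative sketch on a small made-up DataFrame (not the bank data): a 2-level column is label encoded, a 3-level column is one-hot encoded, and an 11-level column is target encoded. The toy column names and values are assumptions for demonstration only.
## Illustrative only: toy data showing how CategoricalEncoder routes columns by cardinality
import numpy as np
import pandas as pd
toy_X = pd.DataFrame({'flag': np.random.choice(['yes', 'no'], size = 100),                          # 2 levels  -> label encoding
                      'colour': np.random.choice(['red', 'green', 'blue'], size = 100),             # 3 levels  -> one-hot encoding
                      'city': np.random.choice(['city_' + str(i) for i in range(11)], size = 100)}) # 11 levels -> target encoding
toy_y = np.random.randint(0, 2, size = 100)
toy_enc = CategoricalEncoder()
toy_out = toy_enc.fit_transform(toy_X, toy_y)
toy_enc.lcols, toy_enc.ohecols, toy_enc.tcols   # expected: (['flag'], ['colour'], ['city'])
toy_out.head()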
class AddFeatures(BaseEstimator):
"""
Add new, engineered features using original categorical and numerical features of the DataFrame
"""
def __init__(self, eps = 1e-6):
"""
Parameters
----------
eps : A small value to avoid divide by zero error. Default value is 0.000001
"""
self.eps = eps
def fit(self, X, y=None):
return self
def transform(self, X):
"""
Parameters
----------
X : pandas DataFrame, shape [n_samples, n_columns]
DataFrame containing base columns using which new interaction-based features can be engineered
"""
Xo = X.copy()
## Add 4 new columns - bal_per_product, bal_by_est_salary, tenure_age_ratio, age_surname_enc
Xo['bal_per_product'] = Xo.Balance/(Xo.NumOfProducts + self.eps)
Xo['bal_by_est_salary'] = Xo.Balance/(Xo.EstimatedSalary + self.eps)
Xo['tenure_age_ratio'] = Xo.Tenure/(Xo.Age + self.eps)
Xo['age_surname_enc'] = np.sqrt(Xo.Age) * Xo.Surname_enc
## Returning the updated dataframe
return Xo
def fit_transform(self, X, y=None):
"""
Parameters
----------
X : pandas DataFrame, shape [n_samples, n_columns]
DataFrame containing base columns using which new interaction-based features can be engineered
"""
return self.fit(X,y).transform(X)
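As a quick illustration of the engineered ratios, the sketch below applies AddFeatures to two made-up rows (values are arbitrary and not from the dataset; Surname_enc would normally come from the target encoding of the Surname column).
## Illustrative only: AddFeatures on two made-up rows
demo = pd.DataFrame({'Balance': [120000.0, 0.0], 'NumOfProducts': [2, 1], 'EstimatedSalary': [60000.0, 45000.0]
                     , 'Tenure': [5, 2], 'Age': [40, 30], 'Surname_enc': [0.20, 0.35]})
AddFeatures().fit_transform(demo)
## Adds bal_per_product (~60000, 0), bal_by_est_salary (~2.0, 0), tenure_age_ratio (~0.125, ~0.067)
## and age_surname_enc = sqrt(Age) * Surname_enc (~1.26, ~1.92)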
class CustomScaler(BaseEstimator, TransformerMixin):
"""
A custom standard scaler class with the ability to apply scaling on selected columns
"""
def __init__(self, scale_cols = None):
"""
Parameters
----------
scale_cols : list of str
Columns on which to perform scaling and normalization. Default is to scale all numerical columns
"""
self.scale_cols = scale_cols
def fit(self, X, y=None):
"""
Parameters
----------
X : pandas DataFrame, shape [n_samples, n_columns]
DataFrame containing columns to scale
"""
# Scaling all non-categorical columns if user doesn't provide the list of columns to scale
if self.scale_cols is None:
self.scale_cols = [c for c in X if ((str(X[c].dtype).find('float') != -1) or (str(X[c].dtype).find('int') != -1))]
## Create mapping corresponding to scaling and normalization
self.maps = dict()
for col in self.scale_cols:
self.maps[col] = dict()
self.maps[col]['mean'] = np.mean(X[col].values).round(2)
self.maps[col]['std_dev'] = np.std(X[col].values).round(2)
# Return fit object
return self
def transform(self, X):
"""
Parameters
----------
X : pandas DataFrame, shape [n_samples, n_columns]
DataFrame containing columns to scale
"""
Xo = X.copy()
## Map transformation to respective columns
for col in self.scale_cols:
Xo[col] = (Xo[col] - self.maps[col]['mean']) / self.maps[col]['std_dev']
# Return scaled and normalized DataFrame
return Xo
def fit_transform(self, X, y=None):
"""
Parameters
----------
X : pandas DataFrame, shape [n_samples, n_columns]
DataFrame containing columns to scale
"""
# Fit and return transformed dataframe
return self.fit(X).transform(X)
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
## Importing relevant metrics
from sklearn.metrics import roc_auc_score, f1_score, recall_score, confusion_matrix, classification_report
X = df_train.drop(columns = ['Exited'], axis = 1)
X_val = df_val.drop(columns = ['Exited'], axis = 1)
cols_to_scale = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'bal_per_product', 'bal_by_est_salary', 'tenure_age_ratio'
,'age_surname_enc']
weights_dict = {0 : 1.0, 1 : 3.92}
clf = DecisionTreeClassifier(criterion = 'entropy', class_weight = weights_dict, max_depth = 4, max_features = None
, min_samples_split = 25, min_samples_leaf = 15)
model = Pipeline(steps = [('categorical_encoding', CategoricalEncoder()),
('add_new_features', AddFeatures()),
('standard_scaling', CustomScaler(cols_to_scale)),
('classifier', clf)
])
# Fit pipeline with training data
model.fit(X,y_train)
Pipeline(steps=[('categorical_encoding', CategoricalEncoder(cols=[], lcols=[], ohecols=[], tcols=[])), ('add_new_features', AddFeatures()), ('standard_scaling', CustomScaler(scale_cols=['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'bal_per_product', 'bal_by_est_salary', 'tenure_age_ratio', 'age_surname_enc'])), ('classifier', DecisionTreeClassifier(class_weight={0: 1.0, 1: 3.92}, criterion='entropy', max_depth=4, min_samples_leaf=15, min_samples_split=25))])
# Predict target values on val data
val_preds = model.predict(X_val)
## Validation metrics
roc_auc_score(y_val, val_preds)
recall_score(y_val, val_preds)
confusion_matrix(y_val, val_preds)
print(classification_report(y_val, val_preds))
0.7477394758378411
0.7436974789915967
array([[633, 209],
       [ 61, 177]], dtype=int64)
              precision    recall  f1-score   support

           0       0.91      0.75      0.82       842
           1       0.46      0.74      0.57       238

    accuracy                           0.75      1080
   macro avg       0.69      0.75      0.70      1080
weighted avg       0.81      0.75      0.77      1080
Models : RF, Extra Trees, LGBM, XGB, Naive Bayes (Gaussian/Multinomial/Complement/Bernoulli), kNN
from sklearn.model_selection import cross_val_score
## Preparing data and a few common model parameters
X = df_train.drop(columns = ['Exited'], axis = 1)
y = y_train.ravel()
weights_dict = {0 : 1.0, 1 : 3.93}
_, num_samples = np.unique(y_train, return_counts = True)
weight = (num_samples[0]/num_samples[1]).round(2)
weight
cols_to_scale = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'bal_per_product', 'bal_by_est_salary', 'tenure_age_ratio'
,'age_surname_enc']
3.93
!pip install lightgbm
Requirement already satisfied: lightgbm in c:\users\biswas kumar\anaconda3\lib\site-packages (4.1.0)
Requirement already satisfied: numpy in c:\users\biswas kumar\anaconda3\lib\site-packages (from lightgbm) (1.23.5)
Requirement already satisfied: scipy in c:\users\biswas kumar\anaconda3\lib\site-packages (from lightgbm) (1.10.0)
## Importing the models to be tried out
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB, BernoulliNB
## Preparing a list of models to try out in the spot-checking process
def model_zoo(models = dict()):
# Tree models
for n_trees in [21, 1001]:
models['rf_' + str(n_trees)] = RandomForestClassifier(n_estimators = n_trees, n_jobs = -1, criterion = 'entropy'
, class_weight = weights_dict, max_depth = 6, max_features = 0.6
, min_samples_split = 30, min_samples_leaf = 20)
models['lgb_' + str(n_trees)] = LGBMClassifier(boosting_type='dart', num_leaves=31, max_depth= 6, learning_rate=0.1
, n_estimators=n_trees, class_weight=weights_dict, min_child_samples=20
, colsample_bytree=0.6, reg_alpha=0.3, reg_lambda=1.0, n_jobs=- 1
, importance_type = 'gain')
models['xgb_' + str(n_trees)] = XGBClassifier(objective='binary:logistic', n_estimators = n_trees, max_depth = 6
, learning_rate = 0.03, n_jobs = -1, colsample_bytree = 0.6
, reg_alpha = 0.3, reg_lambda = 0.1, scale_pos_weight = weight)
models['et_' + str(n_trees)] = ExtraTreesClassifier(n_estimators=n_trees, criterion = 'entropy', max_depth = 6
, max_features = 0.6, n_jobs = -1, class_weight = weights_dict
, min_samples_split = 30, min_samples_leaf = 20)
# kNN models
for n in [3,5,11]:
models['knn_' + str(n)] = KNeighborsClassifier(n_neighbors=n)
# Naive-Bayes models
models['gauss_nb'] = GaussianNB()
models['multi_nb'] = MultinomialNB()
models['compl_nb'] = ComplementNB()
models['bern_nb'] = BernoulliNB()
return models
## Automation of data preparation and model run through pipelines
def make_pipeline(model):
'''
Creates pipeline for the model passed as the argument. Uses standard scaling only in case of kNN models.
Ignores scaling step for tree/Naive Bayes models
'''
if (str(model).find('KNeighborsClassifier') != -1):
pipe = Pipeline(steps = [('categorical_encoding', CategoricalEncoder()),
('add_new_features', AddFeatures()),
('standard_scaling', CustomScaler(cols_to_scale)),
('classifier', model)
])
else :
pipe = Pipeline(steps = [('categorical_encoding', CategoricalEncoder()),
('add_new_features', AddFeatures()),
('classifier', model)
])
return pipe
## Run/Evaluate all 15 models using KFold cross-validation (5 folds)
def evaluate_models(X, y, models, folds = 5, metric = 'recall'):
results = dict()
for name, model in models.items():
# Evaluate model through automated pipelines
pipeline = make_pipeline(model)
scores = cross_val_score(pipeline, X, y, cv = folds, scoring = metric, n_jobs = -1)
# Store results of the evaluated model
results[name] = scores
mu, sigma = np.mean(scores), np.std(scores)
# Printing individual model results
print('Model {}: mean = {}, std_dev = {}'.format(name, mu, sigma))
return results
## Spot-checking in action
models = model_zoo()
print('Recall metric')
results = evaluate_models(X, y , models, metric = 'recall')
print('F1-score metric')
results = evaluate_models(X, y , models, metric = 'f1')
Recall metric
Model rf_21: mean = 0.7543391188251001, std_dev = 0.021237798796949578
Model lgb_21: mean = 0.758073566687951, std_dev = 0.01581181286136521
Model xgb_21: mean = 0.7705075366188734, std_dev = 0.019166612799032336
Model et_21: mean = 0.7369458795301949, std_dev = 0.016529085840720433
Model rf_1001: mean = 0.7468721580464774, std_dev = 0.02608196582703169
Model lgb_1001: mean = 0.6710106228594647, std_dev = 0.012282488678580759
Model xgb_1001: mean = 0.6529788510284245, std_dev = 0.017773601087205434
Model et_1001: mean = 0.7363170217294558, std_dev = 0.007236211597055009
Model knn_3: mean = 0.09204543255741957, std_dev = 0.013753522989051544
Model knn_5: mean = 0.0565855923840483, std_dev = 0.008152603714140307
Model knn_11: mean = 0.007466960778622704, std_dev = 0.0031826579873533473
Model gauss_nb: mean = 0.03360229097734177, std_dev = 0.01482587135145166
Model multi_nb: mean = 0.6480408660823127, std_dev = 0.024469176890743363
Model compl_nb: mean = 0.6480408660823127, std_dev = 0.024469176890743363
Model bern_nb: mean = 0.31030552814380524, std_dev = 0.022201596952259223
F1-score metric
Model rf_21: mean = 0.6233211033279307, std_dev = 0.01857756196774385
Model lgb_21: mean = 0.6454484737598223, std_dev = 0.011268584479002735
Model xgb_21: mean = 0.6336898650141173, std_dev = 0.010999855221178339
Model et_21: mean = 0.5864867673151279, std_dev = 0.010212561221947775
Model rf_1001: mean = 0.629802844677254, std_dev = 0.013924654270513634
Model lgb_1001: mean = 0.6702244150165039, std_dev = 0.011312282031862549
Model xgb_1001: mean = 0.6679951476796304, std_dev = 0.017841513864945743
Model et_1001: mean = 0.5917074657460765, std_dev = 0.006681924670409543
Model knn_3: mean = 0.12098275495330124, std_dev = 0.016279820858690654
Model knn_5: mean = 0.08792764345516318, std_dev = 0.012407550035871183
Model knn_11: mean = 0.014162059771688612, std_dev = 0.005849379228489272
Model gauss_nb: mean = 0.0590496361950227, std_dev = 0.02409084296102369
Model multi_nb: mean = 0.347918726486763, std_dev = 0.011996279392046729
Model compl_nb: mean = 0.347918726486763, std_dev = 0.011996279392046729
Model bern_nb: mean = 0.34121749133649887, std_dev = 0.016767819528172967
Based on the relevant metric, a suitable model can be chosen for further hyperparameter tuning.
LightGBM is chosen for further hyperparameter tuning because it has the best performance on the F1-score metric and is a close second on the Recall metric.
RandomizedSearchCV vs GridSearchCV
Grid search tunes hyperparameters exhaustively and therefore more precisely, which can translate into better model performance, but at a much higher computational cost. An intelligent tuning strategy, such as a coarse randomized search followed by a narrow grid search around the best candidates (as done below), can cut the time taken by grid search by a large factor.
We will optimize on the F1 metric; as seen earlier, roughly 75% Recall was already achievable with near-default parameters.
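As a rough, back-of-envelope comparison (assuming the `parameters` grid defined below and 5-fold CV), an exhaustive grid search would fit every parameter combination, while the randomized search caps the number of candidates at n_iter = 20:
## Illustrative only: approximate number of model fits for exhaustive grid search vs randomized search
import numpy as np
param_list_sizes = [7, 4, 3, 5, 3, 4, 5, 4]            # sizes of the 8 lists in the `parameters` dict below
full_grid_candidates = int(np.prod(param_list_sizes))  # 100800 parameter combinations
full_grid_candidates * 5                                # 504000 fits with 5-fold CV
20 * 5                                                  # 100 fits for RandomizedSearchCV with n_iter = 20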
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from lightgbm import LGBMClassifier
## Preparing data and a few common model parameters
# Unscaled features will be used since it's a tree model
X_train = df_train.drop(columns = ['Exited'], axis = 1)
X_val = df_val.drop(columns = ['Exited'], axis = 1)
X_train.shape, y_train.shape
X_val.shape, y_val.shape
((7920, 19), (7920,))
((1080, 19), (1080,))
lgb = LGBMClassifier(boosting_type = 'dart', min_child_samples = 20, n_jobs = - 1, importance_type = 'gain', num_leaves = 31)
model = Pipeline(steps = [('categorical_encoding', CategoricalEncoder()),
('add_new_features', AddFeatures()),
('classifier', lgb)
])
## Exhaustive list of parameters
parameters = {'classifier__n_estimators':[10, 21, 51, 100, 201, 350, 501]
,'classifier__max_depth': [3, 4, 6, 9]
,'classifier__num_leaves':[7, 15, 31]
,'classifier__learning_rate': [0.03, 0.05, 0.1, 0.5, 1]
,'classifier__colsample_bytree': [0.3, 0.6, 0.8]
,'classifier__reg_alpha': [0, 0.3, 1, 5]
,'classifier__reg_lambda': [0.1, 0.5, 1, 5, 10]
,'classifier__class_weight': [{0:1,1:1.0}, {0:1,1:1.96}, {0:1,1:3.0}, {0:1,1:3.93}]
}
%%capture
search = RandomizedSearchCV(model, parameters, n_iter=20, cv=5, scoring='f1')
search.fit(X_train, y_train.ravel())
search.best_params_
search.best_score_
{'classifier__reg_lambda': 0.1, 'classifier__reg_alpha': 5, 'classifier__num_leaves': 15, 'classifier__n_estimators': 350, 'classifier__max_depth': 6, 'classifier__learning_rate': 0.05, 'classifier__colsample_bytree': 0.6, 'classifier__class_weight': {0: 1, 1: 1.96}}
0.6839981717351549
%%capture
search.cv_results_
## Current list of parameters
parameters = {'classifier__n_estimators':[201]
,'classifier__max_depth': [6]
,'classifier__num_leaves': [63]
,'classifier__learning_rate': [0.1]
,'classifier__colsample_bytree': [0.6, 0.8]
,'classifier__reg_alpha': [0, 1, 10]
,'classifier__reg_lambda': [0.1, 1, 5]
,'classifier__class_weight': [{0:1,1:3.0}]
}
%%capture
grid = GridSearchCV(model, parameters, cv = 5, scoring = 'f1', n_jobs = -1)
grid.fit(X_train, y_train.ravel())
grid.best_params_
grid.best_score_
{'classifier__class_weight': {0: 1, 1: 3.0}, 'classifier__colsample_bytree': 0.6, 'classifier__learning_rate': 0.1, 'classifier__max_depth': 6, 'classifier__n_estimators': 201, 'classifier__num_leaves': 63, 'classifier__reg_alpha': 0, 'classifier__reg_lambda': 0.1}
0.6824070101591271
%%capture
grid.cv_results_
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, f1_score, recall_score, confusion_matrix, classification_report
import joblib
## Re-defining X_train and X_val to consider original unscaled continuous features. y_train and y_val remain unaffected
X_train = df_train.drop(columns = ['Exited'], axis = 1)
X_val = df_val.drop(columns = ['Exited'], axis = 1)
X_train.shape, y_train.shape
X_val.shape, y_val.shape
((7920, 19), (7920,))
((1080, 19), (1080,))
best_f1_lgb = LGBMClassifier(boosting_type = 'dart', class_weight = {0: 1, 1: 3.0}, min_child_samples = 20, n_jobs = - 1
, importance_type = 'gain', max_depth = 6, num_leaves = 63, colsample_bytree = 0.6, learning_rate = 0.1
, n_estimators = 201, reg_alpha = 1.0, reg_lambda = 1.0)
model = Pipeline(steps = [('categorical_encoding', CategoricalEncoder()),
('add_new_features', AddFeatures()),
('classifier', best_f1_lgb)
])
# In order to avoid printing iterations output in notebook for model.fit, we will follow these steps
import sys
from io import StringIO
## Save the current sys.stdout
original_stdout = sys.stdout
## Redirect sys.stdout to a StringIO object
sys.stdout = StringIO()
## Fitting final model on train dataset
model.fit(X_train, y_train)
## Restore the original sys.stdout
sys.stdout = original_stdout
Pipeline(steps=[('categorical_encoding', CategoricalEncoder(cols=[], lcols=[], ohecols=[], tcols=[])), ('add_new_features', AddFeatures()), ('classifier', LGBMClassifier(boosting_type='dart', class_weight={0: 1, 1: 3.0}, colsample_bytree=0.6, importance_type='gain', max_depth=6, n_estimators=201, n_jobs=-1, num_leaves=63, reg_alpha=1.0, reg_lambda=1.0))])
# Predict target probabilities
val_probs = model.predict_proba(X_val)[:,1]
# Predict target values on val data
val_preds = np.where(val_probs > 0.45, 1, 0) # The probability threshold can be tweaked
sns.boxplot(x=y_val.ravel(),y= val_probs)
<Axes: >
## Validation metrics
roc_auc_score(y_val, val_preds)
recall_score(y_val, val_preds)
confusion_matrix(y_val, val_preds)
print(classification_report(y_val, val_preds))
0.7563823629214156
0.6386554621848739
array([[736, 106],
       [ 86, 152]], dtype=int64)
              precision    recall  f1-score   support

           0       0.90      0.87      0.88       842
           1       0.59      0.64      0.61       238

    accuracy                           0.82      1080
   macro avg       0.74      0.76      0.75      1080
weighted avg       0.83      0.82      0.82      1080
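The 0.45 probability threshold above was picked manually. A small, illustrative sweep like the one below (reusing val_probs and y_val from the cells above) can help choose a threshold that best trades off precision and recall for the business:
## Illustrative only: sweep candidate probability thresholds on the validation set
from sklearn.metrics import precision_score
for t in np.arange(0.30, 0.71, 0.05):
    preds_t = np.where(val_probs > t, 1, 0)
    print('threshold = {:.2f} : precision = {:.3f}, recall = {:.3f}, f1 = {:.3f}'.format(
        t, precision_score(y_val, preds_t), recall_score(y_val, preds_t), f1_score(y_val, preds_t)))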
## Save model object
joblib.dump(model, 'final_churn_model.sav')
['final_churn_model.sav']
import shap
shap.initjs()
ce = CategoricalEncoder()
af = AddFeatures()
X = ce.fit_transform(X_train, y_train)
X = af.transform(X)
X.shape
X.sample(5)
(7920, 20)
RowNumber | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | country_France | country_Germany | country_Spain | Surname_enc | bal_per_product | bal_by_est_salary | tenure_age_ratio | age_surname_mean_churn | age_surname_enc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7133 | 4064 | 15575691 | 689 | 0 | 58 | 5 | 0.00 | 2 | 0 | 1 | 49848.86 | 1.0 | 0.0 | 0.0 | 0.076923 | 0.000000 | 0.000000 | 0.086207 | 0.585829 | 0.585829 |
6234 | 9411 | 15734659 | 640 | 0 | 46 | 5 | 107978.40 | 2 | 1 | 0 | 155876.06 | 0.0 | 1.0 | 0.0 | 0.000000 | 53989.173005 | 0.692720 | 0.108696 | 0.000000 | 0.000000 |
6002 | 4885 | 15569274 | 678 | 1 | 49 | 2 | 116933.11 | 1 | 1 | 0 | 195053.58 | 0.0 | 1.0 | 0.0 | 0.181818 | 116932.993067 | 0.599492 | 0.040816 | 1.272727 | 1.272727 |
513 | 5262 | 15814022 | 714 | 0 | 26 | 9 | 89928.99 | 1 | 1 | 0 | 46203.31 | 1.0 | 0.0 | 0.0 | 0.203056 | 89928.900071 | 1.946375 | 0.346154 | 1.035386 | 1.035386 |
1283 | 6906 | 15754012 | 687 | 0 | 35 | 1 | 110752.15 | 2 | 1 | 1 | 47921.22 | 1.0 | 0.0 | 0.0 | 0.203056 | 55376.047312 | 2.311130 | 0.028571 | 1.201295 | 1.201295 |
#best_f1_lgb.fit(X, y_train)
# In order to avoid fit iterations output in notebook-
import sys
from io import StringIO
## Save the current sys.stdout
original_stdout = sys.stdout
## Redirect sys.stdout to a StringIO object
sys.stdout = StringIO()
## Fitting final model on train dataset
best_f1_lgb.fit(X, y_train)
## Restore the original sys.stdout
sys.stdout = original_stdout
LGBMClassifier(boosting_type='dart', class_weight={0: 1, 1: 3.0}, colsample_bytree=0.6, importance_type='gain', max_depth=6, n_estimators=201, n_jobs=-1, num_leaves=63, reg_alpha=1.0, reg_lambda=1.0)
explainer = shap.TreeExplainer(best_f1_lgb)
X.head(10)
RowNumber | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | country_France | country_Germany | country_Spain | Surname_enc | bal_per_product | bal_by_est_salary | tenure_age_ratio | age_surname_mean_churn | age_surname_enc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4563 | 15795895 | 678 | 1 | 36 | 1 | 117864.85 | 2 | 1 | 0 | 27619.06 | 0.0 | 1.0 | 0.0 | 0.000000 | 58932.395534 | 4.267519 | 0.027778 | 0.000000 | 0.000000 |
1 | 6499 | 15770405 | 613 | 0 | 27 | 5 | 125167.74 | 1 | 1 | 0 | 199104.52 | 1.0 | 0.0 | 0.0 | 0.000000 | 125167.614832 | 0.628653 | 0.185185 | 0.000000 | 0.000000 |
2 | 6073 | 15803908 | 628 | 1 | 45 | 9 | 0.00 | 2 | 1 | 1 | 96862.56 | 1.0 | 0.0 | 0.0 | 0.222222 | 0.000000 | 0.000000 | 0.200000 | 1.490712 | 1.490712 |
3 | 5814 | 15763515 | 513 | 1 | 30 | 5 | 0.00 | 2 | 1 | 0 | 162523.66 | 1.0 | 0.0 | 0.0 | 0.300000 | 0.000000 | 0.000000 | 0.166667 | 1.643168 | 1.643168 |
4 | 7408 | 15766663 | 639 | 1 | 22 | 4 | 0.00 | 2 | 1 | 0 | 28188.96 | 1.0 | 0.0 | 0.0 | 0.500000 | 0.000000 | 0.000000 | 0.181818 | 2.345208 | 2.345208 |
5 | 5045 | 15789498 | 562 | 1 | 30 | 3 | 111099.79 | 2 | 0 | 0 | 140650.19 | 1.0 | 0.0 | 0.0 | 0.307692 | 55549.867225 | 0.789901 | 0.100000 | 1.685300 | 1.685300 |
6 | 973 | 15605918 | 635 | 1 | 43 | 5 | 78992.75 | 2 | 0 | 0 | 153265.31 | 0.0 | 1.0 | 0.0 | 0.222222 | 39496.355252 | 0.515399 | 0.116279 | 1.457209 | 1.457209 |
7 | 5986 | 15702145 | 705 | 1 | 33 | 7 | 68423.89 | 1 | 1 | 1 | 64872.55 | 0.0 | 0.0 | 1.0 | 0.203056 | 68423.821576 | 1.054743 | 0.212121 | 1.166468 | 1.166468 |
8 | 9316 | 15653110 | 694 | 1 | 42 | 8 | 133767.19 | 1 | 1 | 0 | 36405.21 | 1.0 | 0.0 | 0.0 | 0.000000 | 133767.056233 | 3.674397 | 0.190476 | 0.000000 | 0.000000 |
9 | 9825 | 15658980 | 711 | 1 | 26 | 9 | 128793.63 | 1 | 1 | 0 | 19262.05 | 0.0 | 1.0 | 0.0 | 0.000000 | 128793.501206 | 6.686393 | 0.346154 | 0.000000 | 0.000000 |
row_num = 7
shap_vals = explainer.shap_values(X.iloc[row_num].values.reshape(1,-1))
#base value
explainer.expected_value
[1.1359195912059852, -1.1359195912059852]
## Explain single prediction
shap.force_plot(explainer.expected_value[1], shap_vals[1], X.iloc[row_num], link = 'logit')
## Check probability predictions through the model
pred_probs = best_f1_lgb.predict_proba(X)[:,1]
pred_probs[row_num]
0.057789152169328895
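As a sanity check (illustrative; the two numbers may differ slightly, especially with 'dart' boosting), the SHAP base value plus the sum of the row's SHAP contributions is in log-odds space, so passing it through a sigmoid should roughly reproduce the model's predicted probability for that row. This is also why link = 'logit' is used in the force plot above.
## Illustrative only: SHAP additivity check in log-odds space for the chosen row
raw_log_odds = explainer.expected_value[1] + shap_vals[1].sum()
1 / (1 + np.exp(-raw_log_odds))   # should be close to pred_probs[row_num] printed above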
## Explain global patterns/ summary stats
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
Here, we'll use df_test as the unseen, future data
import joblib
## Load model object
model = joblib.load('final_churn_model.sav')
X_test = df_test.drop(columns = ['Exited'], axis = 1)
X_test.shape
y_test.shape
(1000, 19)
(1000,)
## Predict target probabilities
test_probs = model.predict_proba(X_test)[:,1]
## Predict target values on test data
test_preds = np.where(test_probs > 0.45, 1, 0) # Flexibility to tweak the probability threshold
#test_preds = model.predict(X_test)
sns.boxplot(x=y_test.ravel(), y=test_probs)
<Axes: >
## Test set metrics
roc_auc_score(y_test, test_preds)
recall_score(y_test, test_preds)
confusion_matrix(y_test, test_preds)
print(classification_report(y_test, test_preds))
0.7473870527249077
0.6282722513089005
array([[701, 108],
       [ 71, 120]], dtype=int64)
              precision    recall  f1-score   support

           0       0.91      0.87      0.89       809
           1       0.53      0.63      0.57       191

    accuracy                           0.82      1000
   macro avg       0.72      0.75      0.73      1000
weighted avg       0.84      0.82      0.83      1000
## Adding predictions and their probabilities in the original test dataframe
test = df_test.copy()
test['predictions'] = test_preds
test['pred_probabilities'] = test_probs
test.sample(10)
RowNumber | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | country_France | country_Germany | country_Spain | Surname_enc | bal_per_product | bal_by_est_salary | tenure_age_ratio | age_surname_mean_churn | predictions | pred_probabilities | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
220 | 7842 | 15789563 | 706 | 0 | 46 | 7 | 111288.18 | 1 | 1 | 1 | 149170.25 | 1 | 0.0 | 1.0 | 0.0 | 0.200000 | 111288.068712 | 0.746048 | 0.152174 | 1.356466 | 1 | 0.776129 |
244 | 1374 | 15771942 | 528 | 0 | 46 | 9 | 135555.66 | 1 | 1 | 0 | 133146.03 | 1 | 0.0 | 1.0 | 0.0 | 0.000000 | 135555.524444 | 1.018098 | 0.195652 | 0.000000 | 1 | 0.825677 |
285 | 6754 | 15568449 | 661 | 1 | 38 | 7 | 143006.70 | 1 | 1 | 1 | 15650.89 | 0 | 0.0 | 0.0 | 1.0 | 0.200000 | 143006.556993 | 9.137289 | 0.184211 | 1.232883 | 0 | 0.124975 |
416 | 264 | 15673693 | 682 | 0 | 26 | 0 | 110654.02 | 1 | 0 | 1 | 111879.21 | 0 | 1.0 | 0.0 | 0.0 | 0.203030 | 110653.909346 | 0.989049 | 0.000000 | 1.035255 | 0 | 0.037086 |
25 | 9921 | 15673020 | 678 | 0 | 49 | 3 | 204510.94 | 1 | 0 | 1 | 738.88 | 1 | 1.0 | 0.0 | 0.0 | 0.250000 | 204510.735489 | 276.785053 | 0.061224 | 1.750000 | 1 | 0.662824 |
567 | 5823 | 15671351 | 624 | 1 | 35 | 2 | 0.00 | 2 | 1 | 0 | 87310.59 | 0 | 0.0 | 0.0 | 1.0 | 0.285714 | 0.000000 | 0.000000 | 0.057143 | 1.690309 | 0 | 0.058693 |
138 | 1513 | 15586974 | 656 | 1 | 39 | 10 | 0.00 | 2 | 1 | 1 | 98894.64 | 0 | 1.0 | 0.0 | 0.0 | 0.203030 | 0.000000 | 0.000000 | 0.256410 | 1.267924 | 0 | 0.019151 |
59 | 2593 | 15658956 | 505 | 1 | 40 | 6 | 47869.69 | 2 | 1 | 1 | 155061.97 | 0 | 0.0 | 1.0 | 0.0 | 0.166667 | 23934.833033 | 0.308713 | 0.150000 | 1.054093 | 0 | 0.175242 |
462 | 1209 | 15616451 | 697 | 0 | 47 | 6 | 128252.66 | 1 | 1 | 1 | 168053.40 | 0 | 1.0 | 0.0 | 0.0 | 0.210526 | 128252.531747 | 0.763166 | 0.127660 | 1.443296 | 1 | 0.568420 |
886 | 6350 | 15699507 | 542 | 0 | 25 | 7 | 0.00 | 2 | 0 | 1 | 82393.08 | 0 | 1.0 | 0.0 | 0.0 | 0.203030 | 0.000000 | 0.000000 | 0.280000 | 1.015152 | 0 | 0.014402 |
Listing customers with a predicted churn probability higher than 70%. These are the customers who can be targeted immediately with retention interventions
high_churn_list = test[test.pred_probabilities > 0.7].sort_values(by = ['pred_probabilities'], ascending = False
).reset_index().drop(columns = ['index', 'Exited', 'predictions'], axis = 1)
high_churn_list.shape
high_churn_list.head()
(103, 20)
RowNumber | CustomerId | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | country_France | country_Germany | country_Spain | Surname_enc | bal_per_product | bal_by_est_salary | tenure_age_ratio | age_surname_mean_churn | pred_probabilities | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2615 | 15640846 | 546 | 0 | 58 | 3 | 106458.31 | 4 | 1 | 0 | 128881.87 | 0.0 | 1.0 | 0.0 | 0.000000 | 26614.570846 | 0.826015 | 0.051724 | 0.000000 | 0.991885 |
1 | 8104 | 15740223 | 479 | 1 | 51 | 1 | 107714.74 | 3 | 1 | 0 | 86128.21 | 0.0 | 1.0 | 0.0 | 0.333333 | 35904.901365 | 1.250633 | 0.019608 | 2.380476 | 0.988333 |
2 | 377 | 15583456 | 745 | 1 | 45 | 10 | 117231.63 | 3 | 1 | 1 | 122381.02 | 0.0 | 1.0 | 0.0 | 0.250000 | 39077.196974 | 0.957923 | 0.222222 | 1.677051 | 0.976850 |
3 | 1255 | 15610383 | 628 | 0 | 46 | 1 | 46870.43 | 4 | 1 | 0 | 31272.14 | 1.0 | 0.0 | 0.0 | 1.000000 | 11717.604571 | 1.498792 | 0.021739 | 6.782330 | 0.973575 |
4 | 765 | 15672056 | 710 | 1 | 43 | 2 | 140080.32 | 3 | 1 | 1 | 157908.19 | 0.0 | 1.0 | 0.0 | 0.000000 | 46693.424436 | 0.887100 | 0.046512 | 0.000000 | 0.967526 |
high_churn_list.to_csv('high_churn_list.csv', index = False)
Based on business requirements, a prioritization matrix can be defined wherein certain segments of customers are targeted first. These segments can be derived from insights in the data or from the business teams' requirements. For example, male customers who are active members, hold a credit card and are based in Germany could be prioritized first if the business expects the maximum ROI from retaining them.
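As an illustration only, such a segment could be pulled out of high_churn_list with a simple filter. The encodings below are assumptions (e.g. Gender == 1 standing for 'Male' should be verified against the fitted encoder's label-encoding map), and the output file name is hypothetical:
## Illustrative only: filter the high-churn list down to one priority segment
priority_segment = high_churn_list[(high_churn_list['Gender'] == 1)             # assumed to encode 'Male'; verify against the encoder's lmaps
                                   & (high_churn_list['IsActiveMember'] == 1)
                                   & (high_churn_list['HasCrCard'] == 1)
                                   & (high_churn_list['country_Germany'] == 1.0)
                                  ].sort_values(by = 'pred_probabilities', ascending = False)
priority_segment.shape
priority_segment.to_csv('priority_churn_segment.csv', index = False)   # hypothetical file name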