Lend and still be Profitable…

Janardhanan a r
23 min read · Jan 20, 2021

Objective

This is a binary classification problem with two class labels (1 — client with payment difficulties, 0 — all other cases).

Table of Contents

1) Introduction

2) Business problem

3) Mapping to ML problem

4) Introduction to Datasets

5) Existing approaches

6) First Cut Approach

7) Exploratory Data Analysis (EDA)

8) Feature engineering

9) Model Explanation

10) Results

11) Conclusion and Future Work

12) Profile

13) References

1) Introduction

This blog is about how to apply technological advances from the field of machine learning to a sector that serves as one of the main pillars of society: the banking sector.

Large industries and ordinary individuals alike fulfil their dreams and wishes by availing loans from banks. If customers don't pay back, it has a cascading, downward-spiralling effect and can even lead to the bank closing down its operations. To avoid these negative effects, banks are very careful about which customers they lend money to.

Banks have long used systems that identify and classify customers who will or will not repay their loans. As stable technological advances became available, banks adopted them to assess the creditworthiness of a customer. Now it is machine learning that is going to help the bank make that decision.

2) Business problem

Machine learning is making inroads into business applications. Many business problems have been solved, many solutions have failed, and others are still waiting to be conquered. The financial sector is no stranger to its systems being migrated to computer applications, and this time the technology is machine learning. In this case study we solve a problem in the banking sector related to lending money to customers and collecting it back. The bank has to assess the repaying capacity of the customer before it lends out its most valuable asset: money.

How does the bank evaluate the risk and repaying capacity of the customer from the humongous amount of data the bank collects about the customer?

The challenges this question poses are the problem we will try to solve by applying machine learning algorithms.

a) Use Case

Banks collect vast amounts of data: personal information including income, number of children and family members, payment history, and even credit history from other banks. Machine learning algorithms can quickly detect any pattern change in these tons of data. Customers who don't get credit come up with inventive ways to circumvent loan-processing systems, and machine learning systems can adapt to and detect these new changes more rapidly than traditional systems.

b) Machine learning models

  • analyze huge volumes of data
  • understand linear as well as non-linear relationships in the data
  • adapt to changes in data patterns, provided the productionized models are updated regularly

3) Mapping to ML problem

It is not that only ML can solve this problem; even before the advent of ML it was being solved by computer applications. But the banks are now gathering such volumes of data that it has become next to impossible for those applications to perform well, with high accuracy, within the stipulated time. As mentioned in the use case, customers keep finding new ways to beat the system, and they could do so because the earlier systems were not able to quickly understand and adapt to these changes.

This adaptability is an inherent property of machine learning, and it can be leveraged to solve the problem.

Interpretability of the model is required, since it is the responsibility of the lending organization to explain to a customer why a loan was declined.

The final classification does not need to be produced within a short duration; in other words, the model does not have to decide Non-defaulter or Defaulter within milliseconds of the customer's details being entered. Low latency is therefore not a high-priority requirement for this model.

The cost of making errors within an organization is always high. In this case study we arrive at a conclusion of Non-defaulter / Defaulter based on the data currently available with the lending organization. Can we pin the responsibility on the model if, for various reasons, a customer predicted as a Non-defaulter turns into a defaulter n periods later? The lending organization should have procedures in place to keep checking the customer's financial ability to pay back the loan even after it is disbursed.

The metric we will use is the ROC-AUC score. It takes the class labels and the predicted probabilities and generates a score.
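As a quick illustration of how this score is computed, here is a minimal sketch with scikit-learn; the labels and probabilities below are made up.

from sklearn.metrics import roc_auc_score

# hypothetical class labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 0, 1]
y_prob = [0.10, 0.35, 0.80, 0.45, 0.60]

print(roc_auc_score(y_true, y_prob))  # 1.0 for this perfectly ranked toy example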

Thanks to Home Credit for putting out such an extensive and rich dataset that helps the ML community and aspiring practitioners.

4) Introduction to Datasets

The data has been downloaded from kaggle.com

The following are the datasets that will be used in this case study :

1. Application ==> Details about current customers applying for loans with Home Credit. It is for these customers that we need to predict whether they will default or not.

2. Previous Application ==> Details about all the previous loans of the current customer

3. POS_cash_Balance ==> Details about Previous Point of Sale Cash balances of the current customer

4. Credit_card_Balance ==> Details about Previous Credit Card balances of the current customer

5. Installments ==> Details about the recurring payments for the POS cash and credit card balances

6. Bureau ==> Central repository of an individual's/entity's history of loans, repayments, bankruptcy and other details

7. Bureau_balance ==> Details about the actual payments made against loans recorded by the bureau

5) Existing approaches

Since this is a Kaggle competition that has already closed, winning solutions are available.

1st place solution

3rd place solution

6) First Cut Approach

  1. This problem contains data from multiple domains like point-of-sale, credit cards and actual installments. While building a single monolithic model encompassing features from all the datasets, orphaned rows (rows that cannot be matched across all the datasets) may create a problem; but when building individual models per dataset, their presence helps produce better results.
  2. There are lots of blanks in the dataset. How do we treat them? A couple of discussion comments state that models performed better when no imputation was done for missing values. I will take this hint, skip missing-value imputation and check how the models perform.
  3. There are some 67 columns describing real-estate details and whether documents have been submitted or not. To these we can add another 12 flag columns. From other kernels I have seen EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3 coming up as important features, so I plan to include these columns as well in a separate dataset. We will run a separate model on this set of data, using logistic regression, and include its predicted probability in the main model.
  4. Rate of interest will definitely be a strong feature, hence I would like to calculate it. To calculate the interest rate from the previous-application data we have the amount of credit, the annuity amount (which includes the interest) and the payment count. The competition host has said that the annuity includes principal and interest, hence it should be treated as an equated monthly installment (EMI). This makes finding the rate of interest a bit complicated: the expression for the rate of interest has no simple closed-form solution, so a close approximation is applied, and the formula shown below calculates the rate of interest:
Formula to calculate the rate of interest (with a = AMT_ANNUITY and n = CNT_PAYMENT, as implemented in the transformer later in this post):

rate_of_interest ≈ 6(a − n) / (n² + (5 + 2a)n − 2a)

5. This will be the final model that produces the final predictions of the TARGET column. There are a couple of options that can be experimented with:

a. The dataset has lots of columns, and feature-engineering opportunities extend as far as imagination stretches. Using dimensionality reduction and feature-importance output we will try to bring the feature set down to about 100 features, since we have about 100K rows. This is just a rule of thumb, not a constraint that must be strictly adhered to.

b. A single monolithic model with the features from step (a) above could be our base model. After that it is a trial-and-error process of adding features and tuning hyper-parameters, guided by whether the AUC score moves higher or lower.

c. Stacking and voting classifiers are used extensively in Kaggle competitions, hence I will also build a model with this kind of architecture. Since the individual models are not yet available, I cannot share the details of this model at this point.

d. The code for my individual models will be organized into classes and modules following a simple OOP design. The code will thus be modular and reusable, allowing me to create a processing pipeline, which is a major requirement for this case study.

7) Exploratory Data Analysis (EDA)

Exploratory data analysis is performed on the dataset to find out which statistical methods can appropriately be applied. Through EDA we validate the raw data and check for irregularities or deviations. Exploring the dataset produces new insights, some useful and others irrelevant.

There are seven datasets; for brevity we perform EDA on application_train.csv and display those results in this post.

EDA has been performed for all the other datasets as well; it can be found in the GitHub link here.

Univariate analysis is the process of analysing single data items in the data.

Multivariate analysis analyses the interaction between two or more data items. The output of both analyses can be reported either graphically or non-graphically.

Class Imbalance

Class Imbalance

The above graph shows that the class labels are imbalanced, hence we need to adopt either under-sampling or over-sampling, or use the class_weight / scale_pos_weight parameters of the models.

I have used class weights for handling class imbalance.
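A minimal sketch of how such weights can be derived with scikit-learn (the target array here is only illustrative):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 92 + [1] * 8)          # illustrative 92:8 imbalance
classes = np.unique(y)
weights = compute_class_weight('balanced', classes=classes, y=y)
class_weight = dict(zip(classes, weights))
print(class_weight)
# pass class_weight to LogisticRegression / RandomForestClassifier,
# or the ratio of negatives to positives to LightGBM's scale_pos_weight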

Univariate Analysis — non-graphical

Let's describe the dataset and look at the measures of central tendency. First we will describe the distribution, and then we will determine its nature.

This table displays the measures of central tendency

Displays percentiles

Nature of Distribution

The nature of the distribution is described by the skewness (symmetry) and kurtosis (peakedness) of the dataset. The dataset is symmetrical if the skewness is zero.

Skewness

skewness of data

Kurtosis

This describes the peakedness of the distribution, or the shape of its tails. A high kurtosis can also indicate the presence of outliers. A negative kurtosis, like the one we find in our dataset, means thinner tails, with extreme values less likely to occur.

Kurtosis of the dataset

Standard deviation, variance and inter-quartile range (IQR) are three measures of spread. Spread measures how distant the data is from the center, which can be the mean or the median. The IQR is the most robust measure of spread, as it is not affected by extreme values.

Spread of the data
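These statistics are simple to compute with pandas; a minimal sketch on a single column of application_train.csv (AMT_CREDIT is just an example column):

import pandas as pd

df = pd.read_csv('application_train.csv')
col = df['AMT_CREDIT']

print('skewness:', round(col.skew(), 2))
print('kurtosis:', round(col.kurtosis(), 2))
print('std dev :', round(col.std(), 2))
print('IQR     :', col.quantile(0.75) - col.quantile(0.25))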

Uni-variate graphical analysis

KDE plots: a Kernel Density Estimate (KDE) is used to visualize the probability density of a continuous variable at different values. In the same graph we can plot two series of data, one belonging to the default class and the other to the non-default class, allowing us to compare the two.
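A minimal sketch of such an overlaid KDE plot with seaborn (using Amount Credit, one of the columns shown below, purely as an example):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('application_train.csv')
sns.kdeplot(df.loc[df['TARGET'] == 0, 'AMT_CREDIT'], label='Non-defaulter (0)')
sns.kdeplot(df.loc[df['TARGET'] == 1, 'AMT_CREDIT'], label='Defaulter (1)')
plt.legend()
plt.show()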

Kernel Density Estimate plot for Amount Credit in application_train.csv
Kernel Density Estimate plot for Amount Goods Price in application_train.csv
Bar graph for categorical variable NAME_INCOME_TYPE. Surprisingly, students are not defaulters.
Bar graph for categorical variable ORGANIZATION_TYPE from application_train.csv

Multi-variate Graphical analysis

Scatter plot between AMOUNT_ANNUITY and AMOUNT_CREDIT from application_train.csv

Multi-variate Non-graphical analysis

I have used only those columns that have a positive correlation with the class label.

Correlation values between features and the class label

8) Feature engineering

In this kind of money-lending problem, the rate of interest is crucial. Since this dataset is hosted by a functioning organization, those details have been withheld for obvious reasons.

Rate Of Interest

The rate of interest can be calculated for loans that have already been sanctioned, hence we calculate it for the data in previous_application.csv.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Xfrmer_rateofinterest(BaseEstimator, TransformerMixin):
    """
    Calculates the rate of interest; should be one of the GOLDEN FEATURES.
    Let's wait and watch.
    """
    # constructor
    def __init__(self):
        # we are not going to use this
        self.rteofintrst = None

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # rate of interest calculation; I hope this will be a stellar feature
        # the formula is: numerator   = 6(a - n)
        #                 denominator = square(n) + (5 + 2a)n - 2a
        #                 RATEOFINTEREST = numerator / denominator
        n = X["CNT_PAYMENT"]
        anuity = X["AMT_ANNUITY"]
        numerator = 6 * (anuity - n)
        denominator = np.square(n) + (5 + 2 * anuity) * n - 2 * anuity
        rteofintrst = (numerator / denominator).replace((np.inf, -np.inf), (-9, -9))
        X = rteofintrst.values.reshape(-1, 1)
        X[np.isnan(X)] = 0
        print('rteint shape', X.shape)
        return X


class Xfrmer_rteofintrst(BaseEstimator, TransformerMixin):
    """
    Calculates the rate of interest; should be one of the GOLDEN FEATURES.
    Let's wait and watch.
    """
    # constructor
    def __init__(self):
        # we are not going to use this
        self.rteofintrst = None

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # rate of interest formula used by other competitors
        # https://www.kaggle.com/c/home-credit-default-risk/discussion/64598
        n = X["CNT_PAYMENT"]
        anuity = X["AMT_ANNUITY"]
        interst = (n * anuity) - X["AMT_CREDIT"]
        numerator = 24 * interst
        denominator = X["AMT_CREDIT"] * (n + 1)
        rteofintrst = (numerator / denominator).replace((np.inf, -np.inf), (-9, -9))
        X = rteofintrst.values.reshape(-1, 1)
        X[np.isnan(X)] = 0
        return X

Scikit-learn pipelines are used for generating column-level features and also for OHE encoding. The advantage of a pipeline is that we can fit on the training dataset and transform the test dataset. Below is an image of a sample pipeline used on the supplemental dataset credit_card_balance.csv.

Features are created by dividing, adding and subtracting columns in all six datasets.

The pipeline is structured so that we first select the columns and then perform one of the basic arithmetic operations.

Pipeline for adding and subtracting two columns
Pipeline for dividing two columns
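A minimal sketch of such a pipeline step: a custom transformer that selects two columns and returns their ratio, wrapped in a scikit-learn Pipeline. The column names here are only illustrative.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class ColumnRatio(BaseEstimator, TransformerMixin):
    """Divide one column by another and return the result as a 2-D array."""
    def __init__(self, numer, denom):
        self.numer = numer
        self.denom = denom

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        ratio = (X[self.numer] / X[self.denom]).replace([np.inf, -np.inf], 0).fillna(0)
        return ratio.values.reshape(-1, 1)

# illustrative usage on two credit_card_balance-style columns
ratio_pipe = Pipeline([('credit_utilisation', ColumnRatio('AMT_BALANCE', 'AMT_CREDIT_LIMIT_ACTUAL'))])
# features = ratio_pipe.fit_transform(df_credit_card)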

Missing Value Detection

The following snippet of code is used to calculate and display the percentage of missing values along with the skewness.

Our processing varies depending on the percentage of missing values:

  • less than 5% ==> fill with zeroes
  • between 5% and 20% ==> fill with the mean, median or mode
  • between 20% and 75% ==> check how the data is distributed, then use model-based imputation
  • greater than 75% ==> average value

We also generate the skew for every column, since this tells us whether to use the mean, median or mode: if the data is symmetric we can use the mean; if the data is skewed to the left or right we use the median or mode.

def display_mising_value(self):
    """
    fn that displays the missing values of the dataset
    """
    dct_tmp = {}
    # percentage of missing values and skewness for every column
    dct_tmp["PCT"] = (self.df_source.isnull().sum()) * 100 / self.df_source.shape[0]
    dct_tmp["SKEWNES"] = self.df_source.skew().round(2)
    self.df_mis_val = pd.DataFrame(dct_tmp, columns=["PCT", "SKEWNES"])
    # bucket the columns by their missing-value percentage
    self.df_misval_lvl1 = pd.DataFrame(self.df_mis_val.loc[self.df_mis_val["PCT"] < 5])
    self.df_misval_lvl2 = pd.DataFrame(self.df_mis_val.loc[(self.df_mis_val["PCT"] >= 5) & (self.df_mis_val["PCT"] < 20)])
    self.df_misval_lvl3 = pd.DataFrame(self.df_mis_val.loc[(self.df_mis_val["PCT"] >= 20) & (self.df_mis_val["PCT"] < 75)])
    self.df_misval_lvl4 = pd.DataFrame(self.df_mis_val.loc[self.df_mis_val["PCT"] >= 75])
    self.df_misval_lvl1.sort_values(by="PCT", axis=0, inplace=True)
    self.df_misval_lvl2.sort_values(by="PCT", axis=0, inplace=True)
    self.df_misval_lvl3.sort_values(by="PCT", axis=0, inplace=True)
    self.df_misval_lvl4.sort_values(by="PCT", axis=0, inplace=True)

Missing value percentage between 5 and 20 percent
Missing value percentage between 20 and 75 percent
Missing value percentage greater than 75 percent

Impute missing values

This dataset is hosted by an organization on Kaggle, hence some data has been removed or substituted with other values. That is why we see values like 365243, XNA and XAP. These do not fit the missing-at-random and other standard missing-value assumptions.

The solution to missing values is to impute them. In this case study we cannot use a global median to fill in a missing value. The annuity amount, for example, depends on the amount of credit and the repayment period, which vary per customer; a global median imputation could produce an annuity greater than that customer's amount of credit.

For missing values in categorical variables, we need to look at associated fields and impute accordingly. We have two columns, FLAG_OWN_CAR and OWN_CAR_AGE: the age of a car can only be calculated if the customer owns one. For the missing values in OWN_CAR_AGE we first check that FLAG_OWN_CAR is 'Y' and then fill with the median. The code below is an extract from the transform method of one of the custom transformers.

mdn = X.loc[(X['FLAG_OWN_CAR'] == 'Y'), 'OWN_CAR_AGE'].median()
X.loc[(X['FLAG_OWN_CAR'] == 'Y') & (X['OWN_CAR_AGE'].isna()), 'OWN_CAR_AGE'] = mdn
X.loc[(X['FLAG_OWN_CAR'] == 'N') & (X['OWN_CAR_AGE'].isna()), 'OWN_CAR_AGE'] = 0

# name_income_type = 'Commercial associate' and occupation_type
cat_mode = X.loc[(X['NAME_INCOME_TYPE'] == 'Commercial associate'), 'OCCUPATION_TYPE'].value_counts().index[0]
X.loc[(X['NAME_INCOME_TYPE'] == 'Civil marriage') & (X['OCCUPATION_TYPE'].isna()), 'OCCUPATION_TYPE'] = cat_mode
Column transformer with SimpleImputer

9) Model Explanation

Model Solution Description

There are many datasets in this project: the primary datasets, application_train.csv and application_test.csv, and the supplemental datasets.

The primary dataset has a key named SK_ID_CURR. The supplemental datasets have a composite primary key combining SK_ID_CURR and SK_ID_PREV. Two of them, bureau.csv and bureau_balance.csv, instead have SK_ID_CURR and SK_ID_BUR as their composite key; these two don't have SK_ID_PREV as part of the primary key.

The approach I have taken preserves the information contained in these supplemental files: I join each supplemental table with the application table on SK_ID_CURR and give each row the corresponding class label.
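A minimal sketch of this join with pandas, merging the TARGET column from the application table onto a supplemental table using SK_ID_CURR:

import pandas as pd

app = pd.read_csv('application_train.csv')
prev = pd.read_csv('previous_application.csv')

# attach the class label to every row of the supplemental table
prev_labelled = prev.merge(app[['SK_ID_CURR', 'TARGET']], on='SK_ID_CURR', how='inner')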

Dataset with class label

Now we can model the supplemental datasets separately and arrive at a probability of default for these customers. We do the same for the credit card, POS cash and installments tables. Since bureau and bureau_balance don't have SK_ID_PREV, we first join bureau with application_train using SK_ID_CURR, and then join the result with bureau_balance using SK_ID_BUR.

With this strategy we have predicted probabilities from six different datasets (excluding bureau_balance). We generate aggregates from these six datasets using SK_ID_CURR and merge all the aggregates together along with the respective class labels.

We will use a super learner on this dataset to generate the final probabilities for predicting the final class labels of the application_test dataset.

In the bureau_balance dataset there is a column called STATUS, which stores the status of the Credit Bureau loan during each month. Using the SK_ID_CURR and SK_ID_BUR keys, the class label is concatenated to every row. A row of bureau_balance.csv then looks like SK_ID_BUR, STATUS and CLASS LABEL:

102534  0 0 0 0 1 X 0 2 X X 3 1 2 3   1 ← class label

102535  0 0 0 0 1 0 0 0 X X 3 1 1 1   0 ← class label

We use a vectorizer to create n-grams from the STATUS column and find the patterns related to the defaulter class label. The n-grams created are of sizes (3,3), (6,6), (9,9) and (12,12); they represent the customer's monthly loan-payment status over 3-month, 6-month, 9-month and 12-month windows. From the feature importances we can see which of these patterns are associated with class label 1.
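A minimal sketch of that idea: the monthly STATUS codes are concatenated into one string per bureau loan and character n-grams are extracted from it. The vectorizer settings are illustrative (a TF-IDF vectorizer with 3-month patterns), and the key column is assumed to be SK_ID_BUREAU as named in the Kaggle files.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

bb = pd.read_csv('bureau_balance.csv')

# one string of monthly status codes per bureau loan
status_seq = (bb.sort_values('MONTHS_BALANCE')
                .groupby('SK_ID_BUREAU')['STATUS']
                .apply(''.join))

# 3-month patterns; repeat with ngram_range (6, 6), (9, 9), (12, 12) for longer windows
vec = TfidfVectorizer(analyzer='char', ngram_range=(3, 3), lowercase=False)
X_status = vec.fit_transform(status_seq)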

Unified Modeling Language (UML)

UML for classes

Workflow

The following workflow was adopted in solving this problem:

· Check for class imbalance

· Missing value detection

· Creation of the train/test split

· For each of the datasets, perform the following using pipelines:

  • data cleaning
  • impute missing values
  • generate column-level features
  • generate one-hot encoding
  • pass the prepared data to the machine learning model
  • perform hyper-parameter tuning depending on the model
  • train the model using the tuned hyper-parameter values
  • generate predictions and probabilities using the test data
  • persist the results to external storage

Creation of Train/Test split

The supplemental datasets don't have their own train and test splits, so I merged the application train and test files with the supplemental datasets to produce individual train/test files for each of them.

The data in the supplemental files has a time-series aspect, which stops us from using the usual train_test_split function. Because of this, the train dataset has to contain the older data and the test dataset the more recent data.

All rows pertaining to data three years old and older go into the train dataset, and data less than three years old becomes the test dataset.
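One way to express that cut-off in code is sketched below. This is only an illustration: it assumes we split previous_application.csv on its DAYS_DECISION column (days relative to the current application, so more negative means older); the actual column used differs per dataset.

import pandas as pd

prev = pd.read_csv('previous_application.csv')

three_years = 3 * 365
# DAYS_DECISION is negative: -1200 means the decision was made 1200 days ago
train_df = prev[prev['DAYS_DECISION'] <= -three_years]   # three years old and older
test_df = prev[prev['DAYS_DECISION'] > -three_years]     # more recent than three years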

Data Cleaning

There are many NA/NaN values and, specific to this dataset, a very large placeholder value, 365243, which needs to be replaced in all the datasets.

I used scikit-learn's ColumnTransformer with a custom transformer like the one below.

Column transformer with a custom transformer for data cleaning

Whenever you want to replace values within a dataframe inside a pipeline, you need to pass the data as a NumPy array to avoid the SettingWithCopyWarning. The other way is to make a copy of the data inside the custom transformer, make the changes and pass the copy back. I didn't like duplicating the data, hence I passed NumPy arrays to the custom transformers.

When NumPy arrays are passed to the transformer, we need to pass the index of the column we want to process, whereas if you pass dataframes you would pass the column names.

class Xfrmer_replacenp(BaseEstimator, TransformerMixin):
    """
    This transformer does a global replace within the data:
    replace 365243 (specific to this case study) with 0,
    replace +/-inf and nan with zero.
    """
    # constructor
    def __init__(self):
        # we are not going to use this
        self._features = None

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X[X == 365243.0] = 0
        X[X == "XAP"] = 0
        X[X == "XNA"] = 0
        X[X == "nan"] = 0
        print('all replace', X.shape)
        return X

The other situation I faced was dealing with NumPy arrays whose dtype is object. In that case I convert the array to a dataframe, perform the operation (in my case fillna with zero) and return the dataframe.

if tmpval.dtype == 'object':
    tmpval = pd.DataFrame(tmpval)
    tmpval.fillna(0, inplace=True)
    X = tmpval

One hot encoding

Scikit-learn pipelines are also used for OHE encoding. The advantage of the pipeline is that we can fit on the training dataset and transform the test dataset. Below is an image of a sample pipeline used on the supplemental dataset credit_card_balance.csv.

The pipeline is structured so that we first select the columns and then perform OHE.

The following is a sample pipeline for one-hot encoding a categorical variable in credit_card_balance.csv.

Pipeline for converting categorical to numeric using one-hot encoding

After one-hot encoding we need to generate column names for the new columns. The following code generates the column names for the OHE columns; we append them to the numerical column names and recreate the complete dataset. We need to do this because we need feature importances and to perform selection of the best features.

col_lvl_ft_names = []
ft_union = crdcrd_ft_gen_piplin['engineer_data']
tpl_xfrmr_lst = ft_union.transformer_list
catpipe = tpl_xfrmr_lst[22][1]
ccohe = catpipe.named_steps['CC_OHE']
#print(type(ccohe))
print(len(ccohe.categories_))
lst_ohe_ft_name = ccohe.categories_[0]

print('Generating column names for numerical features...')
for xfrmr_lvl in range(len(tpl_xfrmr_lst) - 1):
    col_lvl_ft_names.append(tpl_xfrmr_lst[xfrmr_lvl][0])

# add OHE categories as column names
print('Generating column names for categorical features...')
for itm in lst_ohe_ft_name:
    col_lvl_ft_names.append("CC_NMECTRCTSTAT_" + itm.upper())

# recreate the dataframe with imputed column values
df_trn_colvl = pd.DataFrame(X_trn_xfrm, columns=col_lvl_ft_names)
df_tst_colvl = pd.DataFrame(x_tst_xfrm, columns=col_lvl_ft_names)

# concatenate the two dataframes to the original dataframe
dtprcs.X_train = pd.concat([dtprcs.X_train, df_trn_colvl], axis=1)
dtprcs.x_test = pd.concat([dtprcs.x_test, df_tst_colvl], axis=1)

That brings us to the end of the stage where the data is ready to be passed on to the machine learning models.

Model Training

To train a machine learning model the following steps have to be performed:

  1. All models have parameters, and we need to find the optimum values for these parameters to get the best results from the model. This is termed hyper-parameter tuning.
  2. We will use GridSearchCV for finding the best parameters for logistic regression, which has two parameters: lambda (the regularization strength) and the penalty (regularizer).
  3. RandomizedSearchCV is used for the random forest model. Random forest has many hyper-parameters; we will work with three of them: number of estimators, maximum depth and min_samples_leaf.
  4. One other model we will work with is the LightGBM classifier, for which we use Hyperopt for hyper-parameter tuning. This model is also tree-based, hence it has parameters like boosting, max_depth, min_data_in_leaf, num_leaves and others.
  5. For all the models the metric we will be using is the ROC-AUC score.
  6. In all three models we pass class weights as the strategy for resolving class imbalance: logistic regression and random forest have the parameter class_weight, and LightGBM has scale_pos_weight.
  7. Cross-validation is another aspect we need to pay attention to during hyper-parameter tuning. Cross-validation maximizes the usage of the training dataset: we make n splits of the training data, hold one split out as a validation set, train on the remaining n-1 splits, and record the metric on the held-out split; this is repeated for every split and reported as the cross-validation score. Random forest has a related measure called the out-of-bag (OOB) score. This score indicates how the model will generalize in the real world, i.e. how it will perform on unseen data once productionized. In our case study we need special processing to generate the cross-validation folds. Most of the supplemental datasets have a one-to-many relationship, so we must ensure that all rows belonging to one primary key stay in the same fold, and we also need to maintain the distribution of the class label as it is in the original dataset. Scikit-learn did not offer functionality combining these two requirements (grouping and stratification), so I wrote the method strtfied_grp_kfld in the class MLModel_data to implement it. We group the data using the ID and the target variable and count the number of rows in the minority class, dividing them across the number of folds we want to create. From the compute_class_weight function we already know the target distribution in the parent dataset, and using this information we randomly sample the required number of majority-class rows from the groups made earlier. The CV parameter of the hyper-parameter tuners can accept either a number or a list of train/test index pairs; with three-fold cross-validation the value passed to CV looks like [[train1, test1], [train2, test2], [train3, test3]]. We also need to look at the training and validation errors to evaluate the bias-variance trade-off: a high training error indicates a high-bias model that under-fits, while a validation error well above the training error indicates a high-variance model that over-fits. The validation error should stay as close to the training error as possible. At the end of hyper-parameter tuning we obtain the optimum hyper-parameter values, the best estimator of the model, and the cross-validation scores.
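A minimal sketch of that grouped, stratified fold construction is shown below. This is a simplified stand-in, not the actual strtfied_grp_kfld method: it assigns one label per ID (positive if any of the ID's rows is positive) and stratifies at the ID level so that all rows of an ID land in the same fold.

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def grouped_stratified_folds(df, id_col, target_col, n_splits=3, seed=42):
    # one label per ID: an ID is treated as positive if any of its rows is positive
    grp = df.groupby(id_col)[target_col].max()
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = []
    for trn_g, val_g in skf.split(np.zeros(len(grp)), grp.values):
        trn_ids, val_ids = grp.index[trn_g], grp.index[val_g]
        # map the ID-level split back to row positions
        trn_rows = np.where(df[id_col].isin(trn_ids))[0]
        val_rows = np.where(df[id_col].isin(val_ids))[0]
        folds.append((trn_rows, val_rows))
    return folds

# cv_folds = grouped_stratified_folds(train_df, 'SK_ID_CURR', 'TARGET', n_splits=3)
# GridSearchCV(estimator, param_grid, scoring='roc_auc', cv=cv_folds)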
a) Best estimator for Logistic Regression
b) Best estimator for Logistic Regression when using text data
c) Pipeline for processing text data with multiple TFIDF vectorizers
d) Best estimator for Random Forest

We pass the test (unseen) data through the various pipelines of the respective datasets for pre-processing, feature generation and OHE generation, and finally use the above best estimators to predict the class labels and their predict_proba functions to predict the probabilities.

Metric Calculation

Now starts the hard part: measuring how good our models are. The metric we use is the ROC-AUC score, which takes the class labels and the predicted probabilities and generates a score. Since this is a Kaggle dataset, we don't have the class labels for the unseen data, so we can generate this score only for the training dataset. We submit the predicted probabilities for the test data to Kaggle to find out the leaderboard ranking.

The various pipelines, cross-validation scores and calculated predicted probabilities are stored on disk for future use.

Super Learner

From the supplemental datasets we have trained various models and generated probabilities. We combine all these probabilities and use them as features for another model, a logistic regression, which produces the final probabilities. This model is called the super learner and is supposed to produce better results.
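A minimal sketch of this stacking step, assuming the predicted probabilities of the individual models have already been merged into a single table together with the class label (the file name and column names here are illustrative, not the actual ones used):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical file: one predicted-probability column per supplemental model, plus TARGET
stacked_df = pd.read_csv('stacked_probabilities.csv')
feature_cols = ['prob_prev_app', 'prob_pos_cash', 'prob_credit_card',
                'prob_installments', 'prob_bureau', 'prob_application']

super_learner = LogisticRegression(class_weight='balanced', max_iter=1000)
super_learner.fit(stacked_df[feature_cols], stacked_df['TARGET'])
final_probs = super_learner.predict_proba(stacked_df[feature_cols])[:, 1]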

10) Results

a)Logistic Regression — Results

Logistic Regression results
Logistic Regression cross-validation results

b)LightGBM — Results

LightGBM Classifier results

c)Random Forest — Results

Random forest results
Random forest hyper-parameter values

d)Super Learner — Results

Super learner — Logistic Regression model results
Super learner — Logistic Regression model cross-validation scores

e)Feature Importance of Bureau_balance.csv

From the y-axis we can read off the patterns that are associated with default
Here the y-axis shows a (1,1) gram that is associated with default

Final Model Recommendation

Best dataset to use: current application train and current application test.

Best algorithm: Random Forest, with a ROC-AUC train score of 0.8687578 and a cross-validation score of 0.74142.

Second-best algorithm: LightGBM, with a ROC-AUC train score of 0.7859 and a cross-validation score of 0.750908.

Kaggle Submission

Report on Kaggle submission

11) Conclusion and Future Work

This is an imbalanced dataset, hence class weights were used in all three algorithms.

The following measures have been taken to address data leakage:

  1. Ensured that data three years old and older is in the train set and data within three years is in the test set
  2. Implemented custom code to ensure that the same ID does not occur across the various validation folds, while the ratio of the target variable is maintained across the folds. The closely matching results are evidence that this custom code works
  3. All time-related features have been dropped from the dataset
  4. Used pipelines so that the transformations fitted on the train dataset are applied to the test dataset
  5. Used only features positively correlated with the target
  6. For logistic regression, used a logarithmic search space for hyper-parameter tuning

a)Feature- Rate Of Interest

  1. The rate of interest, which was not part of the original dataset, was calculated, and it appears among the top 6 features for the LightGBM and Random Forest models.
  2. The rate-of-interest feature appears in both the GAIN and SPLIT types of feature importance.

b)Future Work

  1. I have used class weights for class imbalance; we can instead try SMOTE or ADASYN to over-sample the minority class and re-run the models.
  2. More data means better results; due to limited compute power I used only 80K rows, which can be increased and the performance measured.
  3. More feature engineering, such as polynomial and trigonometric features, can be added.
  4. Group all data according to the primary key SK_ID_CURR and create a monolithic dataset for processing.
  5. I have used only positively correlated features; we can add the negatively correlated features and check the results.
  6. There is a lot of skewness in the data. We can create a completely new dataset with skew-corrected features and model it.

The teachings of my Guru:

Read Mathematics formulae as poetry
-- Srikanth Varma

Data science is like storytelling; I hope you find my story interesting.

Plot: Classification

Actor: Modelling Algorithms

Actress: Features

Director: Python

Villains: Outliers, missing values

Others: Train/Test split, cross-validation, import-export, roc-auc score

Police: Bias & Variance Trade off

Comedians: Skewness, Data Distributions

Music: Sklearn

Best romantic song: Duet of Training and Validation error

Best sad song: Data Leakage

Script writer: OOPS, Pipelines, Transformers

Stunt director: Hyper-parameter tuners

Cinematography: Seaborn, matplotlib

Financier: Computing Power

12) Profile

1. https://github.com/ariyurjana/final_case_study_1 — EDA of all seven datasets and the complete code

13) References

  1. https://www.appliedaicourse.com/
  2. https://www.kaggle.com/c/home-credit-default-risk
  3. https://stackoverflow.com
  4. https://datascience.stackexchange.com/
  5. www.google.co.in
  6. www.github.com


Janardhanan a r

In-the-making machine learner, programmer and music lover