Lend and still be Profitable…

Objective

Table of Contents

1) Introduction

2) Business problem

a) Use Case

b) Machine learning models

3) Mapping to ML problem

4) Introduction to Datasets

5) Existing approaches

6) First Cut Approach

Formula to calculate the rate of interest
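The caption above refers to a formula image. Reconstructed from the transformer code in the feature engineering section below, with $a$ the annuity amount (AMT_ANNUITY) and $n$ the number of payments (CNT_PAYMENT), the first-cut rate of interest is:

$$r = \frac{6(a - n)}{n^{2} + (5 + 2a)\,n - 2a}$$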

7) Exploratory Data Analysis (EDA)

Class imbalance
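As a quick way to reproduce the imbalance numbers shown above, here is a minimal sketch, assuming application_train.csv has been loaded into a dataframe `df` (the variable name is illustrative):

```python
import pandas as pd

# Load the training data and inspect the distribution of the class label
df = pd.read_csv("application_train.csv")
print(df["TARGET"].value_counts())                      # absolute counts
print(df["TARGET"].value_counts(normalize=True) * 100)  # percentages
```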
This table displays the measures of central tendency
Displays percentiles
Skewness of the data
Kurtosis of the dataset
Spread of the data
Kernel Density Estimate Plot for Amount Credit in application_train.csv
Kernel Density Estimate Plot for Amount Goods Price in application_train.csv
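A minimal sketch of how KDE plots like these can be drawn with seaborn, assuming `df` holds application_train.csv; splitting the density by TARGET is an assumption about how the plots were generated:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# KDE of the credit amount and goods price, split by the class label
for col in ["AMT_CREDIT", "AMT_GOODS_PRICE"]:
    sns.kdeplot(data=df, x=col, hue="TARGET", common_norm=False)
    plt.title(f"KDE of {col}")
    plt.show()
```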
Bar graph for categorical variable NAME_INCOME_TYPE. Surprisingly, students are not defaulters.
Bar graph for categorical variable ORGANIZATION_TYPE from application_train.csv
Multivariate graphical analysis: scatter plot between AMT_ANNUITY and AMT_CREDIT from application_train.csv
Displays correlation values between features and the class label
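A minimal sketch of how such correlation values can be computed with pandas, again assuming `df` holds the training data; only numeric columns are considered:

```python
# Pearson correlation of every numeric feature with the class label,
# sorted by absolute strength
corr_with_target = (
    df.corr(numeric_only=True)["TARGET"]
      .drop("TARGET")
      .sort_values(key=abs, ascending=False)
)
print(corr_with_target.head(10))
```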

8) Feature engineering

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Xfrmer_rateofinterest(BaseEstimator, TransformerMixin):
    """
    Calculates the rate of interest. This should be one of the GOLDEN FEATURES.
    Let's wait and watch.
    """
    def __init__(self):
        # no fitted state is needed; kept only for interface symmetry
        self.rteofintrst = None

    def fit(self, X, y=None):
        # nothing to learn; return self
        return self

    def transform(self, X, y=None):
        # Rate-of-interest calculation. I hope this will be a stellar feature.
        # The formula is:
        #   numerator   = 6(a - n)
        #   denominator = n^2 + (5 + 2a)n - 2a
        #   rate of interest = numerator / denominator
        n = X["CNT_PAYMENT"]
        anuity = X["AMT_ANNUITY"]
        numerator = 6 * (anuity - n)
        denominator = np.square(n) + (5 + 2 * anuity) * n - 2 * anuity
        rteofintrst = (numerator / denominator).replace([np.inf, -np.inf], -9)
        X = rteofintrst.values.reshape(-1, 1)
        X[np.isnan(X)] = 0
        print('rteint shape', X.shape)
        return X


class Xfrmer_rteofintrst(BaseEstimator, TransformerMixin):
    """
    Calculates the rate of interest. This should be one of the GOLDEN FEATURES.
    Let's wait and watch.
    """
    def __init__(self):
        # no fitted state is needed; kept only for interface symmetry
        self.rteofintrst = None

    def fit(self, X, y=None):
        # nothing to learn; return self
        return self

    def transform(self, X, y=None):
        # rate-of-interest formula used by other competitors:
        # https://www.kaggle.com/c/home-credit-default-risk/discussion/64598
        n = X["CNT_PAYMENT"]
        anuity = X["AMT_ANNUITY"]
        interst = (n * anuity) - X["AMT_CREDIT"]
        numerator = 24 * interst
        denominator = X["AMT_CREDIT"] * (n + 1)
        rteofintrst = (numerator / denominator).replace([np.inf, -np.inf], -9)
        X = rteofintrst.values.reshape(-1, 1)
        X[np.isnan(X)] = 0
        return X
```
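A hypothetical usage sketch of the two transformers on a toy frame; the values are made up purely for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CNT_PAYMENT": [12.0, 24.0, 0.0],
    "AMT_ANNUITY": [2500.0, 1800.0, np.nan],
    "AMT_CREDIT": [25000.0, 40000.0, 10000.0],
})
print(Xfrmer_rateofinterest().fit_transform(toy))  # first-cut formula
print(Xfrmer_rteofintrst().fit_transform(toy))     # competitors' formula
```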
Pipeline for adding and subtracting two columns
Pipeline for dividing two columns
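The pipeline code itself is not shown in the captions above; here is a minimal sketch of what such a column-arithmetic transformer could look like, following the same pattern as the rate-of-interest transformers (the class name and the -9/0 guard values are assumptions carried over from them):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Xfrmer_coldivide(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: divide one column by another, guarding
    infinities and NaNs the way the transformers above do."""
    def __init__(self, numcol, dencol):
        self.numcol = numcol
        self.dencol = dencol

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        ratio = (X[self.numcol] / X[self.dencol]).replace([np.inf, -np.inf], -9)
        out = ratio.values.reshape(-1, 1)
        out[np.isnan(out)] = 0
        return out
```

An adding or subtracting variant would simply swap the division for `X[self.numcol] + X[self.dencol]` or the corresponding difference.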
```python
def display_missing_value(self):
    """
    Displays the missing values of the dataset, bucketing columns by
    the percentage of values that are missing.
    """
    dct_tmp = {}
    dct_tmp["PCT"] = self.df_source.isnull().sum() * 100 / self.df_source.shape[0]
    dct_tmp["SKEWNESS"] = self.df_source.skew(numeric_only=True).round(2)
    self.df_mis_val = pd.DataFrame(dct_tmp, columns=["PCT", "SKEWNESS"])
    # bucket columns into four levels by missing-value percentage
    self.df_misval_lvl1 = pd.DataFrame(self.df_mis_val.loc[self.df_mis_val["PCT"] < 5])
    self.df_misval_lvl2 = pd.DataFrame(self.df_mis_val.loc[(self.df_mis_val["PCT"] >= 5) & (self.df_mis_val["PCT"] < 20)])
    self.df_misval_lvl3 = pd.DataFrame(self.df_mis_val.loc[(self.df_mis_val["PCT"] >= 20) & (self.df_mis_val["PCT"] < 75)])
    self.df_misval_lvl4 = pd.DataFrame(self.df_mis_val.loc[self.df_mis_val["PCT"] >= 75])
    self.df_misval_lvl1.sort_values(by="PCT", axis=0, inplace=True)
    self.df_misval_lvl2.sort_values(by="PCT", axis=0, inplace=True)
    self.df_misval_lvl3.sort_values(by="PCT", axis=0, inplace=True)
    self.df_misval_lvl4.sort_values(by="PCT", axis=0, inplace=True)
```
Missing value percentage between 5 and 20 percent
Missing value percentage between 20 and 75 percent
Missing value percentage greater than 75 percent
```python
# Impute OWN_CAR_AGE: median of car owners for owners, 0 for non-owners
mdn = X.loc[X['FLAG_OWN_CAR'] == 'Y', 'OWN_CAR_AGE'].median()
X.loc[(X['FLAG_OWN_CAR'] == 'Y') & (X['OWN_CAR_AGE'].isna()), 'OWN_CAR_AGE'] = mdn
X.loc[(X['FLAG_OWN_CAR'] == 'N') & (X['OWN_CAR_AGE'].isna()), 'OWN_CAR_AGE'] = 0

# NAME_INCOME_TYPE == 'Commercial associate': impute OCCUPATION_TYPE
# with the mode of that group
cat_mode = X.loc[X['NAME_INCOME_TYPE'] == 'Commercial associate', 'OCCUPATION_TYPE'].value_counts().index[0]
X.loc[(X['NAME_INCOME_TYPE'] == 'Commercial associate') & (X['OCCUPATION_TYPE'].isna()), 'OCCUPATION_TYPE'] = cat_mode
```
Column transformer with SimpleImputer
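A minimal sketch of such a column transformer, assuming median imputation for numeric columns and most-frequent for categoricals; the column lists are illustrative, not the article's exact ones:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

num_cols = ["AMT_ANNUITY", "AMT_GOODS_PRICE"]  # illustrative numeric columns
cat_cols = ["OCCUPATION_TYPE"]                 # illustrative categorical column

impute_ct = ColumnTransformer([
    ("num_imp", SimpleImputer(strategy="median"), num_cols),
    ("cat_imp", SimpleImputer(strategy="most_frequent"), cat_cols),
])
X_imputed = impute_ct.fit_transform(df)
```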

9) Model Explanation

Dataset with class label
UML for classes
Column Transformer with Custom Transformer for Data Cleaning
```python
class Xfrmer_replacenp(BaseEstimator, TransformerMixin):
    """
    This transformer does a global replace within the dataframe:
    replace 365243 (a sentinel specific to this case study) with 0,
    and replace the XAP/XNA/nan markers with zero.
    """
    def __init__(self):
        # no fitted state is needed
        self._features = None

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X[X == 365243.0] = 0
        X[X == "XAP"] = 0
        X[X == "XNA"] = 0
        X[X == "nan"] = 0
        print('all replace', X.shape)
        return X

# related fragment from the same cleaning step: for object-typed values,
# fill the remaining NaNs with 0
if tmpval.dtype == 'object':
    tmpval = pd.DataFrame(tmpval)
    tmpval.fillna(0, inplace=True)
    X = tmpval
```
Pipeline for converting categorical to numeric using one-hot encoding
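The code below reads the fitted encoder back out of a step named 'CC_OHE'; here is a sketch of what that categorical pipeline could look like (the imputation step and its parameters are assumptions):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# categorical pipeline: impute with the mode, then one-hot encode
catpipe = Pipeline([
    ("CC_IMPUTE", SimpleImputer(strategy="most_frequent")),
    ("CC_OHE", OneHotEncoder(handle_unknown="ignore")),
])
```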
```python
# Recover feature names from the fitted FeatureUnion so the transformed
# arrays can be turned back into labelled dataframes.
col_lvl_ft_names = []
ft_union = crdcrd_ft_gen_piplin['engineer_data']
tpl_xfrmr_lst = ft_union.transformer_list
catpipe = tpl_xfrmr_lst[22][1]          # the categorical pipeline in the union
ccohe = catpipe.named_steps['CC_OHE']   # its fitted one-hot encoder
print(len(ccohe.categories_))
lst_ohe_ft_name = ccohe.categories_[0]

print('Generating column names for numerical features...')
for xfrmr_lvl in range(len(tpl_xfrmr_lst) - 1):
    col_lvl_ft_names.append(tpl_xfrmr_lst[xfrmr_lvl][0])

# add OHE categories as column names
print('Generating column names for categorical features...')
for itm in lst_ohe_ft_name:
    col_lvl_ft_names.append("CC_NMECTRCTSTAT_" + itm.upper())

# recreate the dataframes with the imputed column values
df_trn_colvl = pd.DataFrame(X_trn_xfrm, columns=col_lvl_ft_names)
df_tst_colvl = pd.DataFrame(x_tst_xfrm, columns=col_lvl_ft_names)

# concatenate the two dataframes to the original dataframes
dtprcs.X_train = pd.concat([dtprcs.X_train, df_trn_colvl], axis=1)
dtprcs.x_test = pd.concat([dtprcs.x_test, df_tst_colvl], axis=1)
```
a) Best Estimator for Logistic Regression
b) Best Estimator for Logistic Regression when using Text data
c) Pipeline for processing Text data with multiple TFIDF vectorizers
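A minimal sketch of a text pipeline with multiple TFIDF vectorizers; the column names and n-gram ranges are assumptions, with each text column getting its own vectorizer inside a ColumnTransformer:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# one TFIDF vectorizer per text column, concatenated side by side
text_ct = ColumnTransformer([
    ("tfidf_unigram", TfidfVectorizer(ngram_range=(1, 1)), "text_col_a"),
    ("tfidf_bigram", TfidfVectorizer(ngram_range=(1, 2)), "text_col_b"),
])
text_clf = Pipeline([
    ("vectorize", text_ct),
    ("clf", LogisticRegression(max_iter=1000)),
])
```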
d) Best Estimator for Random Forest

10) Results

Logistic Regression results
Logistic Regression cross-validation results
LightGBM Classifier results
Random Forest results
Random Forest hyperparameter values
Super Learner (Logistic Regression) model results
Super Learner (Logistic Regression) model cross-validation scores
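The article's exact ensemble is not reproduced in these captions; as a minimal sketch, a super learner with a Logistic Regression meta-model can be expressed with sklearn's StackingClassifier (the base estimators here are assumptions, chosen to mirror the models reported above):

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

super_learner = StackingClassifier(
    estimators=[
        ("lgbm", LGBMClassifier()),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold base-model predictions feed the meta-model
)
```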
From the y-axis we can read off the patterns that are associated with default
Here the y-axis shows a (1,1) n-gram (a unigram) that is associated with default.
Report on Kaggle submission

11) Conclusion and Future Work

Read Mathematics formulae as poetry
-- Srikanth Varma

12) Profile

13) References

A machine learning programmer in the making, and a music lover.