Personal Cancer Daignosis

Problem statement :

Janardhanan a r

11 min readApr 15, 2021

Classify the given genetic variations/mutations based on evidence from text-based clinical literature.

Data

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/

Data: Thanks to Memorial Sloan Kettering Cancer Center (MSKCC) for curating this dataset a great help to humanity.

Business Objective and Constraints

There is no Latency requirement
High accuracy is required as we are dealing with human lives
Errors are very costly
Interpretability is very important if the AI/ML model needs to be accepted by the Medical Community. The medical community needs to know how the model arrived at the classification rather than being a black-box.
This being a multi-class classification , the model needs to output probability scores which will help the doctor’s in case of ambiguity

Data Overview

There are two files that contain data.

The files have a common column ID referring to the patient ID.The file training_variants.txt contains the patient ID, gene information, variations and class label. The other file training_text contains the patient ID, and the complete medical record of the patient.

Class Label

There are nine different genetic mutations that are possible. Our class label will have nine distinct values.

Medical Record

The patient would have gone through a battery of tests . This field is going to have details of all those test , transcribed in plain english . for example, if a blood test was conducted on the patient the results of that blood test would be transcribed into english and stored here. This transcription would be at the micro level or every small nitty gritty detail of that test would be included in this field. Every medical test ever conducted on the patient would be converted to englist at the most granular level and stored in this field.

Setting up the ML Problem

Objective:

Predict the class label and probability for each data point belonging to the nine uique classes

Our class label has nine distinct value, hence this turns out into a Multi-class classification problem

Model Evaluation

This being classification we will have to draw the Confusion matrix
Multi-class log loss

Constraints

There is no Latency requirement
High accuracy is required as we are dealing with human lives
Errors are very costly
Interpretability is very important if the AI/ML model needs to be accepted by the Medical Community. The medical community needs to know how the model arrived at the classification rather than being a black-box.
This being a multi-class classification , the model needs to output probability scores which will help the doctor’s in case of ambiguity

Performance Metrics

Confusion matrix
Multi-class Logloss

Data Discovery

data = pd.read_csv('training_variants')
print('Number of data points : ', data.shape[0])
print('Number of features : ', data.shape[1])
print('Features : ', data.columns.values)
data.head()

Training_variants is a comma separated file containing the description of the genetic mutations .

The fields are:

ID : the id of the row used to link the mutation to the clinical evidence
Gene : the gene where this genetic mutation is located
Variation : the aminoacid change for this mutations
Class : 1–9 the class label this genetic mutation has been classified as

# note the seprator in this file
data_text =pd.read_csv("training_text",sep="\|\|",engine="python",names=["ID","TEXT"],skiprows=1)
print('Number of data points : ', data_text.shape[0])
print('Number of features : ', data_text.shape[1])
print('Features : ', data_text.columns.values)
data_text.head()

Preprocessing of Text

We will perform some natural language processing , nothing fancy just simple ones like:

repalce special character a-zA-Z0–9 with blank spaces
replace multiple spaces with single space
convert all characters to lower case
drop all stop words

We perform the above tasks using regular expressions . The stop words removal is from the NLTk library

from nltk.corpus import stopwords# loading stop words from nltk library
stop_words = set(stopwords.words('english'))def nlp_preprocessing(total_text, index, column):
    if type(total_text) is not int:
        string = ""
        # replace every special char with space
        total_text = re.sub('[^a-zA-Z0-9\n]', ' ', total_text)
        # replace multiple spaces with single space
        total_text = re.sub('\s+',' ', total_text)
        # converting all the chars into lower-case.
        total_text = total_text.lower()
        
        for word in total_text.split():
        # if the word is a not a stop word then retain that word     #from the data
            if not word in stop_words:
                string += word + " "
        
        data_text[column][index] = string#text processing stage.
start_time = time.clock()
for index, row in data_text.iterrows():
    if type(row['TEXT']) is str:
        nlp_preprocessing(row['TEXT'], index, 'TEXT')
    else:
        print("there is no text description for id:",index)

Let’s merge the gene_variation and medical records text data using ID column

#merging both gene_variations and text data based on ID
result = pd.merge(data, data_text,on='ID', how='left')
result.head()

Output of dataset after merging gene_variation and text data

Next will be some data cleaning task

result[result.isnull().any(axis=1)]

Wherever the TEXT column does not have data let’s concatenate the Gene and Variation column and store it back into the TEXT column

We will also replace blank strings with underscores for the Gene and Variations column.

result.loc[result['TEXT'].isnull(),'TEXT'] = result['Gene'] +' '+result['Variation']
result.Gene      = result.Gene.str.replace('\s+', '_')
result.Variation = result.Variation.str.replace('\s+', '_')

As always the first thing to do would be split the entire dataset into train,cross-validation and test datasets. We will use sklearn’s train_test_split function to do this.

The CLASS column is the ground truth hence we will seperate it out from the main dataset and then split the dataset into train and test.

y_true = result['Class'].values

Train, Cross-validation, Test

The following splits the dataset into train and test

# split the data into test and train by maintaining same #distribution of output varaible 'y_true' [stratify=y_true]
X_train, test_df, y_train, y_test = train_test_split(result, y_true, stratify=y_true, test_size=0.2)

We use the X_train and y_train and further split into train and cross-validation

# split the train data into train and cross validation by #maintaining same distribution of output varaible 'y_train'       #[stratify=y_train]
train_df, cv_df, y_train, y_cv = train_test_split(X_train, y_train, stratify=y_train, test_size=0.2)

Let’s print shape of the datasets

print('Number of data points in train data:', train_df.shape[0])
print('Number of data points in test data:', test_df.shape[0])
print('Number of data points in cross validation data:', cv_df.shape[0])

Shapes of Train,test and cross-validation

Next we need to check the distribution of class labels across the three datasets

Distribution of class labels

Common Function to draw the bar graph for class labels

def clasdistrbn(df1,df2,hdng):
    """
    draws a bar graph for the series data present in df1
    """
    my_colors = 'rgbkymc'
    df1.plot(kind='bar',color=list(my_colors))
    plt.xlabel('Class')
    plt.ylabel('Data points per Class')
    plt.title(hdng)
    plt.grid()
    plt.show()
    sorted_yi = np.argsort(-df1.values)
    for i in sorted_yi:
        print('Number of data points in class', i+1, ':',df1.values[i], '(', np.round((df1.values[i]/df2.shape[0]*100), 3), '%)')

The code to call the above common function.

Create the class distribution datasets

# it returns a dict, keys as class labels and values as the number of data points in that class
train_class_distribution = train_df['Class'].value_counts().sort_index()
test_class_distribution = test_df['Class'].value_counts().sort_index()
cv_class_distribution = cv_df['Class'].value_counts().sort_index()

Train

clasdistrbn(train_class_distribution,train_df,'Distribution of yi in train data')

Test

clasdistrbn(test_class_distribution,test_df,'Distribution of yi in test data')

Cross Validation

clasdistrbn(cv_class_distribution,cv_df,'Distribution of yi in cross validation data')

Univariate Analysis on Features of the Dataset

Gene Feature

Distribution and total count of Gene in the three datasets

Variation Feature

Distribution and count of Variations in the three datasets

Data Preparation

Featurization for the Gene column

We will first create a dictionary which contains the probability array for each gene/variation

# get_gene_ft_dict: Get Gene varaition Feature Dict
 def get_gv_fea_dict(self,alpha, feature, df):     
     value_count = df[feature].value_counts()
     #print(value_count)
     # gv_dict : Gene Variation Dict, which contains the probability array for each gene/variation
     self.gv_dict = dict()
     
     # denominator will contain the number of time that particular feature occured in whole data
     for i, denominator in value_count.items():
         # vec will contain (p(yi==1/Gi) probability of gene/variation belongs to perticular class
         # vec is 9 diamensional vector
         vec = []
         #this gives the len of unique classes in the data frame
         j = len(df.groupby('Class').nunique())
         self.no_of_clas= j
         j +=1
         for k in range(1,j):cls_cnt = df.loc[(df['Class']==k) & (df[feature]==i)]
             #print(cls_cnt.shape[0],"********************")
             # cls_cnt.shape[0](numerator) will contain the number of time that particular feature occured in whole data
             vec.append((cls_cnt.shape[0] + alpha)/ (denominator + j*alpha))
             #print(vec)
         # we are adding the gene/variation to the dict as key and vec as value
         self.gv_dict[i]=vec
     return self.gv_dict

For every feature values in the given data frame we will check if it is there in the train data then we will add the feature to gv_fea
If it does not exist we will add [1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9] to gv_fea list

# Get Gene variation feature
 def get_gv_feature(self,alpha, feature, df):gv_dict = self.get_gv_fea_dict(alpha, feature, df)
     # value_count is similar in get_gv_fea_dict
     value_count = df[feature].value_counts()
     
     # gv_fea: Gene_variation feature, it will contain the feature for each feature value in the data
     self.gv_fea = []
     
     # for every feature values in the given data frame we will check if it is there in the train data then we will add the feature to gv_fea
     
     # if not we will add [1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9] to gv_fea
     for index, row in df.iterrows():
         if row[feature] in dict(value_count).keys():
             self.gv_fea.append(gv_dict[row[feature]])
         else:
             self.gv_fea.append([1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9])
     
     return self.gv_fea

By passing “Variant”, as the value for the feature parameter the above two functions can be re-used to featurize the variant column in our dataset.

Featurization for TEXT column

First we create a response coding for this TEXT column.

The other feature we create will be using the TFIDFVectorizer on the TEXT column

def text_ft_conv(self):
  #response coding of text features
  self.cls_lbl_dict()
  alpha=10
  # dict_list[i] is build on i'th  class text data
  # total_dict is buid on whole training text data
  self.total_dict = self.crea_dict_frmtxt(self.x_trn)self.Xtest_txt_ft_respCdg  = self.get_text_responsecoding(self.X_test,alpha)
  
  self.xtrain_txt_ft_respCdg  = self.get_text_responsecoding(self.x_trn,alpha)
  self.cval_txt_ft_respCdg  = self.get_text_responsecoding(self.x_cval,alpha)# we convert each row values such that they sum to 1  
  self.Xtest_txt_ft_respCdg = (self.Xtest_txt_ft_respCdg.T/self.Xtest_txt_ft_respCdg.sum(axis=1)).T
  self.xtrain_txt_ft_respCdg = (self.xtrain_txt_ft_respCdg.T/self.xtrain_txt_ft_respCdg.sum(axis=1)).T  
  self.cval_txt_ft_respCdg = (self.cval_txt_ft_respCdg.T/self.cval_txt_ft_respCdg.sum(axis=1)).T#one hot encodings for text
  # building a TfidfVectorizer with all the words that occured minimum 3 times in train data
  self.txt_vectrzr_trntst = TfidfVectorizer(min_df=3)self.xtrain_txt_ft_1hotCdg = self.txt_vectrzr_trntst.fit_transform(self.x_trn['TEXT'])
  self.xtrain_txt_ft_1hotCdg = normalize(self.xtrain_txt_ft_1hotCdg, axis=0)# getting all the feature names (words)
  self.xtrain_txt_ft= self.txt_vectrzr_trntst.get_feature_names()
    
  # we use the same vectorizer that was trained on train data
  self.Xtest_txt_ft_1hotCdg = self.txt_vectrzr_trntst.transform(self.X_test['TEXT'])
  # don't forget to normalize every feature
  self.Xtest_txt_ft_1hotCdg = normalize(self.Xtest_txt_ft_1hotCdg, axis=0)# we use the same vectorizer that was trained on train data
  self.xcval_txt_ft_1hotCdg = self.txt_vectrzr_trntst.transform(self.x_cval['TEXT'])
  # don't forget to normalize every feature
  self.xcval_txt_ft_1hotCdg = normalize(self.xcval_txt_ft_1hotCdg, axis=0)
  print("train_txt_feature_onehotCoding . shape of gene :", self.xtrain_txt_ft_1hotCdg.shape)
  print("cval_txt_feature_onehotCoding . shape of gene :", self.xcval_txt_ft_1hotCdg.shape)

At this stage we will have the response and one-hot encoding for the three features(Gene,Variant and Text) in the dataset.

Stacking these features to make the final dataset that can be input into the model.

def stack_3_features(self):   print('into stacking features')   self.X_test_gn_vrtn_1hotcdg  =     hstack([self.Xtest_gn_ft_1hotCdg,self.Xtest_vrtn_ft_1hotCdg])
  self.x_train_gn_vrtn_1hotcdg = hstack([self.train_gn_ft_1hotCdg,self.xtrain_vrtn_ft_1hotCdg])
  self.x_cval_gn_vrtn_1hotcdg = hstack([self.cval_gn_ft_1hotCdg,self.cval_vrtn_ft_1hotCdg])  self.X_test_1hotCdg  = hstack(([self.X_test_gn_vrtn_1hotcdg,self.Xtest_txt_ft_1hotCdg]))
  self.y_test =  np.array(list(self.X_test['Class']))self.x_train_1hotCdg = hstack(([self.x_train_gn_vrtn_1hotcdg,self.xtrain_txt_ft_1hotCdg]))
  self.y_trn =   np.array(list(self.x_trn['Class']))self.x_cval_1hotCdg= hstack(([self.x_cval_gn_vrtn_1hotcdg,self.xcval_txt_ft_1hotCdg]))
  self.y_cval = np.array(list(self.x_cval['Class']))#response coding
  X_test_gn_vrtn_respCdg = np.hstack([self.Xtest_gn_ft_respCdg,self.Xtest_vrtn_ft_respCdg])
  x_trn_gn_vrtn_respCdg = np.hstack([self.trn_gn_ft_respCdg,self.xtrain_vrtn_ft_respCdg])
  x_cval_gn_vrtn_respCdg = np.hstack([self.cval_gn_ft_respCdg,self.cval_gn_ft_respCdg])  self.X_test_respCdg = np.hstack([X_test_gn_vrtn_respCdg,self.Xtest_txt_ft_respCdg])
  self.x_trn_respCdg = np.hstack([x_trn_gn_vrtn_respCdg,self.xtrain_txt_ft_respCdg])
  self.x_cval_respCdg = np.hstack([x_cval_gn_vrtn_respCdg,self.cval_txt_ft_respCdg])print("One hot encoding features :")
  print("(number of data points * number of features) in test data = ", self.X_test_1hotCdg.shape)
  print("(number of data points * number of features) in train data = ", self.x_train_1hotCdg.shape)
  print("(number of data points * number of features) in cross validation data =", self.x_cval_1hotCdg.shape)print("Response encoding features :")
  print("(number of data points * number of features) in test data = ", self.X_test_respCdg.shape)
  print("(number of data points * number of features) in train data = ", self.x_trn_respCdg.shape)
  print("(number of data points * number of features) in cross validation data =", self.x_cval_respCdg.shape)

I have used the above data in seven different Machine Learning models Naive Bayes, K-Nearest Neighbour, Logistic Regression, Linear SVM, Random Forest, Stacking Classifier and Maximum Voting Classifier.

While using TFIDF Vectorizer I created two different dataset :-

included only those words that occured minimum three times in the TEXT column and all features of the vectorizer ( option 1 )
included only the first 1000 features (option 2)

In this post we will talk about using Naive Bayes algorithm on the dataset created from option 1.

NaiveBayes

Probabilistic Model
ASSUMPTIOM — Conditional independence of Features

Conditional Independence — assumption of Naive Bayes

Training and Test Stage

Test Stage or Runtime

Train / Runtime conclusion

Training stage we are storing only ratios/likelihoods
Runtime is much more memory efficient

Bias -Variance Trade off

High Bias → underfitting

High Variance → overfitting

There is only one parameter alpha (that’s part of Laplace smoothing) that determines under/over fitting

Case 1 alpha=0

Small changes in D-train results in large changes to the model. This large model change is termed as high variance which leads to overfitting

Case 2 alpha is large

In this case we will get into a situation of underfitting which leads to high bias.

We find the right alpha either using simple or 10-fold cross validation

Naive Bayes Implementation

Our dataset is Multinomial hence we will use Sklearns’s MultinomialNaiveBayes which has the following signature

class sklearn.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)

Naive Bayes Calling Code

Code to perform Hyper-parameter tuning

naibay.mnbayes_hyperparm_tuning(cancdiag.x_train_1hotCdg, cancdiag.x_cval_1hotCdg, cancdiag.y_trn, cancdiag.y_cval)

Code that uses the Hyper-parameter tuned values

naibay.mnbayes_using_hyperparm(cancdiag.X_test_1hotCdg,cancdiag.x_train_1hotCdg,cancdiag.x_cval_1hotCdg,cancdiag.y_test,cancdiag.y_trn,cancdiag.y_cval)

Code for testing the Naive Bayes Classifier

test_pt_idx = 4
no_of_ft = 100
naibay.clasifier_testing(cancdiag.y_test,test_pt_idx,no_of_ft)
 naibay.get_impfeature_names(cancdiag.X_test,cancdiag.X_test['TEXT'].iloc[test_pt_idx],cancdiag.X_test['Gene'].iloc[test_pt_idx],cancdiag.X_test['Variation'].iloc[test_pt_idx],cancdiag.gene_feat_names , cancdiag.vrtn_feat_names,cancdiag.xtrain_txt_ft,no_of_ft)

Naive Bayes Hyper-parameter tuning output

Code for Confusion , Precision and Recall matrices

confsn_mtx = load_data(dirname,'/mnbayesop/confsn_mtx')
precsn_mtx = load_data(dirname,'/mnbayesop/precsn_mtx')
recall_mtx = load_data(dirname,'/mnbayesop/recall_mtx')
labels = load_data(dirname,'/mnbayesop/clas_lbls')print("The cross validation log loss is:",cval_log_los)
no_mis_clf_pts = load_data(dirname,'/mnbayesop/hptun_numisclspts')print("Number of missclassified point :",no_mis_clf_pts)
# representing A in heatmap format
print("-"*20, "Confusion matrix", "-"*20)
plt.figure(figsize=(20,7))
sns.heatmap(confsn_mtx, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()print("-"*20, "Precision matrix (Columm Sum=1)", "-"*20)
plt.figure(figsize=(20,7))
sns.heatmap(precsn_mtx, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()# representing B in heatmap format
print("-"*20, "Recall matrix (Row sum=1)", "-"*20)
plt.figure(figsize=(20,7))
sns.heatmap(recall_mtx, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()

Conclusion

Naive Bayes Classifier Inference and Conclusion

Percentage of mis-classified points is 43 percent hence for this dataset with these set of features Naive Bayes is not performing very well
From the confusion matrix we can see that the classifier is not able to differentiate other labels from Class 3 inspite of Class 3 having very less amount of data points

The complete code is available at https://github.com/ariyurjana/Personalized_Cancer_diagnosis

The code only for Naive Bayes is avialable at

https://github.com/ariyurjana/Personalized_Cancer_diagnosis/blob/final_case_study_1/PersCancDiag_NaiveBayes.ipynb

EDIT:

Code snippets I have included are not getting formatted properly hence I have tried to display code as images . Need to find a better way for displaying code.
the ipython notebook for the individual models will take some time to get upload to Github, sorry for this inconvenience