Personal Cancer Daignosis

Problem statement :

Janardhanan a r
11 min readApr 15, 2021

--

Classify the given genetic variations/mutations based on evidence from text-based clinical literature.

Data

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/

Data: Thanks to Memorial Sloan Kettering Cancer Center (MSKCC) for curating this dataset a great help to humanity.

Business Objective and Constraints

  • There is no Latency requirement
  • High accuracy is required as we are dealing with human lives
  • Errors are very costly
  • Interpretability is very important if the AI/ML model needs to be accepted by the Medical Community. The medical community needs to know how the model arrived at the classification rather than being a black-box.
  • This being a multi-class classification , the model needs to output probability scores which will help the doctor’s in case of ambiguity

Data Overview

There are two files that contain data.

The files have a common column ID referring to the patient ID.The file training_variants.txt contains the patient ID, gene information, variations and class label. The other file training_text contains the patient ID, and the complete medical record of the patient.

Class Label

There are nine different genetic mutations that are possible. Our class label will have nine distinct values.

Medical Record

The patient would have gone through a battery of tests . This field is going to have details of all those test , transcribed in plain english . for example, if a blood test was conducted on the patient the results of that blood test would be transcribed into english and stored here. This transcription would be at the micro level or every small nitty gritty detail of that test would be included in this field. Every medical test ever conducted on the patient would be converted to englist at the most granular level and stored in this field.

Setting up the ML Problem

Objective:

Predict the class label and probability for each data point belonging to the nine uique classes

Our class label has nine distinct value, hence this turns out into a Multi-class classification problem

Model Evaluation

  • This being classification we will have to draw the Confusion matrix
  • Multi-class log loss

Constraints

  • There is no Latency requirement
  • High accuracy is required as we are dealing with human lives
  • Errors are very costly
  • Interpretability is very important if the AI/ML model needs to be accepted by the Medical Community. The medical community needs to know how the model arrived at the classification rather than being a black-box.
  • This being a multi-class classification , the model needs to output probability scores which will help the doctor’s in case of ambiguity

Performance Metrics

  • Confusion matrix
  • Multi-class Logloss

Data Discovery

data = pd.read_csv('training_variants')
print('Number of data points : ', data.shape[0])
print('Number of features : ', data.shape[1])
print('Features : ', data.columns.values)
data.head()
output of training_variants

Training_variants is a comma separated file containing the description of the genetic mutations .

The fields are:

  • ID : the id of the row used to link the mutation to the clinical evidence
  • Gene : the gene where this genetic mutation is located
  • Variation : the aminoacid change for this mutations
  • Class : 1–9 the class label this genetic mutation has been classified as
# note the seprator in this file
data_text =pd.read_csv("training_text",sep="\|\|",engine="python",names=["ID","TEXT"],skiprows=1)
print('Number of data points : ', data_text.shape[0])
print('Number of features : ', data_text.shape[1])
print('Features : ', data_text.columns.values)
data_text.head()
output of training_text

Preprocessing of Text

We will perform some natural language processing , nothing fancy just simple ones like:

  • repalce special character a-zA-Z0–9 with blank spaces
  • replace multiple spaces with single space
  • convert all characters to lower case
  • drop all stop words

We perform the above tasks using regular expressions . The stop words removal is from the NLTk library

from nltk.corpus import stopwords# loading stop words from nltk library
stop_words = set(stopwords.words('english'))
def nlp_preprocessing(total_text, index, column):
if type(total_text) is not int:
string = ""
# replace every special char with space
total_text = re.sub('[^a-zA-Z0-9\n]', ' ', total_text)
# replace multiple spaces with single space
total_text = re.sub('\s+',' ', total_text)
# converting all the chars into lower-case.
total_text = total_text.lower()

for word in total_text.split():
# if the word is a not a stop word then retain that word #from the data
if not word in stop_words:
string += word + " "

data_text[column][index] = string
#text processing stage.
start_time = time.clock()
for index, row in data_text.iterrows():
if type(row['TEXT']) is str:
nlp_preprocessing(row['TEXT'], index, 'TEXT')
else:
print("there is no text description for id:",index)
output of pre-processing

Let’s merge the gene_variation and medical records text data using ID column

#merging both gene_variations and text data based on ID
result = pd.merge(data, data_text,on='ID', how='left')
result.head()
Output of dataset after merging gene_variation and text data

Next will be some data cleaning task

result[result.isnull().any(axis=1)]

Wherever the TEXT column does not have data let’s concatenate the Gene and Variation column and store it back into the TEXT column

We will also replace blank strings with underscores for the Gene and Variations column.

result.loc[result['TEXT'].isnull(),'TEXT'] = result['Gene'] +' '+result['Variation']
result.Gene = result.Gene.str.replace('\s+', '_')
result.Variation = result.Variation.str.replace('\s+', '_')

As always the first thing to do would be split the entire dataset into train,cross-validation and test datasets. We will use sklearn’s train_test_split function to do this.

The CLASS column is the ground truth hence we will seperate it out from the main dataset and then split the dataset into train and test.

y_true = result['Class'].values

Train, Cross-validation, Test

The following splits the dataset into train and test

# split the data into test and train by maintaining same #distribution of output varaible 'y_true' [stratify=y_true]
X_train, test_df, y_train, y_test = train_test_split(result, y_true, stratify=y_true, test_size=0.2)

We use the X_train and y_train and further split into train and cross-validation

# split the train data into train and cross validation by #maintaining same distribution of output varaible 'y_train'       #[stratify=y_train]
train_df, cv_df, y_train, y_cv = train_test_split(X_train, y_train, stratify=y_train, test_size=0.2)

Let’s print shape of the datasets

print('Number of data points in train data:', train_df.shape[0])
print('Number of data points in test data:', test_df.shape[0])
print('Number of data points in cross validation data:', cv_df.shape[0])
Shapes of Train,test and cross-validation

Next we need to check the distribution of class labels across the three datasets

Distribution of class labels

Common Function to draw the bar graph for class labels

def clasdistrbn(df1,df2,hdng):
"""
draws a bar graph for the series data present in df1
"""
my_colors = 'rgbkymc'
df1.plot(kind='bar',color=list(my_colors))
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title(hdng)
plt.grid()
plt.show()
sorted_yi = np.argsort(-df1.values)
for i in sorted_yi:
print('Number of data points in class', i+1, ':',df1.values[i], '(', np.round((df1.values[i]/df2.shape[0]*100), 3), '%)')

The code to call the above common function.

Create the class distribution datasets

# it returns a dict, keys as class labels and values as the number of data points in that class
train_class_distribution = train_df['Class'].value_counts().sort_index()
test_class_distribution = test_df['Class'].value_counts().sort_index()
cv_class_distribution = cv_df['Class'].value_counts().sort_index()

Train

clasdistrbn(train_class_distribution,train_df,'Distribution of yi in train data')
Class distribution of training data

Test

clasdistrbn(test_class_distribution,test_df,'Distribution of yi in test data')
Class Distribution of test data

Cross Validation

clasdistrbn(cv_class_distribution,cv_df,'Distribution of yi in cross validation data')
Class Distribution of Cross Validation data

Univariate Analysis on Features of the Dataset

Gene Feature

Distribution and total count of Gene in the three datasets

Variation Feature

Distribution and count of Variations in the three datasets

Data Preparation

Featurization for the Gene column

We will first create a dictionary which contains the probability array for each gene/variation

# get_gene_ft_dict: Get Gene varaition Feature Dict
def get_gv_fea_dict(self,alpha, feature, df):
value_count = df[feature].value_counts()
#print(value_count)
# gv_dict : Gene Variation Dict, which contains the probability array for each gene/variation
self.gv_dict = dict()

# denominator will contain the number of time that particular feature occured in whole data
for i, denominator in value_count.items():
# vec will contain (p(yi==1/Gi) probability of gene/variation belongs to perticular class
# vec is 9 diamensional vector
vec = []
#this gives the len of unique classes in the data frame
j = len(df.groupby('Class').nunique())
self.no_of_clas= j
j +=1
for k in range(1,j):
cls_cnt = df.loc[(df['Class']==k) & (df[feature]==i)]
#print(cls_cnt.shape[0],"********************")
# cls_cnt.shape[0](numerator) will contain the number of time that particular feature occured in whole data
vec.append((cls_cnt.shape[0] + alpha)/ (denominator + j*alpha))
#print(vec)
# we are adding the gene/variation to the dict as key and vec as value
self.gv_dict[i]=vec
return self.gv_dict

For every feature values in the given data frame we will check if it is there in the train data then we will add the feature to gv_fea
If it does not exist we will add [1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9] to gv_fea list

# Get Gene variation feature
def get_gv_feature(self,alpha, feature, df):
gv_dict = self.get_gv_fea_dict(alpha, feature, df)
# value_count is similar in get_gv_fea_dict
value_count = df[feature].value_counts()

# gv_fea: Gene_variation feature, it will contain the feature for each feature value in the data
self.gv_fea = []

# for every feature values in the given data frame we will check if it is there in the train data then we will add the feature to gv_fea

# if not we will add [1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9] to gv_fea
for index, row in df.iterrows():
if row[feature] in dict(value_count).keys():
self.gv_fea.append(gv_dict[row[feature]])
else:
self.gv_fea.append([1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9])

return self.gv_fea

By passing “Variant”, as the value for the feature parameter the above two functions can be re-used to featurize the variant column in our dataset.

Featurization for TEXT column

First we create a response coding for this TEXT column.

The other feature we create will be using the TFIDFVectorizer on the TEXT column

def text_ft_conv(self):
#response coding of text features
self.cls_lbl_dict()
alpha=10
# dict_list[i] is build on i'th class text data
# total_dict is buid on whole training text data
self.total_dict = self.crea_dict_frmtxt(self.x_trn)
self.Xtest_txt_ft_respCdg = self.get_text_responsecoding(self.X_test,alpha)

self.xtrain_txt_ft_respCdg = self.get_text_responsecoding(self.x_trn,alpha)
self.cval_txt_ft_respCdg = self.get_text_responsecoding(self.x_cval,alpha)
# we convert each row values such that they sum to 1
self.Xtest_txt_ft_respCdg = (self.Xtest_txt_ft_respCdg.T/self.Xtest_txt_ft_respCdg.sum(axis=1)).T
self.xtrain_txt_ft_respCdg = (self.xtrain_txt_ft_respCdg.T/self.xtrain_txt_ft_respCdg.sum(axis=1)).T
self.cval_txt_ft_respCdg = (self.cval_txt_ft_respCdg.T/self.cval_txt_ft_respCdg.sum(axis=1)).T
#one hot encodings for text
# building a TfidfVectorizer with all the words that occured minimum 3 times in train data
self.txt_vectrzr_trntst = TfidfVectorizer(min_df=3)
self.xtrain_txt_ft_1hotCdg = self.txt_vectrzr_trntst.fit_transform(self.x_trn['TEXT'])
self.xtrain_txt_ft_1hotCdg = normalize(self.xtrain_txt_ft_1hotCdg, axis=0)
# getting all the feature names (words)
self.xtrain_txt_ft= self.txt_vectrzr_trntst.get_feature_names()

# we use the same vectorizer that was trained on train data
self.Xtest_txt_ft_1hotCdg = self.txt_vectrzr_trntst.transform(self.X_test['TEXT'])
# don't forget to normalize every feature
self.Xtest_txt_ft_1hotCdg = normalize(self.Xtest_txt_ft_1hotCdg, axis=0)
# we use the same vectorizer that was trained on train data
self.xcval_txt_ft_1hotCdg = self.txt_vectrzr_trntst.transform(self.x_cval['TEXT'])
# don't forget to normalize every feature
self.xcval_txt_ft_1hotCdg = normalize(self.xcval_txt_ft_1hotCdg, axis=0)
print("train_txt_feature_onehotCoding . shape of gene :", self.xtrain_txt_ft_1hotCdg.shape)
print("cval_txt_feature_onehotCoding . shape of gene :", self.xcval_txt_ft_1hotCdg.shape)

At this stage we will have the response and one-hot encoding for the three features(Gene,Variant and Text) in the dataset.

Stacking these features to make the final dataset that can be input into the model.

def stack_3_features(self):   print('into stacking features')   self.X_test_gn_vrtn_1hotcdg  =     hstack([self.Xtest_gn_ft_1hotCdg,self.Xtest_vrtn_ft_1hotCdg])
self.x_train_gn_vrtn_1hotcdg = hstack([self.train_gn_ft_1hotCdg,self.xtrain_vrtn_ft_1hotCdg])
self.x_cval_gn_vrtn_1hotcdg = hstack([self.cval_gn_ft_1hotCdg,self.cval_vrtn_ft_1hotCdg])
self.X_test_1hotCdg = hstack(([self.X_test_gn_vrtn_1hotcdg,self.Xtest_txt_ft_1hotCdg]))
self.y_test = np.array(list(self.X_test['Class']))
self.x_train_1hotCdg = hstack(([self.x_train_gn_vrtn_1hotcdg,self.xtrain_txt_ft_1hotCdg]))
self.y_trn = np.array(list(self.x_trn['Class']))
self.x_cval_1hotCdg= hstack(([self.x_cval_gn_vrtn_1hotcdg,self.xcval_txt_ft_1hotCdg]))
self.y_cval = np.array(list(self.x_cval['Class']))
#response coding
X_test_gn_vrtn_respCdg = np.hstack([self.Xtest_gn_ft_respCdg,self.Xtest_vrtn_ft_respCdg])
x_trn_gn_vrtn_respCdg = np.hstack([self.trn_gn_ft_respCdg,self.xtrain_vrtn_ft_respCdg])
x_cval_gn_vrtn_respCdg = np.hstack([self.cval_gn_ft_respCdg,self.cval_gn_ft_respCdg])
self.X_test_respCdg = np.hstack([X_test_gn_vrtn_respCdg,self.Xtest_txt_ft_respCdg])
self.x_trn_respCdg = np.hstack([x_trn_gn_vrtn_respCdg,self.xtrain_txt_ft_respCdg])
self.x_cval_respCdg = np.hstack([x_cval_gn_vrtn_respCdg,self.cval_txt_ft_respCdg])
print("One hot encoding features :")
print("(number of data points * number of features) in test data = ", self.X_test_1hotCdg.shape)
print("(number of data points * number of features) in train data = ", self.x_train_1hotCdg.shape)
print("(number of data points * number of features) in cross validation data =", self.x_cval_1hotCdg.shape)
print("Response encoding features :")
print("(number of data points * number of features) in test data = ", self.X_test_respCdg.shape)
print("(number of data points * number of features) in train data = ", self.x_trn_respCdg.shape)
print("(number of data points * number of features) in cross validation data =", self.x_cval_respCdg.shape)

I have used the above data in seven different Machine Learning models Naive Bayes, K-Nearest Neighbour, Logistic Regression, Linear SVM, Random Forest, Stacking Classifier and Maximum Voting Classifier.

While using TFIDF Vectorizer I created two different dataset :-

  • included only those words that occured minimum three times in the TEXT column and all features of the vectorizer ( option 1 )
  • included only the first 1000 features (option 2)

In this post we will talk about using Naive Bayes algorithm on the dataset created from option 1.

NaiveBayes

  • Probabilistic Model
  • ASSUMPTIOM — Conditional independence of Features
Introduction NaiveBayes Algorithm
Conditional Independence — assumption of Naive Bayes
Naive Bayes Final Equations

Training and Test Stage

Train stage time and space complexity

Test Stage or Runtime

Test Stage time complexity

Train / Runtime conclusion

  • Training stage we are storing only ratios/likelihoods
  • Runtime is much more memory efficient

Bias -Variance Trade off

High Bias → underfitting

High Variance → overfitting

There is only one parameter alpha (that’s part of Laplace smoothing) that determines under/over fitting

Case 1 alpha=0

Small changes in D-train results in large changes to the model. This large model change is termed as high variance which leads to overfitting

Case 2 alpha is large

In this case we will get into a situation of underfitting which leads to high bias.

We find the right alpha either using simple or 10-fold cross validation

Naive Bayes Implementation

Our dataset is Multinomial hence we will use Sklearns’s MultinomialNaiveBayes which has the following signature

class sklearn.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)

Naive Bayes Calling Code

Code to perform Hyper-parameter tuning

naibay.mnbayes_hyperparm_tuning(cancdiag.x_train_1hotCdg, cancdiag.x_cval_1hotCdg, cancdiag.y_trn, cancdiag.y_cval)

Code that uses the Hyper-parameter tuned values

naibay.mnbayes_using_hyperparm(cancdiag.X_test_1hotCdg,cancdiag.x_train_1hotCdg,cancdiag.x_cval_1hotCdg,cancdiag.y_test,cancdiag.y_trn,cancdiag.y_cval)

Code for testing the Naive Bayes Classifier

test_pt_idx = 4
no_of_ft = 100
naibay.clasifier_testing(cancdiag.y_test,test_pt_idx,no_of_ft)
naibay.get_impfeature_names(cancdiag.X_test,cancdiag.X_test['TEXT'].iloc[test_pt_idx],cancdiag.X_test['Gene'].iloc[test_pt_idx],cancdiag.X_test['Variation'].iloc[test_pt_idx],cancdiag.gene_feat_names , cancdiag.vrtn_feat_names,cancdiag.xtrain_txt_ft,no_of_ft)

Naive Bayes Hyper-parameter tuning output

Code for Confusion , Precision and Recall matrices

confsn_mtx = load_data(dirname,'/mnbayesop/confsn_mtx')
precsn_mtx = load_data(dirname,'/mnbayesop/precsn_mtx')
recall_mtx = load_data(dirname,'/mnbayesop/recall_mtx')
labels = load_data(dirname,'/mnbayesop/clas_lbls')
print("The cross validation log loss is:",cval_log_los)
no_mis_clf_pts = load_data(dirname,'/mnbayesop/hptun_numisclspts')
print("Number of missclassified point :",no_mis_clf_pts)
# representing A in heatmap format
print("-"*20, "Confusion matrix", "-"*20)
plt.figure(figsize=(20,7))
sns.heatmap(confsn_mtx, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()
print("-"*20, "Precision matrix (Columm Sum=1)", "-"*20)
plt.figure(figsize=(20,7))
sns.heatmap(precsn_mtx, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()
# representing B in heatmap format
print("-"*20, "Recall matrix (Row sum=1)", "-"*20)
plt.figure(figsize=(20,7))
sns.heatmap(recall_mtx, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()
Confusion Matrix
Precision Matrix
Recall Matrix

Conclusion

Naive Bayes Final Metrics table

Naive Bayes Classifier Inference and Conclusion

  • Percentage of mis-classified points is 43 percent hence for this dataset with these set of features Naive Bayes is not performing very well
  • From the confusion matrix we can see that the classifier is not able to differentiate other labels from Class 3 inspite of Class 3 having very less amount of data points

The complete code is available at https://github.com/ariyurjana/Personalized_Cancer_diagnosis

The code only for Naive Bayes is avialable at

https://github.com/ariyurjana/Personalized_Cancer_diagnosis/blob/final_case_study_1/PersCancDiag_NaiveBayes.ipynb

EDIT:

  • Code snippets I have included are not getting formatted properly hence I have tried to display code as images . Need to find a better way for displaying code.
  • the ipython notebook for the individual models will take some time to get upload to Github, sorry for this inconvenience

References

  1. www.appliedaicourse.com
  2. www.kaggle.com
  3. www.stackoverflow.com
  4. www.google.co.in

--

--

Janardhanan a r

In the making Machine Learner programmer music lover