Personal Cancer Daignosis
Problem statement :
Classify the given genetic variations/mutations based on evidence from text-based clinical literature.
Data
Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/
Data: Thanks to Memorial Sloan Kettering Cancer Center (MSKCC) for curating this dataset a great help to humanity.
Business Objective and Constraints
- There is no Latency requirement
- High accuracy is required as we are dealing with human lives
- Errors are very costly
- Interpretability is very important if the AI/ML model needs to be accepted by the Medical Community. The medical community needs to know how the model arrived at the classification rather than being a black-box.
- This being a multi-class classification , the model needs to output probability scores which will help the doctor’s in case of ambiguity
Data Overview
There are two files that contain data.
The files have a common column ID referring to the patient ID.The file training_variants.txt contains the patient ID, gene information, variations and class label. The other file training_text contains the patient ID, and the complete medical record of the patient.
Class Label
There are nine different genetic mutations that are possible. Our class label will have nine distinct values.
Medical Record
The patient would have gone through a battery of tests . This field is going to have details of all those test , transcribed in plain english . for example, if a blood test was conducted on the patient the results of that blood test would be transcribed into english and stored here. This transcription would be at the micro level or every small nitty gritty detail of that test would be included in this field. Every medical test ever conducted on the patient would be converted to englist at the most granular level and stored in this field.
Setting up the ML Problem
Objective:
Predict the class label and probability for each data point belonging to the nine uique classes
Our class label has nine distinct value, hence this turns out into a Multi-class classification problem
Model Evaluation
- This being classification we will have to draw the Confusion matrix
- Multi-class log loss
Constraints
- There is no Latency requirement
- High accuracy is required as we are dealing with human lives
- Errors are very costly
- Interpretability is very important if the AI/ML model needs to be accepted by the Medical Community. The medical community needs to know how the model arrived at the classification rather than being a black-box.
- This being a multi-class classification , the model needs to output probability scores which will help the doctor’s in case of ambiguity
Performance Metrics
- Confusion matrix
- Multi-class Logloss
Data Discovery
data = pd.read_csv('training_variants')
print('Number of data points : ', data.shape[0])
print('Number of features : ', data.shape[1])
print('Features : ', data.columns.values)
data.head()
Training_variants is a comma separated file containing the description of the genetic mutations .
The fields are:
- ID : the id of the row used to link the mutation to the clinical evidence
- Gene : the gene where this genetic mutation is located
- Variation : the aminoacid change for this mutations
- Class : 1–9 the class label this genetic mutation has been classified as
# note the seprator in this file
data_text =pd.read_csv("training_text",sep="\|\|",engine="python",names=["ID","TEXT"],skiprows=1)
print('Number of data points : ', data_text.shape[0])
print('Number of features : ', data_text.shape[1])
print('Features : ', data_text.columns.values)
data_text.head()
Preprocessing of Text
We will perform some natural language processing , nothing fancy just simple ones like:
- repalce special character a-zA-Z0–9 with blank spaces
- replace multiple spaces with single space
- convert all characters to lower case
- drop all stop words
We perform the above tasks using regular expressions . The stop words removal is from the NLTk library
from nltk.corpus import stopwords# loading stop words from nltk library
stop_words = set(stopwords.words('english'))def nlp_preprocessing(total_text, index, column):
if type(total_text) is not int:
string = ""
# replace every special char with space
total_text = re.sub('[^a-zA-Z0-9\n]', ' ', total_text)
# replace multiple spaces with single space
total_text = re.sub('\s+',' ', total_text)
# converting all the chars into lower-case.
total_text = total_text.lower()
for word in total_text.split():
# if the word is a not a stop word then retain that word #from the data
if not word in stop_words:
string += word + " "
data_text[column][index] = string#text processing stage.
start_time = time.clock()
for index, row in data_text.iterrows():
if type(row['TEXT']) is str:
nlp_preprocessing(row['TEXT'], index, 'TEXT')
else:
print("there is no text description for id:",index)
Let’s merge the gene_variation and medical records text data using ID column
#merging both gene_variations and text data based on ID
result = pd.merge(data, data_text,on='ID', how='left')
result.head()
Next will be some data cleaning task
result[result.isnull().any(axis=1)]
Wherever the TEXT column does not have data let’s concatenate the Gene and Variation column and store it back into the TEXT column
We will also replace blank strings with underscores for the Gene and Variations column.
result.loc[result['TEXT'].isnull(),'TEXT'] = result['Gene'] +' '+result['Variation']
result.Gene = result.Gene.str.replace('\s+', '_')
result.Variation = result.Variation.str.replace('\s+', '_')
As always the first thing to do would be split the entire dataset into train,cross-validation and test datasets. We will use sklearn’s train_test_split function to do this.
The CLASS column is the ground truth hence we will seperate it out from the main dataset and then split the dataset into train and test.
y_true = result['Class'].values
Train, Cross-validation, Test
The following splits the dataset into train and test
# split the data into test and train by maintaining same #distribution of output varaible 'y_true' [stratify=y_true]
X_train, test_df, y_train, y_test = train_test_split(result, y_true, stratify=y_true, test_size=0.2)
We use the X_train and y_train and further split into train and cross-validation
# split the train data into train and cross validation by #maintaining same distribution of output varaible 'y_train' #[stratify=y_train]
train_df, cv_df, y_train, y_cv = train_test_split(X_train, y_train, stratify=y_train, test_size=0.2)
Let’s print shape of the datasets
print('Number of data points in train data:', train_df.shape[0])
print('Number of data points in test data:', test_df.shape[0])
print('Number of data points in cross validation data:', cv_df.shape[0])
Next we need to check the distribution of class labels across the three datasets
Distribution of class labels
Common Function to draw the bar graph for class labels
def clasdistrbn(df1,df2,hdng):
"""
draws a bar graph for the series data present in df1
"""
my_colors = 'rgbkymc'
df1.plot(kind='bar',color=list(my_colors))
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title(hdng)
plt.grid()
plt.show()
sorted_yi = np.argsort(-df1.values)
for i in sorted_yi:
print('Number of data points in class', i+1, ':',df1.values[i], '(', np.round((df1.values[i]/df2.shape[0]*100), 3), '%)')
The code to call the above common function.
Create the class distribution datasets
# it returns a dict, keys as class labels and values as the number of data points in that class
train_class_distribution = train_df['Class'].value_counts().sort_index()
test_class_distribution = test_df['Class'].value_counts().sort_index()
cv_class_distribution = cv_df['Class'].value_counts().sort_index()
Train
clasdistrbn(train_class_distribution,train_df,'Distribution of yi in train data')
Test
clasdistrbn(test_class_distribution,test_df,'Distribution of yi in test data')
Cross Validation
clasdistrbn(cv_class_distribution,cv_df,'Distribution of yi in cross validation data')
Univariate Analysis on Features of the Dataset
Gene Feature
Distribution and total count of Gene in the three datasets
Variation Feature
Distribution and count of Variations in the three datasets
Data Preparation
Featurization for the Gene column
We will first create a dictionary which contains the probability array for each gene/variation
# get_gene_ft_dict: Get Gene varaition Feature Dict
def get_gv_fea_dict(self,alpha, feature, df):
value_count = df[feature].value_counts()
#print(value_count)
# gv_dict : Gene Variation Dict, which contains the probability array for each gene/variation
self.gv_dict = dict()
# denominator will contain the number of time that particular feature occured in whole data
for i, denominator in value_count.items():
# vec will contain (p(yi==1/Gi) probability of gene/variation belongs to perticular class
# vec is 9 diamensional vector
vec = []
#this gives the len of unique classes in the data frame
j = len(df.groupby('Class').nunique())
self.no_of_clas= j
j +=1
for k in range(1,j):cls_cnt = df.loc[(df['Class']==k) & (df[feature]==i)]
#print(cls_cnt.shape[0],"********************")
# cls_cnt.shape[0](numerator) will contain the number of time that particular feature occured in whole data
vec.append((cls_cnt.shape[0] + alpha)/ (denominator + j*alpha))
#print(vec)
# we are adding the gene/variation to the dict as key and vec as value
self.gv_dict[i]=vec
return self.gv_dict
For every feature values in the given data frame we will check if it is there in the train data then we will add the feature to gv_fea
If it does not exist we will add [1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9] to gv_fea list
# Get Gene variation feature
def get_gv_feature(self,alpha, feature, df):gv_dict = self.get_gv_fea_dict(alpha, feature, df)
# value_count is similar in get_gv_fea_dict
value_count = df[feature].value_counts()
# gv_fea: Gene_variation feature, it will contain the feature for each feature value in the data
self.gv_fea = []
# for every feature values in the given data frame we will check if it is there in the train data then we will add the feature to gv_fea
# if not we will add [1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9] to gv_fea
for index, row in df.iterrows():
if row[feature] in dict(value_count).keys():
self.gv_fea.append(gv_dict[row[feature]])
else:
self.gv_fea.append([1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9])
return self.gv_fea
By passing “Variant”, as the value for the feature parameter the above two functions can be re-used to featurize the variant column in our dataset.
Featurization for TEXT column
First we create a response coding for this TEXT column.
The other feature we create will be using the TFIDFVectorizer on the TEXT column
def text_ft_conv(self):
#response coding of text features
self.cls_lbl_dict()
alpha=10
# dict_list[i] is build on i'th class text data
# total_dict is buid on whole training text data
self.total_dict = self.crea_dict_frmtxt(self.x_trn)self.Xtest_txt_ft_respCdg = self.get_text_responsecoding(self.X_test,alpha)
self.xtrain_txt_ft_respCdg = self.get_text_responsecoding(self.x_trn,alpha)
self.cval_txt_ft_respCdg = self.get_text_responsecoding(self.x_cval,alpha)# we convert each row values such that they sum to 1
self.Xtest_txt_ft_respCdg = (self.Xtest_txt_ft_respCdg.T/self.Xtest_txt_ft_respCdg.sum(axis=1)).T
self.xtrain_txt_ft_respCdg = (self.xtrain_txt_ft_respCdg.T/self.xtrain_txt_ft_respCdg.sum(axis=1)).T
self.cval_txt_ft_respCdg = (self.cval_txt_ft_respCdg.T/self.cval_txt_ft_respCdg.sum(axis=1)).T#one hot encodings for text
# building a TfidfVectorizer with all the words that occured minimum 3 times in train data
self.txt_vectrzr_trntst = TfidfVectorizer(min_df=3)self.xtrain_txt_ft_1hotCdg = self.txt_vectrzr_trntst.fit_transform(self.x_trn['TEXT'])
self.xtrain_txt_ft_1hotCdg = normalize(self.xtrain_txt_ft_1hotCdg, axis=0)# getting all the feature names (words)
self.xtrain_txt_ft= self.txt_vectrzr_trntst.get_feature_names()
# we use the same vectorizer that was trained on train data
self.Xtest_txt_ft_1hotCdg = self.txt_vectrzr_trntst.transform(self.X_test['TEXT'])
# don't forget to normalize every feature
self.Xtest_txt_ft_1hotCdg = normalize(self.Xtest_txt_ft_1hotCdg, axis=0)# we use the same vectorizer that was trained on train data
self.xcval_txt_ft_1hotCdg = self.txt_vectrzr_trntst.transform(self.x_cval['TEXT'])
# don't forget to normalize every feature
self.xcval_txt_ft_1hotCdg = normalize(self.xcval_txt_ft_1hotCdg, axis=0)
print("train_txt_feature_onehotCoding . shape of gene :", self.xtrain_txt_ft_1hotCdg.shape)
print("cval_txt_feature_onehotCoding . shape of gene :", self.xcval_txt_ft_1hotCdg.shape)
At this stage we will have the response and one-hot encoding for the three features(Gene,Variant and Text) in the dataset.
Stacking these features to make the final dataset that can be input into the model.
def stack_3_features(self): print('into stacking features') self.X_test_gn_vrtn_1hotcdg = hstack([self.Xtest_gn_ft_1hotCdg,self.Xtest_vrtn_ft_1hotCdg])
self.x_train_gn_vrtn_1hotcdg = hstack([self.train_gn_ft_1hotCdg,self.xtrain_vrtn_ft_1hotCdg])
self.x_cval_gn_vrtn_1hotcdg = hstack([self.cval_gn_ft_1hotCdg,self.cval_vrtn_ft_1hotCdg]) self.X_test_1hotCdg = hstack(([self.X_test_gn_vrtn_1hotcdg,self.Xtest_txt_ft_1hotCdg]))
self.y_test = np.array(list(self.X_test['Class']))self.x_train_1hotCdg = hstack(([self.x_train_gn_vrtn_1hotcdg,self.xtrain_txt_ft_1hotCdg]))
self.y_trn = np.array(list(self.x_trn['Class']))self.x_cval_1hotCdg= hstack(([self.x_cval_gn_vrtn_1hotcdg,self.xcval_txt_ft_1hotCdg]))
self.y_cval = np.array(list(self.x_cval['Class']))#response coding
X_test_gn_vrtn_respCdg = np.hstack([self.Xtest_gn_ft_respCdg,self.Xtest_vrtn_ft_respCdg])
x_trn_gn_vrtn_respCdg = np.hstack([self.trn_gn_ft_respCdg,self.xtrain_vrtn_ft_respCdg])
x_cval_gn_vrtn_respCdg = np.hstack([self.cval_gn_ft_respCdg,self.cval_gn_ft_respCdg]) self.X_test_respCdg = np.hstack([X_test_gn_vrtn_respCdg,self.Xtest_txt_ft_respCdg])
self.x_trn_respCdg = np.hstack([x_trn_gn_vrtn_respCdg,self.xtrain_txt_ft_respCdg])
self.x_cval_respCdg = np.hstack([x_cval_gn_vrtn_respCdg,self.cval_txt_ft_respCdg])print("One hot encoding features :")
print("(number of data points * number of features) in test data = ", self.X_test_1hotCdg.shape)
print("(number of data points * number of features) in train data = ", self.x_train_1hotCdg.shape)
print("(number of data points * number of features) in cross validation data =", self.x_cval_1hotCdg.shape)print("Response encoding features :")
print("(number of data points * number of features) in test data = ", self.X_test_respCdg.shape)
print("(number of data points * number of features) in train data = ", self.x_trn_respCdg.shape)
print("(number of data points * number of features) in cross validation data =", self.x_cval_respCdg.shape)
I have used the above data in seven different Machine Learning models Naive Bayes, K-Nearest Neighbour, Logistic Regression, Linear SVM, Random Forest, Stacking Classifier and Maximum Voting Classifier.
While using TFIDF Vectorizer I created two different dataset :-
- included only those words that occured minimum three times in the TEXT column and all features of the vectorizer ( option 1 )
- included only the first 1000 features (option 2)
In this post we will talk about using Naive Bayes algorithm on the dataset created from option 1.
NaiveBayes
- Probabilistic Model
- ASSUMPTIOM — Conditional independence of Features
Training and Test Stage
Test Stage or Runtime
Train / Runtime conclusion
- Training stage we are storing only ratios/likelihoods
- Runtime is much more memory efficient
Bias -Variance Trade off
High Bias → underfitting
High Variance → overfitting
There is only one parameter alpha (that’s part of Laplace smoothing) that determines under/over fitting
Case 1 alpha=0
Small changes in D-train results in large changes to the model. This large model change is termed as high variance which leads to overfitting
Case 2 alpha is large
In this case we will get into a situation of underfitting which leads to high bias.
We find the right alpha either using simple or 10-fold cross validation
Naive Bayes Implementation
Our dataset is Multinomial hence we will use Sklearns’s MultinomialNaiveBayes which has the following signature
class sklearn.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)
Naive Bayes Calling Code
Code to perform Hyper-parameter tuning
naibay.mnbayes_hyperparm_tuning(cancdiag.x_train_1hotCdg, cancdiag.x_cval_1hotCdg, cancdiag.y_trn, cancdiag.y_cval)
Code that uses the Hyper-parameter tuned values
naibay.mnbayes_using_hyperparm(cancdiag.X_test_1hotCdg,cancdiag.x_train_1hotCdg,cancdiag.x_cval_1hotCdg,cancdiag.y_test,cancdiag.y_trn,cancdiag.y_cval)
Code for testing the Naive Bayes Classifier
test_pt_idx = 4
no_of_ft = 100
naibay.clasifier_testing(cancdiag.y_test,test_pt_idx,no_of_ft)
naibay.get_impfeature_names(cancdiag.X_test,cancdiag.X_test['TEXT'].iloc[test_pt_idx],cancdiag.X_test['Gene'].iloc[test_pt_idx],cancdiag.X_test['Variation'].iloc[test_pt_idx],cancdiag.gene_feat_names , cancdiag.vrtn_feat_names,cancdiag.xtrain_txt_ft,no_of_ft)
Naive Bayes Hyper-parameter tuning output
Code for Confusion , Precision and Recall matrices
confsn_mtx = load_data(dirname,'/mnbayesop/confsn_mtx')
precsn_mtx = load_data(dirname,'/mnbayesop/precsn_mtx')
recall_mtx = load_data(dirname,'/mnbayesop/recall_mtx')
labels = load_data(dirname,'/mnbayesop/clas_lbls')print("The cross validation log loss is:",cval_log_los)
no_mis_clf_pts = load_data(dirname,'/mnbayesop/hptun_numisclspts')print("Number of missclassified point :",no_mis_clf_pts)
# representing A in heatmap format
print("-"*20, "Confusion matrix", "-"*20)
plt.figure(figsize=(20,7))
sns.heatmap(confsn_mtx, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()print("-"*20, "Precision matrix (Columm Sum=1)", "-"*20)
plt.figure(figsize=(20,7))
sns.heatmap(precsn_mtx, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()# representing B in heatmap format
print("-"*20, "Recall matrix (Row sum=1)", "-"*20)
plt.figure(figsize=(20,7))
sns.heatmap(recall_mtx, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.show()
Conclusion
Naive Bayes Classifier Inference and Conclusion
- Percentage of mis-classified points is 43 percent hence for this dataset with these set of features Naive Bayes is not performing very well
- From the confusion matrix we can see that the classifier is not able to differentiate other labels from Class 3 inspite of Class 3 having very less amount of data points
The complete code is available at https://github.com/ariyurjana/Personalized_Cancer_diagnosis
The code only for Naive Bayes is avialable at
EDIT:
- Code snippets I have included are not getting formatted properly hence I have tried to display code as images . Need to find a better way for displaying code.
- the ipython notebook for the individual models will take some time to get upload to Github, sorry for this inconvenience