Cancer Diagnosis — Random Forest
We are going to use the same dataset we used in the Personalized Cancer Diagnosis article that was published earlier.
We will not repeat the exploratory data analysis, but will go straight to the machine learning model implementation on this dataset.
In Random Forests we have
- Decision Trees as base learners
- Row Sampling with replacement
- Column / Feature Sampling
- Aggregation for Label generation
When using Random Forests for
- Classification, the final label is generated by a majority vote across the base learners
- Regression, the mean / median of the base learners' predictions is taken as the target value
Base learners are Decision Trees grown to full depth (up to a maximum of around 20). Such deep trees are low-bias, high-variance models.
High variance can be reduced by
- Row-sampling with replacement
- Column-sampling / Feature selection
- Aggregation of many models
The out-of-bag (OOB) score/error can be treated as a cross-validation score. This helps ensure each model is reasonable / sensible.
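As a minimal sketch of this idea, sklearn exposes the OOB estimate via `oob_score=True`; the synthetic data below is only a stand-in for the cancer dataset, not the article's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the cancer dataset (illustrative only)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# bootstrap=True enables row sampling with replacement;
# oob_score=True evaluates each tree on the rows it never saw during training
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            oob_score=True, random_state=42)
rf.fit(X, y)

# The OOB accuracy behaves like a built-in cross-validation estimate
print(f"OOB score: {rf.oob_score_:.3f}")
```

Because every tree sees only a bootstrap sample, the left-out rows give an honest accuracy estimate without a separate validation split.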
As the number of models increases, variance reduces; as the number of models decreases, variance increases.
Hyper-parameters
- Number of models — K
- Row-sampling rate
- Column sampling rate
Trivially Parallelizable
Each model needs only its own sampled dataset, so each model can be trained separately on its own core / CPU.
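In sklearn this per-core training is a one-parameter change: `n_jobs=-1` uses all available cores. A small sketch on synthetic data (not the article's code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# n_jobs=-1 trains the K independent trees in parallel across all cores,
# which works because each tree depends only on its own bootstrap sample
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X, y)

print(len(rf.estimators_))  # one fitted decision tree per estimator
```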
Time and Space Complexity
Train time ==> O(n log n * d * K)
Run time ==> O(depth * K)
Run-time space ==> O(depth * K)
Disadvantages
- Data with a large number of dimensions
- Categorical variables with many categories
Random Forests use Decision Trees as base learners, so let's discuss a few things about them.
Decision Trees
Decision trees can be thought of, programmatically, as nested if-else statements. Geometrically, they are a set of axis-parallel hyperplanes.
Decision trees are also not distance-based; they do not rely on a distance measure of similarity between data points.
Let’s use the IRIS dataset and consider two features: petal length and sepal length.
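A quick sketch of this setup, keeping only those two iris columns and fitting a full-depth tree (illustrative, not the article's code):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Column 0 is sepal length, column 2 is petal length
X = iris.data[:, [0, 2]]
y = iris.target

# A full-depth tree: low bias, high variance on its own
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.get_depth(), tree.score(X, y))
```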
Plot Entropy
When the class labels are equally probable, entropy is maximum; as we move away from this, so that one of the class labels dominates, entropy decreases.
Random variable with k-values
Entropy of real-valued variables
If a random variable takes values between a and b, and all values between a and b are equally probable (a uniform distribution), then its entropy is maximum.
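In symbols (my notation, consistent with the discrete case above): for a random variable X taking k values with probabilities p_1, …, p_k,

```latex
H(X) = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad H(X) \le \log_2 k,
```

with equality exactly when every p_i = 1/k (the uniform case). Analogously, among distributions supported on [a, b], the uniform density has the maximal differential entropy, h(X) = log(b - a).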
Information Gain
IG = Entropy(parent) - (weighted average of Entropy(child nodes))
Representing Gini impurity and Entropy graphically
From 0 to 0.5 both entropy and Gini impurity increase. The difference is that entropy increases at a faster rate than Gini impurity.
From 0.5 to 1 both entropy and Gini impurity decrease.
Gini impurity is preferred over entropy because Gini impurity involves only a square term, whereas entropy has a log term, which is slightly more time-consuming to compute.
Sklearn uses Gini impurity by default in its tree-based estimators (criterion='gini').
Overfit and Underfit
If the depth of the tree is large then the number of points in the leaf node will be small. It may also happen that there could be only a single point and even that could be noisy.
Interpretability of the model decreases, as there are too many if conditions.
Geometrically, this corresponds to too many hypercubes and hyper-cuboids.
The minimum depth we can have is 1, which is called a decision stump and causes underfitting. Geometrically, there is a single axis-parallel hyperplane separating the data.
Decision Trees are a GOOD choice when we
- need to comfortably handle large data
- have data of reasonable dimensionality
- have low-latency requirements
Random Forest Implementation
class sklearn.ensemble.RandomForestClassifier
(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
Of these many parameters we will use only two, max_depth and n_estimators, due to low compute power.
Random Forest Calling Code
The code below performs hyperparameter tuning and then trains the model with the tuned hyperparameter values.
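Since the article's own code is not reproduced here, the following is a hedged sketch of that tune-then-refit pattern using GridSearchCV over the two parameters named above; the synthetic data and the grid values are assumptions, not the article's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic multi-class stand-in for the cancer dataset features
X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Tune only the two parameters used in the article: n_estimators, max_depth
param_grid = {"n_estimators": [100, 200], "max_depth": [5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3)
search.fit(X_tr, y_tr)

# GridSearchCV refits on the full training set with the tuned values
best = search.best_estimator_
print(search.best_params_, best.score(X_te, y_te))
```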
Random Forest Hyperparameter Tuning Output
Code for the Confusion, Precision and Recall Matrices
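One common way to build these three matrices (a sketch under my own assumptions, with made-up labels standing in for the model's predictions): normalize the confusion matrix by column sums for precision and by row sums for recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true vs predicted labels (stand-ins for the model's output)
y_true = [0, 0, 1, 1, 2, 2, 2, 1, 0]
y_pred = [0, 1, 1, 1, 2, 0, 2, 2, 0]

C = confusion_matrix(y_true, y_pred)  # rows = true class, cols = predicted

# Precision matrix: each column divided by its column sum
P = C / C.sum(axis=0, keepdims=True)
# Recall matrix: each row divided by its row sum
R = C / C.sum(axis=1, keepdims=True)

print(C, P.round(2), R.round(2), sep="\n")
```

Each column of the precision matrix and each row of the recall matrix sums to 1, so the diagonal cells read off per-class precision and recall directly.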
Conclusion
Random Forest Classifier Inference and Conclusion
- The percentage of misclassified points is 36 percent; hence, for this dataset with this set of features, Random Forest is performing well
- From the confusion matrix we can see that the classifier is not able to differentiate the other labels from Class 3
- Tuning additional parameters of the Random Forest model may improve its performance
The code for this is available at
https://github.com/ariyurjana/Personalized_Cancer_diagnosis
The PDF version is available at