Cancer Diagnosis — Random Forest

We are going to use the same dataset that we used in the Personalized Cancer Diagnosis article published earlier.

We will skip the exploratory data analysis and go straight to implementing the machine learning model on the dataset.

In Random Forests we have

• Decision Trees as base learners
• Row Sampling with replacement
• Column / Feature Sampling
• Aggregation for Label generation

When using Random Forests for

• Classification, the final label is decided by majority vote
• Regression, the target value is the mean / median of the individual predictions
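The ingredients listed above can be put together in a small hand-rolled forest. This is only a sketch under stated assumptions: it uses the Iris data bundled with scikit-learn instead of the cancer dataset, the values of `n_trees` and `cols_per_tree` are arbitrary, and columns are sampled once per tree for simplicity (scikit-learn's own Random Forest samples features at every split).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
n_rows, n_cols = X.shape
n_trees = 25        # assumed number of base learners (K)
cols_per_tree = 2   # assumed column-sampling size

forest = []
for _ in range(n_trees):
    # Row sampling with replacement (bootstrap sample of the same size as the data)
    rows = rng.integers(0, n_rows, size=n_rows)
    # Column / feature sampling without replacement
    cols = rng.choice(n_cols, size=cols_per_tree, replace=False)
    # Base learner: a fully grown decision tree on the sampled rows and columns
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    forest.append((tree, cols))

# Aggregation for classification: every tree votes and the majority label wins
votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])  # (n_trees, n_rows)
majority = np.array([np.bincount(v, minlength=3).argmax() for v in votes.T])
accuracy = (majority == y).mean()
print(f"Accuracy of the majority vote on the training data: {accuracy:.3f}")
```

For regression the voting step would be replaced by `np.mean` or `np.median` over the per-tree predictions.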

Base learners are fully grown Decision Trees, with the depth capped at about 20. Such deep trees give models with low bias but high variance.

High variance can be reduced by

• Row-sampling with replacement
• Column-sampling / Feature selection
• Aggregation of many models

The out-of-bag (OOB) score/error can be treated as a cross-validation score: each tree is evaluated on the rows that were left out of its bootstrap sample. This helps to check that each model is reasonable / sensible.
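As a rough illustration of the OOB score, scikit-learn can compute it during training when `oob_score=True` is passed. The sketch below assumes the Iris data as a stand-in dataset and an arbitrary `n_estimators=200`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# bootstrap=True is required for OOB scoring: only bootstrapped trees have left-out rows
rf = RandomForestClassifier(
    n_estimators=200, bootstrap=True, oob_score=True, random_state=0
)
rf.fit(X, y)

# Accuracy measured only on rows that each tree did not see during training
oob_accuracy = rf.oob_score_
print(f"OOB accuracy: {oob_accuracy:.3f}")
```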

As the number of models increases, variance decreases; as the number of models decreases, variance increases.

• Number of models — K
• Row-sampling rate
• Column sampling rate

Each of the models needs only its own sampled dataset, so each of the models can be trained separately on its own core / CPU.
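In scikit-learn this parallelism is exposed through the `n_jobs` parameter. A minimal sketch, assuming synthetic data from `make_classification` and an arbitrary forest size of 50 trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the cancer dataset)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_jobs=-1 asks scikit-learn to build the trees using all available cores
rf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0).fit(X, y)

n_trained = len(rf.estimators_)
print(f"Trained {n_trained} trees")
```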

Train time ==> O(n log n * d * k), where n is the number of rows, d the number of features and k the number of trees

Run time ==> O(depth * k)

Run-time space ==> O(depth * k)

• Large dimensions
• Categorical variables with many dimensions

Random forests use Decision trees as base learners, so let’s discuss a few things about them.

Programmatically, decision trees can be thought of as nested if-else statements. Geometrically, they are a set of axis-parallel hyperplanes.

Decision trees are also not distance-based: they do not rely on a distance measure to judge the similarity between data points.

Let’s use the IRIS dataset and consider two features, petal length and sepal length.
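To make the "nested if-else" picture concrete, the following sketch fits a shallow tree on those two Iris features and prints its rules with `export_text`. The depth of 2 is an assumption chosen to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# Column 0 is sepal length and column 2 is petal length in scikit-learn's Iris data
X = iris.data[:, [0, 2]]
names = ["sepal length (cm)", "petal length (cm)"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, iris.target)

# Each printed line is a threshold test on one feature, i.e. an axis-parallel split
rules = export_text(tree, feature_names=names)
print(rules)
```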

Plot Entropy

When the class labels are equally probable, entropy is at its maximum. As we move away from that point, so that one of the class labels dominates, entropy decreases.

Random variable with k-values

If a random variable takes values between a and b, and all values between a and b are equally probable, then entropy is maximum.
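A small sketch of the discrete case, using only the standard library. The helper name `entropy` is hypothetical, and base-2 logarithms are assumed so that entropy is measured in bits.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    # Terms with p == 0 contribute nothing and are skipped to avoid log2(0)
    return sum(-p * math.log2(p) for p in probs if p > 0)

balanced = entropy([0.5, 0.5])      # two equally probable classes: 1.0 bit
skewed = entropy([0.9, 0.1])        # one class dominates: lower entropy
uniform_k4 = entropy([0.25] * 4)    # k = 4 equally probable values: log2(4) = 2.0 bits
peaked_k4 = entropy([0.7, 0.1, 0.1, 0.1])

print(balanced, skewed, uniform_k4, peaked_k4)
```

For a variable with k possible values, the uniform distribution reaches the maximum of log2(k) bits; any other distribution over the same values scores lower.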

IG = Entropy(parent) - (weighted average of Entropy(child nodes))
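The formula can be checked on a toy split. In this sketch the function names and the tiny label lists are made up for illustration; each child is weighted by its share of the parent's points.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of the class labels found in one node."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted average entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * label_entropy(child) for child in children)
    return label_entropy(parent) - weighted

parent = [0, 0, 1, 1]
perfect_split = information_gain(parent, [[0, 0], [1, 1]])   # children are pure
useless_split = information_gain(parent, [[0, 1], [0, 1]])   # children as mixed as the parent

print(perfect_split, useless_split)
```

A split that produces pure children recovers all of the parent's entropy as gain, while a split that leaves the children as mixed as the parent gains nothing.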

Representing Gini impurity and Entropy graphically

As the probability of one class goes from 0 to 0.5, both entropy and Gini impurity increase. The difference is that entropy increases at a faster rate than Gini impurity.

From 0.5 to 1 both entropy and Gini impurity decrease.

Gini impurity is preferred over entropy because Gini impurity only has squared terms, whereas entropy has a log term, which is slightly more time-consuming to compute.

Sklearn uses Gini impurity as its default splitting criterion (criterion='gini').
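The two curves can be compared numerically for a two-class problem. This is a sketch with hypothetical helper names; p is the probability of one of the two classes.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-class node where one class has probability p."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def binary_gini(p):
    """Gini impurity of the same node: 1 minus the sum of squared class probabilities."""
    return 1 - (p ** 2 + (1 - p) ** 2)

grid = [i / 10 for i in range(11)]
table = [(p, binary_entropy(p), binary_gini(p)) for p in grid]

for p, h, g in table:
    print(f"p={p:.1f}  entropy={h:.3f}  gini={g:.3f}")
```

Both measures are zero for a pure node and peak at p = 0.5, where entropy reaches 1 and Gini impurity reaches 0.5.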

If the depth of the tree is large, the number of points in each leaf node will be small. A leaf may even contain only a single point, and that point could be noisy, so the tree overfits.

Interpretability of the model will also decrease, as there are too many if conditions.

Geometrically, the feature space is cut into too many hypercubes and hypercuboids.

The minimum depth we can have is 1. Such a tree is called a decision stump, and it will cause underfitting. Geometrically, there is only a single axis-parallel hyperplane to separate the data.
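The effect of depth can be seen by comparing train and test accuracy for a stump, a shallow tree and a fully grown tree. This sketch assumes the Iris data and a 70/30 stratified split, both chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

scores = {}
for depth in (1, 3, None):  # 1 = decision stump, None = grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

With three balanced classes, a stump has only two leaves and therefore cannot get more than about two thirds of the training points right, which is the underfitting case described above.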

• comfortably handle large data
• dimensionality is reasonable
• low latency requirements

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

Of these many parameters we will tune only two, max_depth and n_estimators, because of limited compute power.

The code below performs hyperparameter tuning and then trains the model with the tuned hyperparameter values.

Random Forest Hyper Parameter tuning output
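A tuning loop over these two parameters could look roughly like the sketch below. It is not the article's code: it assumes synthetic three-class data from `make_classification` in place of the cancer dataset, a small illustrative grid, 3-fold cross-validation, and log loss as the selection metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Only the two parameters discussed in the text are searched
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="neg_log_loss",  # assumed metric; higher (less negative) is better
    cv=3,
)
search.fit(X_tr, y_tr)

# GridSearchCV refits the best combination on the full training split
best = search.best_estimator_
test_loss = log_loss(y_te, best.predict_proba(X_te))
print("Best parameters:", search.best_params_)
print(f"Test log loss: {test_loss:.3f}")
```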

Code for Confusion, Precision and Recall matrices
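One common way to build the three matrices is to normalise the confusion matrix by columns (precision) and by rows (recall). The sketch below uses a made-up set of six labels and assumes every class is predicted at least once, so that no column sum is zero.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows are true classes, columns are predicted classes
C = confusion_matrix(y_true, y_pred)

# Precision matrix: each column is divided by its sum (columns sum to 1)
precision_matrix = C / C.sum(axis=0, keepdims=True)

# Recall matrix: each row is divided by its sum (rows sum to 1)
recall_matrix = C / C.sum(axis=1, keepdims=True)

# Fraction of misclassified points: everything off the diagonal
misclassified = 1 - np.trace(C) / C.sum()

print(C)
print(precision_matrix.round(2))
print(recall_matrix.round(2))
print(f"Misclassified: {misclassified:.2%}")
```

The diagonal of the precision matrix holds the per-class precision and the diagonal of the recall matrix holds the per-class recall; off-diagonal entries show which classes are confused with each other.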

Conclusion

• The percentage of misclassified points is 36 percent; hence, for this dataset and this set of features, Random Forest is performing well
• From the confusion matrix we can see that the classifier is not able to differentiate the other labels from Class 3
• Tuning other parameters of the Random Forest model may improve the performance

The code for this is available at

https://github.com/ariyurjana/Personalized_Cancer_diagnosis