Cancer Diagnosis — Random Forest
We are going to use the same dataset we used in the Personalized Cancer Diagnosis article that was published earlier.
We will not repeat the exploratory data analysis, but will go straight to the machine learning model implementation on this dataset.
In Random Forests we have
- Decision Trees as base learners
- Row Sampling with replacement
- Column / Feature Sampling
- Aggregation for Label generation
When using Random Forests for
- Classification, the final label is generated by a majority vote across the base learners
- Regression, the mean / median of the base learners' predictions is taken as the target value
Base learners are Decision Trees grown to full depth (up to a maximum of around 20). Such deep trees are low-bias, high-variance models.
High variance can be reduced by
- Row-sampling with replacement
- Column-sampling / Feature selection
- Aggregation of many models
The out-of-bag (OOB) score/error can be treated as a cross-validation score. This helps ensure each model is reasonable / sensible.
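As a minimal sketch of this idea, sklearn exposes the OOB estimate via `oob_score=True`; the synthetic data below is only a stand-in for the cancer dataset, not the article's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the cancer dataset (illustrative only)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# bootstrap=True enables row sampling with replacement;
# oob_score=True evaluates each tree on the rows it never saw during training
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            oob_score=True, random_state=42)
rf.fit(X, y)

# The OOB accuracy behaves like a built-in cross-validation estimate
print(f"OOB score: {rf.oob_score_:.3f}")
```

Because every tree sees only a bootstrap sample, the left-out rows give an honest accuracy estimate without a separate validation split.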
As the number of models increases, variance reduces; as the number of models decreases, variance increases.
Hyper-parameters
- Number of models — K
- Row-sampling rate
- Column sampling rate
Trivially Parallelizable
Each model needs only its own sampled dataset, so each model can be trained separately on its own core / CPU.
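In sklearn this per-core training is a one-parameter change: `n_jobs=-1` uses all available cores. A small sketch on synthetic data (not the article's code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# n_jobs=-1 trains the K independent trees in parallel across all cores,
# which works because each tree depends only on its own bootstrap sample
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X, y)

print(len(rf.estimators_))  # one fitted decision tree per estimator
```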
Time and Space Complexity
Train time ==> O(n log n * d * K)
Run time ==> O(depth * K)
Run-time space ==> O(depth * K)
Disadvantages
- Data with a large number of dimensions
- Categorical variables with many categories
Random Forests use Decision Trees as base learners, so let's discuss a few things about them.
Decision Trees
Decision trees can be thought of, programmatically, as nested if-else statements. Geometrically, they are a set of axis-parallel hyperplanes.
Decision trees are also not distance-based; they do not rely on a distance measure of similarity between data points.
Let’s use the IRIS dataset and consider two features: petal length and sepal length.
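A quick sketch of this setup, keeping only those two iris columns and fitting a full-depth tree (illustrative, not the article's code):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Column 0 is sepal length, column 2 is petal length
X = iris.data[:, [0, 2]]
y = iris.target

# A full-depth tree: low bias, high variance on its own
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.get_depth(), tree.score(X, y))
```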
Plot Entropy
When the class labels are equally probable, entropy is maximum; as we move away from this, so that one of the class labels dominates, entropy decreases.
Random variable with k-values
Entropy of real-valued variables
If a random variable takes values between a and b, and all values between a and b are equally probable (a uniform distribution), then its entropy is maximum.
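In symbols (my notation, consistent with the discrete case above): for a random variable X taking k values with probabilities p_1, …, p_k,

```latex
H(X) = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad H(X) \le \log_2 k,
```

with equality exactly when every p_i = 1/k (the uniform case). Analogously, among distributions supported on [a, b], the uniform density has the maximal differential entropy, h(X) = log(b - a).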
Information Gain
IG = Entropy(parent) - (weighted average of Entropy(child nodes))
Representing Gini impurity and Entropy graphically
From 0 to 0.5 both entropy and Gini impurity increase. The difference is that entropy increases at a faster rate than Gini impurity.
From 0.5 to 1 both entropy and Gini impurity decrease.
Gini impurity is preferred over entropy because Gini impurity involves only a square term, whereas entropy has a log term, which is slightly more time-consuming to compute.
Sklearn uses Gini impurity by default in its tree-based estimators (criterion='gini').
Overfit and Underfit
If the depth of the tree is large then the number of points in the leaf node will be small. It may also happen that there could be only a single point and even that could be noisy.
Interpretability of the model decreases, as there are too many if conditions.
Geometrically, this corresponds to too many hypercubes and hyper-cuboids.
The minimum depth we can have is 1, which is called a decision stump and causes underfitting. Geometrically, there is a single axis-parallel hyperplane separating the data.
Decision Trees are a GOOD choice when we
- need to comfortably handle large data
- have data of reasonable dimensionality
- have low-latency requirements
Random Forest Implementation
class sklearn.ensemble.RandomForestClassifier
(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
Of these many parameters we will use only two, max_depth and n_estimators, due to low compute power.
Random Forest Calling Code
The code below performs hyperparameter tuning and then trains the model with the tuned hyperparameter values.
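Since the article's own code is not reproduced here, the following is a hedged sketch of that tune-then-refit pattern using GridSearchCV over the two parameters named above; the synthetic data and the grid values are assumptions, not the article's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic multi-class stand-in for the cancer dataset features
X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Tune only the two parameters used in the article: n_estimators, max_depth
param_grid = {"n_estimators": [100, 200], "max_depth": [5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3)
search.fit(X_tr, y_tr)

# GridSearchCV refits on the full training set with the tuned values
best = search.best_estimator_
print(search.best_params_, best.score(X_te, y_te))
```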
Random Forest Hyperparameter Tuning Output
Code for the Confusion, Precision and Recall Matrices
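One common way to build these three matrices (a sketch under my own assumptions, with made-up labels standing in for the model's predictions): normalize the confusion matrix by column sums for precision and by row sums for recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true vs predicted labels (stand-ins for the model's output)
y_true = [0, 0, 1, 1, 2, 2, 2, 1, 0]
y_pred = [0, 1, 1, 1, 2, 0, 2, 2, 0]

C = confusion_matrix(y_true, y_pred)  # rows = true class, cols = predicted

# Precision matrix: each column divided by its column sum
P = C / C.sum(axis=0, keepdims=True)
# Recall matrix: each row divided by its row sum
R = C / C.sum(axis=1, keepdims=True)

print(C, P.round(2), R.round(2), sep="\n")
```

Each column of the precision matrix and each row of the recall matrix sums to 1, so the diagonal cells read off per-class precision and recall directly.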
Conclusion
Random Forest Classifier Inference and Conclusion
- The percentage of misclassified points is 36 percent; hence, for this dataset with this set of features, Random Forest is performing well
- From the confusion matrix we can see that the classifier is not able to differentiate the other labels from Class 3
- Tuning additional parameters of the Random Forest model may improve its performance
The code for this is available at
https://github.com/ariyurjana/Personalized_Cancer_diagnosis
The PDF version is available at