Introduction to Random Forest

Overview

Teaching: 10 min
Questions
  • How do I train a Random Forest in sklearn?

  • How do I make a prediction with my trained model?

  • How do I create diagnostic graphs to understand how my model is performing?

Objectives
  • To use an existing API to create our first Random Forest.

  • To use that same API to understand how our model is performing.

After going through this material you might be able to implement your own random forest from the ground up, but instead we're going to use the scikit-learn implementation to cover how to train a random forest, make predictions with it, and compare it to a decision tree.

We’re not using the iris data set for this example

While the iris dataset is great for illustrating some of the concepts we've been exploring, it doesn't have enough variance for a random forest to outperform a simple decision tree. For this example we'll use the breast cancer dataset instead.
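If you want a quick sense of what we're working with before modeling, you can peek at the dataset's size and class labels (a small sketch, not part of the lesson code):

```python
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()
print(bc.data.shape)     # (569, 30): 569 samples, 30 features
print(bc.target_names)   # the two classes: 'malignant' and 'benign'
```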

from sklearn.tree     import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
 

bc = load_breast_cancer()
X = bc.data
y = bc.target

# Create our test/train split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


# Build our models
decision_tree = DecisionTreeClassifier()
random_forest = RandomForestClassifier(n_estimators=100)

# Train the classifiers
decision_tree.fit(X_train, y_train)
random_forest.fit(X_train, y_train)

# Create Predictions
dt_pred = decision_tree.predict(X_test)
rf_pred = random_forest.predict(X_test)

# Check the performance of each model 
print('Decision Tree Model')
print(classification_report(y_test, dt_pred, target_names=bc.target_names))

print('Random Forest Model')
print(classification_report(y_test, rf_pred, target_names=bc.target_names))

# Compute the confusion matrix for each model
dt_cm = confusion_matrix(y_test, dt_pred)
rf_cm = confusion_matrix(y_test, rf_pred)

Decision Tree Model
             precision    recall  f1-score   support

          0       0.91      0.94      0.93        54
          1       0.97      0.94      0.95        89

avg / total       0.94      0.94      0.94       143

Random Forest Model
             precision    recall  f1-score   support

          0       0.96      0.94      0.95        54
          1       0.97      0.98      0.97        89

avg / total       0.97      0.97      0.96       143

It appears that, based on common metrics of classification model performance, the random forest outperforms the decision tree.

Let’s see what the performance increase actually looks like (code here):

(Figure: side-by-side confusion matrices for the decision tree and the random forest.)
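The original figure isn't reproduced here, but one way to draw a similar side-by-side comparison is with matplotlib and scikit-learn's ConfusionMatrixDisplay. This is a sketch; the layout and styling are assumptions, not the lesson's original plotting code:

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target,
                                                    random_state=42)

models = {'Decision Tree': DecisionTreeClassifier(),
          'Random Forest': RandomForestClassifier(n_estimators=100)}

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X_train, y_train)
    cm = confusion_matrix(y_test, model.predict(X_test))
    ConfusionMatrixDisplay(confusion_matrix=cm,
                           display_labels=bc.target_names).plot(ax=ax,
                                                                colorbar=False)
    ax.set_title(name)
plt.tight_layout()
plt.savefig('confusion_matrices.png')
```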

As you can see, by using our random forest we're able to increase the number of correctly predicted benign tumors and decrease the number of benign tumors predicted as malignant.

By using a random forest, we can more accurately predict the state of a tumor, potentially decreasing the number of unneeded procedures performed on patients and reducing patient stress about their diagnosis.

Hyper-parameter tuning

This is usually where you'd start investigating hyperparameter tuning of a model. Tuning is a crucial part of the modeling process for ensuring your model is optimal; however, it is outside the scope of this document.
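As a pointer for that future investigation, here is a minimal sketch of hyperparameter tuning with scikit-learn's GridSearchCV. The parameter grid below is illustrative only, not a recommendation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target,
                                                    random_state=42)

# A small, illustrative grid of candidate hyperparameter values
param_grid = {'n_estimators': [50, 100],
              'max_depth': [None, 5]}

# Try every combination with 5-fold cross-validation on the training set
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)   # the best combination found
print(search.best_score_)    # its mean cross-validated accuracy
```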

Key Points

  • Random forests are very simple to train in scikit-learn.