Introduction to Random Forest

Overview

Teaching: 10 min
Questions
  • What is a decision tree?

  • What are the major drawbacks of decision trees?

Objectives
  • To understand how a decision tree is built and used.

  • To understand the limitations of decision trees.

In order to understand a random forest, some general background on decision trees is needed.

What is a decision tree?

Classification and Regression Tree (CART) models were introduced by Breiman et al. These models take a top-down approach to the observation data. Given a set of observations, the following question is asked: is the target value the same (or nearly the same) for every observation in this set?

If yes, label the set of observations with the most frequent class. If no, find the rule that best splits the observations into the purest possible subsets, then ask the same question of each subset.
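The "purest split" idea can be sketched in a few lines. The example below is only an illustration, not the lesson's or scikit-learn's implementation: it assumes Gini impurity as the purity measure (the default criterion of scikit-learn's DecisionTreeClassifier), considers a single feature, and uses a small made-up data set. The function names gini and best_split are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure set, larger for a more mixed set."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Try thresholds midway between neighbouring distinct values of one feature.

    Returns (threshold, weighted_impurity) for the split whose two subsets are
    purest, or (None, impurity_of_unsplit_set) if no split improves on it.
    """
    distinct = sorted(set(values))
    best = (None, gini(labels))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each subset's impurity by its share of the observations.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up toy observations (assumption: not taken from the iris data set).
petal_width = [0.2, 0.3, 0.4, 1.3, 1.5, 2.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]

print("impurity before splitting:", gini(species))
print("best (threshold, impurity):", best_split(petal_width, species))
```

On this toy data the best threshold falls between 0.4 and 1.3, which separates every setosa from the other two species. A real tree would repeat the search over every feature and then recurse into each subset.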

How do they work?

An example applied to the iris data set:

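from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

# Load the iris measurements (data) and species labels (target).
iris_data = load_iris()

# Fit a decision tree classifier on the full data set.
model = DecisionTreeClassifier()
model.fit(iris_data.data, iris_data.target)

# Export the fitted tree in Graphviz DOT format as a string.
dot = export_graphviz(model,
                      out_file=None,
                      feature_names=iris_data.feature_names,
                      class_names=iris_data.target_names,
                      filled=True,
                      impurity=False,  # expects a bool; hides the impurity value in each node
                      )

# Render the DOT source to a file named after "iris_decision_tree".
graph = graphviz.Source(dot)
graph.render("iris_decision_tree")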

The resulting tree shows the decisions used to determine the species of an iris:

[Figure: iris-decision-tree]

To read this tree, start from the top (white) node. The first line of each node gives the rule used to split the observations at that node into two new nodes: observations that satisfy the rule go left, and the rest go right.

Using this decision tree we can now classify new observations:

Observation 1:

A flower with a petal width of 0.7, a petal length of 1.0, and a sepal width of 3.0.

From the root node:

  • Is the petal width less than or equal to 0.8?
    • Yes -> go left.

The flower is from the species setosa.

Observation 2:

A flower with a petal width of 0.9, a petal length of 1.0, and a sepal width of 3.0.

Solution

From the root node:

  • Is the petal width less than or equal to 0.8 ?
    • No -> go right.
  • Is the petal width less than or equal to 1.75 ?
    • Yes -> go left.
  • Is the petal length less than or equal to 4.95 ?
    • Yes -> go left.
  • Is the petal width less than or equal to 1.65 ?
    • Yes -> go left.

The flower is from the species versicolor.
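The two walkthroughs above can also be written as nested conditionals, which is all a fitted decision tree amounts to at prediction time. The sketch below encodes only the branches described in the text, using the thresholds quoted there. The function name classify is hypothetical, and any branch the walkthroughs do not cover returns None rather than a guessed species.

```python
def classify(petal_width, petal_length):
    """Follow the branches from the two worked observations.

    Thresholds (0.8, 1.75, 4.95, 1.65) are the ones quoted in the walkthroughs.
    Branches that the text does not describe return None instead of a guess.
    """
    if petal_width <= 0.8:
        return "setosa"                    # root: yes -> left leaf
    if petal_width <= 1.75:                # root: no -> right, then yes -> left
        if petal_length <= 4.95:           # yes -> left
            if petal_width <= 1.65:        # yes -> left leaf
                return "versicolor"
    return None                            # path not covered by the walkthroughs

# Sepal width is not consulted on either path, so it is not a parameter here.
print(classify(petal_width=0.7, petal_length=1.0))  # Observation 1
print(classify(petal_width=0.9, petal_length=1.0))  # Observation 2
```

With a fitted scikit-learn model, the same lookup is done by model.predict, which takes all four iris measurements for each flower.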

Limitations of Decision Trees

While decision trees produce models that are easy to understand and implement, they also tend to overfit their training data, so they perform poorly when the data they are shown later does not closely match what they were trained on.
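A small, self-contained sketch can make overfitting concrete. Everything in it is an assumption made for illustration. The data are made up, with one feature and one deliberately mislabeled training point. The tree builder is a toy that picks splits by counting misclassifications rather than using the criterion a real CART implementation uses. The max_depth argument is a hypothetical knob that limits how far the tree may grow.

```python
from collections import Counter

def majority(labels):
    """Most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def errors(labels):
    """How many points a majority-vote leaf would get wrong."""
    return len(labels) - Counter(labels).most_common(1)[0][1] if labels else 0

def build(points, max_depth=None):
    """Grow a tree on (x, label) pairs; a leaf is a label, a node is a tuple."""
    labels = [label for _, label in points]
    xs = sorted({x for x, _ in points})
    if len(set(labels)) == 1 or len(xs) == 1 or max_depth == 0:
        return majority(labels)
    best_t, best_err = None, None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        err = (errors([l for x, l in points if x <= t])
               + errors([l for x, l in points if x > t]))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    deeper = None if max_depth is None else max_depth - 1
    left = build([p for p in points if p[0] <= best_t], deeper)
    right = build([p for p in points if p[0] > best_t], deeper)
    return (best_t, left, right)

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def accuracy(tree, points):
    return sum(predict(tree, x) == label for x, label in points) / len(points)

# True rule: "A" if x <= 5 else "B". One training label (x = 8) is wrong on purpose.
train = [(x, "A" if x <= 5 else "B") for x in range(1, 11)]
train[7] = (8, "A")
test = [(2.5, "A"), (4.5, "A"), (6.5, "B"), (7.8, "B"), (8.2, "B"), (9.5, "B")]

full = build(train)                # grown until every leaf is pure
stump = build(train, max_depth=1)  # a single split

print("full tree:", accuracy(full, train), accuracy(full, test))
print("stump:    ", accuracy(stump, train), accuracy(stump, test))
```

The fully grown tree scores perfectly on the training data because it carves out a region just for the mislabeled point, and it then misclassifies the unseen points that fall into that region. The single-split tree scores lower on the training data but higher on the unseen data.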

Regression trees have an additional limitation: they can only predict values within the range of labels seen during training, so there are explicit upper and lower bounds on the numbers they can produce.
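The bound on regression trees follows from how leaves make predictions: each leaf returns an average of the training labels that reached it, and an average can never fall outside the smallest and largest label. The sketch below shows this with a one-split regression tree written from scratch. The function names fit_stump and predict and the data (y = 2x for x from 1 to 6) are assumptions made for illustration, and the code assumes all x values are distinct.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimising the squared error.

    Returns (threshold, left_mean, right_mean); each leaf predicts the mean
    of the training labels that fall on its side of the threshold.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Made-up training data following y = 2x, so the labels range from 2 to 12.
xs = [1, 2, 3, 4, 5, 6]
ys = [2 * x for x in xs]
stump = fit_stump(xs, ys)

print(predict(stump, 100))   # 10.0, although the underlying rule gives 200
print(predict(stump, -100))  # 4.0, although the underlying rule gives -200
```

Growing a deeper tree changes the leaf averages but not the conclusion, because every leaf still averages training labels.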

Key Points

  • Decision trees are the fundamental building block of the Random Forest.

  • They provide an explainable, human-understandable model for making predictions.

  • They also tend to overfit their training data, making them poor at predictive tasks where the data does not closely match what they have seen previously.