Introduction to Random Forest

Overview

Teaching: 5 min
Questions
  • How can I determine how important a variable is to the model?

Objectives
  • To learn how Gini impurity is calculated.

  • To learn how Gini importance is calculated.

Feature importance describes how much impact a feature (a column of the input data) has on the decisions a model makes. Feature importances are used to guide future work by concentrating effort on features we know have an impact (and perhaps ignoring those that don't), to simplify models and prevent overfitting by removing columns that aren't impactful enough to generalize, and to help explain why a model makes the decisions it does.

More than one way to skin a cat

There are multiple methods for determining feature importance from a model; we'll be covering how it's calculated in the scikit-learn implementation. This method is popular because it's cheap to compute, does a reasonably good job of determining importance, and is already implemented.

Unsurprisingly, in order to calculate the feature importance of the forest, we first need to calculate the feature importance of the individual trees and then find a way to combine them.

Gini Impurity

Gini impurity is a measure of the chance that a new observation, if randomly classified according to the distribution of labels in a node, would be classified incorrectly. It is bounded between 0 and 1 (0 meaning a wrong classification is impossible, i.e. the node is pure; values near 1 meaning a wrong classification is almost guaranteed).

Gini impurity of a node is calculated with the following equation:

$G = \sum_{i=1}^{C} p_i (1 - p_i)$

where $i$ is a predicted category, $C$ is the number of categories, and $p_i$ is the probability of a record in the node being assigned to class $i$ at random.

Example Calculation

Based on our first decision tree:

[Figure: our first decision tree]

The Gini impurity for the top node is:

50/150 * (1 - 50/150) + 50/150 * (1 - 50/150) + 50/150 * (1 - 50/150) = 0.667
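
We can check this arithmetic with a short helper function (a minimal sketch of the formula above, not scikit-learn's internal implementation); you can also reuse it on the exercise below:

def gini_impurity(class_counts):
    """Gini impurity of a node, given the per-class record counts it holds."""
    total = sum(class_counts)
    return sum((n / total) * (1 - n / total) for n in class_counts)

# The top node holds 50 records of each of the three iris species
print(round(gini_impurity([50, 50, 50]), 3))  # 0.667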

What would the Gini impurity of the next node to the left be?

Solution

50/50 * (1 - 50/50)

+ 0/50 * (1 - 0/50)

+ 0/50 * (1 - 0/50)

= 0

Gini Importance

The Gini importance of a feature is the total reduction of Gini impurity contributed by that feature. It is calculated as the weighted sum of the difference in Gini impurity between each node that splits on the feature and that node's children, using the following equation:

$I(f) = \sum_{t \text{ splits on } f} \left[ N_t G_t - N_{L(t)} G_{L(t)} - N_{R(t)} G_{R(t)} \right]$

where $N_t$ is the number of records reaching node $t$, $G_t$ is its Gini impurity, and $L(t)$ and $R(t)$ are its left and right child nodes.

Example Calculation

Based on our first decision tree:

[Figure: our first decision tree]

The Gini importance of sepal width works out to 1.332.

What would the importance of petal length be?

Solution

54 * 0.168 - (48 * 0.041 + 6 * 0.444)

+ 46 * 0.043 - (0 + 3 * 0.444)

+ 3 * 0.444 - (0 + 0)

= 6.418
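
We can sanity-check this arithmetic in code (a sketch that simply plugs in the node sizes and Gini impurities read off the tree above):

petal_length_importance = (
    54 * 0.168 - (48 * 0.041 + 6 * 0.444)   # first split on petal length
    + 46 * 0.043 - (0.0 + 3 * 0.444)        # the other child is pure (Gini = 0), so its term is 0
    + 3 * 0.444 - (0.0 + 0.0)               # both children are pure, so their N*G terms are 0
)
print(round(petal_length_importance, 3))  # 6.418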

Once you have the Gini importance of each feature, you simply divide each importance by the sum of all the importances to get the normalized feature importance for the model.

What are the normalized feature importances from this model?

Solution

Sum of Gini Importances = 100.05

sepal length = 0 / 100.05 = 0

sepal width = 1.332 / 100.05 = 0.0133

petal length = 6.418 / 100.05 = 0.064

petal width = 92.30 / 100.05 = 0.922
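
In code, the normalization is just a division by the total (using the values from this worked example):

raw_importances = {
    'sepal length': 0.0,
    'sepal width': 1.332,
    'petal length': 6.418,
    'petal width': 92.30,
}
total = sum(raw_importances.values())  # 100.05
normalized = {name: round(value / total, 4) for name, value in raw_importances.items()}
print(normalized)  # {'sepal length': 0.0, 'sepal width': 0.0133, 'petal length': 0.0641, 'petal width': 0.9225}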

Expanding to the Random Forest

Now that we can calculate feature importance for the individual weak learners, expanding it to the ensembled model is simple: the importance of a feature in the random forest is the average of that feature's importance across all of the trees.
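
This averaging is essentially what scikit-learn does under the hood: each fitted tree is exposed via the forest's estimators_ attribute, and averaging their per-tree importances should reproduce the forest's feature_importances_. A quick sketch of that equivalence (the next section builds the model more deliberately):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(iris.data, iris.target)

# Average each feature's importance across the individual trees...
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# ...which matches the forest's own reported importances
print(np.allclose(averaged, forest.feature_importances_))  # True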

Getting Feature Importance via sklearn

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target

# Create our test/train split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Build our model (without a fixed random_state, the importances will vary slightly between runs)
random_forest = RandomForestClassifier(n_estimators=100)

# Train the classifier
random_forest.fit(X_train, y_train)

# Get our features and weights
feature_list = sorted(zip(map(lambda x: round(x, 2), random_forest.feature_importances_), iris.feature_names),
                      reverse=True)

# Print them out
print('feature\t\timportance')
print("\n".join(['{}\t\t{}'.format(f,i) for i,f in feature_list]))
print('total_importance\t\t',  sum([i for i,f in feature_list]))
Output:

feature		importance
petal length (cm)		0.47
petal width (cm)		0.39
sepal length (cm)		0.1
sepal width (cm)		0.04
total_importance		 1.0

Based on these weights, it's pretty clear that petal shape plays a big role in determining iris species. This might allow us to make recommendations about which data to collect (or stop collecting) in the future, or might give us some ideas about engineering new features around petal morphology to improve our model.

Key Points

  • Gini impurity is a measure of how pure a node is.

  • Gini importance is a measure of how important a feature is for the final model.