I have the following example code for a simple random forest classifier on the iris dataset using just 2 decision trees. This code is best run inside a jupyter notebook.
# Setup
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import numpy as np
# Set seed for reproducibility
np.random.seed(1015)
# Load the iris data
iris = load_iris()
# Create the train-test datasets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)
np.random.seed(1039)
# Just fit a simple random forest classifier with 2 decision trees
rf = RandomForestClassifier(n_estimators = 2)
rf.fit(X = X_train, y = y_train)
# Define a function to draw the decision trees in IPython
# Adapted from: http://scikit-learn.org/stable/modules/tree.html
from IPython.display import display, Image
import pydotplus
# Now plot the trees individually
for dtree in rf.estimators_:
dot_data = tree.export_graphviz(dtree
, out_file = None
, filled = True
, rounded = True
, special_characters = True)
graph = pydotplus.graph_from_dot_data(dot_data)
img = Image(graph.create_png())
display(img)
draw_tree(inp_tree = dtree)
#print(dtree.tree_.feature)
The output for the first tree is:
As can be observed the first decision has 8 leaf nodes and the second decision tree (not shown) has 6 leaf nodes
How do I extract a simple numpy array which contains information for each decision tree, and each leaf node in the tree:
- the classification outcome for that leaf node (e.g. most frequent class it predicted)
- all the features (boolean) used in the decision path to that same leaf node?
In the above example we would have:
- 2 trees -
{0, 1}
- for tree
{0}
we have 8 leaf nodes indexed{0, 1, ..., 7}
- for tree
{1}
we have 6 leaf nodes indexed{0, 1, ..., 5}
- for each leaf node in each tree we have a single most frequent predicted class i.e.
{0, 1, 2}
for the iris dataset - for each leaf node we have a set of boolean values for the 4 features that were used to make that tree. Here if one of the 4 features is used one or more times in the decision path to a leaf node we count it as a
True
otherwiseFalse
if it is never used in the decision path to the leaf node.
Any help adapting this numpy
array into the above code (loop) is appreciated.
Thanks