
I am using scikit-learn (Python 2.7) and have exported a decision tree, but I am not sure how to interpret the results. At first I thought the features were listed from most informative to least informative (top to bottom), but examining the value fields in the nodes suggests otherwise. How do I identify the top 5 most informative features from this output, or with a few lines of Python?

from sklearn import tree

tree.export_graphviz(classifierUsed2, feature_names=dv.get_feature_names(), out_file=treeFileName)     

# Output below
digraph Tree {
node [shape=box] ;
0 [label="avg-length <= 3.5\ngini = 0.0063\nsamples = 250000\nvalue = [249210, 790]"] ;
1 [label="name-entity <= 2.5\ngini = 0.5\nsamples = 678\nvalue = [338, 340]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="first-name=wm <= 0.5\ngini = 0.4537\nsamples = 483\nvalue = [168, 315]"] ;
1 -> 2 ;
3 [label="name-entity <= 1.5\ngini = 0.4016\nsamples = 435\nvalue = [121, 314]"] ;
2 -> 3 ;
4 [label="substring=ee <= 0.5\ngini = 0.4414\nsamples = 73\nvalue = [49, 24]"] ;
3 -> 4 ;
5 [label="substring=oy <= 0.5\ngini = 0.4027\nsamples = 68\nvalue = [49, 19]"] ;
4 -> 5 ;
6 [label="substring=im <= 0.5\ngini = 0.3589\nsamples = 64\nvalue = [49, 15]"] ;
5 -> 6 ;
7 [label="lastLetter-firstName=w <= 0.5\ngini = 0.316\nsamples = 61\nvalue = [49, 12]"] ;
6 -> 7 ;
8 [label="firstLetter-firstName=w <= 0.5\ngini = 0.2815\nsamples = 59\nvalue = [49, 10]"] ;
7 -> 8 ;
9 [label="substring=sa <= 0.5\ngini = 0.2221\nsamples = 55\nvalue = [48, 7]"] ;
... many many more lines below
KubiK888

2 Answers

  1. In Python you can use DecisionTreeClassifier.feature_importances_, which according to the documentation contains

    The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [R66].

    Simply run np.argsort on the feature importances and you get a feature ranking (ties are not accounted for).

  2. You can look at the Gini impurity (the gini value in the graphviz output) to get a first idea. Lower is better. However, be aware that you will need a way to combine impurity values if a feature is used in more than one split. Typically, this is done by taking the average information gain (or 'purity gain') over all splits on a given feature. This is done for you if you use feature_importances_; a small sketch of that accumulation follows after this list.

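To make that concrete, here is a rough sketch of how the per-split impurity decreases can be accumulated into a per-feature score (assuming the fitted classifier is classifierUsed2; the exact normalization scikit-learn applies internally may differ slightly):

import numpy as np

def manual_importances(clf):
    # Walk the fitted tree and sum, per feature, the weighted impurity
    # decrease of every split made on that feature.
    t = clf.tree_
    importances = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split here
            continue
        decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                    - t.weighted_n_node_samples[left] * t.impurity[left]
                    - t.weighted_n_node_samples[right] * t.impurity[right])
        importances[t.feature[node]] += decrease
    total = importances.sum()
    return importances / total if total > 0 else importances

# Should roughly agree with classifierUsed2.feature_importances_
print(manual_importances(classifierUsed2))
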
Edit: I see the problem goes deeper than I thought. The graphviz output is merely a graphical representation of the tree: it shows every split in detail. It is a representation of the tree, not of the features. Informativeness (or importance) of the features does not really fit into this representation, because it accumulates information over multiple nodes of the tree.

The variable classifierUsed2.feature_importances_ contains importance information for every feature. If you get for example [0, 0.2, 0, 0.1, ...] the first feature has an importance of 0, the second feature has an importance of 0.2, the third feature has an importance of 0, the fourth feature an importance of 0.1, and so on.

Let's sort features by their importance (most important first):

import numpy as np
rank = np.argsort(classifierUsed2.feature_importances_)[::-1]

Now rank contains the indices of the features, starting with the most important one, e.g. [1, 3, 0, ...] (each index appears exactly once).

Want to see the five most important features?

print(rank[:5])

This prints the indices. Which index corresponds to which feature? That's something you should know yourself, because you presumably constructed the feature matrix. Chances are that this works:

print(dv.get_feature_names()[rank[:5]])

Or maybe this:

print('\n'.join(dv.get_feature_names()[i] for i in rank[:5]))
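
If indexing the plain Python list returned by dv.get_feature_names() with an index array fails, a simple workaround (a sketch using the same dv and rank objects as above) is to convert the names to a NumPy array first:

feature_names = np.array(dv.get_feature_names())  # arrays support indexing with an index array
print(feature_names[rank[:5]])  # names of the five most important features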
MB-F
  • I added (print tree.feature_importances_) but it says 'module' object has no attribute 'feature_importances_' – KubiK888 Jan 19 '16 at 09:19
  • Sorry, I misread your code. I thought `tree` was the classifier object. So in your case `classifierUsed2.feature_importances_` should work. – MB-F Jan 19 '16 at 09:26
  • I tried, but it just added lists like these [ 0. 0. 0. ..., 0. 0. 0.] or [ 0.00924365 0. 0. ..., 0. 0. 0. ] – KubiK888 Jan 19 '16 at 09:41
  • What's wrong with these? A value of 0 means that this feature was not useful (probably not used at all in the tree). – MB-F Jan 19 '16 at 09:49
  • am I supposed to replace this "tree.export_graphviz(classifierUsed2, feature_names=dv.get_feature_names(), out_file=treeFileName)" with this "tree.export_graphviz(classifierUsed2.feature_importances_, feature_names=dv.get_feature_names(), out_file=treeFileName)"? – KubiK888 Jan 19 '16 at 09:51
  • The problem is I don't know what these numbers are corresponding to which features. – KubiK888 Jan 19 '16 at 09:52
  • Why are the results (model.feature_importances_) different each time I run? The indices are different! What is the mathematical reason for that? In this case, how can I trust this feature ranking? I have 205 features from 27 observations for a binary classification. Every time I run, I get a different ranking. Is this method reliable? – Saman Jan 17 '17 at 23:52
  • @saman I think this question deserves to be asked separately. What is your model and what exactly do you mean by running? If you are talking about new fits each time, there may be reasons. You probably won't get much useful information out of such data anyway... – MB-F Jan 18 '17 at 06:27

As kazemakase already pointed out, you can get the most important features using classifier.feature_importances_:

print(sorted(list(zip(classifierUsed2.feature_importances_, dv.get_feature_names()))))
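
As a small addition tied to the original question (a sketch using the same classifierUsed2 and dv objects), the five most important features can be printed like this:

# Pair each importance with its feature name, sort descending, keep the top 5
top5 = sorted(zip(classifierUsed2.feature_importances_, dv.get_feature_names()),
              reverse=True)[:5]
for importance, name in top5:
    print("%s: %.4f" % (name, importance))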

Just as an addendum, I personally prefer the following printing structure (modified from this question/answer):

# Print Decision rules:
def print_decision_tree(tree, feature_names):
    left      = tree.tree_.children_left    # left child index of each node (-1 for leaves)
    right     = tree.tree_.children_right   # right child index of each node (-1 for leaves)
    threshold = tree.tree_.threshold        # split threshold of each node (-2 for leaves)
    features  = [feature_names[i] for i in tree.tree_.feature]  # feature split on at each node
    value = tree.tree_.value                # class counts stored in each node

    def recurse(left, right, threshold, features, node, indent=""):
        if threshold[node] != -2:  # -2 marks a leaf in sklearn's tree structure
            print (indent + "if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
            if left[node] != -1:
                recurse(left, right, threshold, features, left[node], indent + "   ")
            print (indent + "} else {")
            if right[node] != -1:
                recurse(left, right, threshold, features, right[node], indent + "   ")
            print (indent + "}")
        else:
            print (indent + "return " + str(value[node]))

    recurse(left, right, threshold, features, 0)

# Use it like this:
print_decision_tree(classifierUsed2, dv.get_feature_names())
Robin Spiess
  • Interesting, I wonder how I can interpret this zig-zag tree. Are the top 5 features always the 5 most informative? Or does the next most informative one appear when there is an "else" turn? – KubiK888 Jan 19 '16 at 17:15
  • @KubiK888 Most likely but I don't know for sure as well. I think the feature_importances_ is pretty reliable, so you could compare the importance with the position of the feature in the tree. Unfortunately, the Berkeley university site, where the background of the procedure to compute the importances is described, is currently not available (http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm). – Robin Spiess Jan 20 '16 at 07:40