
I am using scikit-learn (Python 2.7) and have exported a decision tree, but I am not sure how to interpret the results. At first I thought the features were listed from most informative to least informative (top to bottom), but examining the value fields in the nodes suggests otherwise. How do I identify the top 5 most informative features from this output, or with a few lines of Python?

from sklearn import tree

tree.export_graphviz(classifierUsed2, feature_names=dv.get_feature_names(), out_file=treeFileName)     

# Output below
digraph Tree {
node [shape=box] ;
0 [label="avg-length <= 3.5\ngini = 0.0063\nsamples = 250000\nvalue = [249210, 790]"] ;
1 [label="name-entity <= 2.5\ngini = 0.5\nsamples = 678\nvalue = [338, 340]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="first-name=wm <= 0.5\ngini = 0.4537\nsamples = 483\nvalue = [168, 315]"] ;
1 -> 2 ;
3 [label="name-entity <= 1.5\ngini = 0.4016\nsamples = 435\nvalue = [121, 314]"] ;
2 -> 3 ;
4 [label="substring=ee <= 0.5\ngini = 0.4414\nsamples = 73\nvalue = [49, 24]"] ;
3 -> 4 ;
5 [label="substring=oy <= 0.5\ngini = 0.4027\nsamples = 68\nvalue = [49, 19]"] ;
4 -> 5 ;
6 [label="substring=im <= 0.5\ngini = 0.3589\nsamples = 64\nvalue = [49, 15]"] ;
5 -> 6 ;
7 [label="lastLetter-firstName=w <= 0.5\ngini = 0.316\nsamples = 61\nvalue = [49, 12]"] ;
6 -> 7 ;
8 [label="firstLetter-firstName=w <= 0.5\ngini = 0.2815\nsamples = 59\nvalue = [49, 10]"] ;
7 -> 8 ;
9 [label="substring=sa <= 0.5\ngini = 0.2221\nsamples = 55\nvalue = [48, 7]"] ;
... many many more lines below
KubiK888

2 Answers

  1. In Python you can use DecisionTreeClassifier.feature_importances_, which according to the documentation contains

    The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [R66].

    Simply run np.argsort on the feature importances and you get a feature ranking (ties are not accounted for).

  2. You can look at the Gini impurity (the gini value in the graphviz output) to get a first idea. Lower is better. However, be aware that you will need a way to combine impurity values if a feature is used in more than one split. Typically, this is done by taking the average information gain (or 'purity gain') over all splits on a given feature. This is done for you if you use feature_importances_; a small sketch of that accumulation follows after this list.

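To make that concrete, here is a rough sketch of how the per-split impurity decreases can be accumulated into a per-feature score (assuming the fitted classifier is classifierUsed2; the exact normalization scikit-learn applies internally may differ slightly):

import numpy as np

def manual_importances(clf):
    # Walk the fitted tree and sum, per feature, the weighted impurity
    # decrease of every split made on that feature.
    t = clf.tree_
    importances = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split here
            continue
        decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                    - t.weighted_n_node_samples[left] * t.impurity[left]
                    - t.weighted_n_node_samples[right] * t.impurity[right])
        importances[t.feature[node]] += decrease
    total = importances.sum()
    return importances / total if total > 0 else importances

# Should roughly agree with classifierUsed2.feature_importances_
print(manual_importances(classifierUsed2))
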
Edit: I see the problem goes deeper than I thought. The graphviz output is merely a graphical representation of the tree: it shows every split in detail. It is a representation of the tree, not of the features. Informativeness (or importance) of the features does not really fit into this representation, because it accumulates information over multiple nodes of the tree.

The variable classifierUsed2.feature_importances_ contains importance information for every feature. If you get for example [0, 0.2, 0, 0.1, ...] the first feature has an importance of 0, the second feature has an importance of 0.2, the third feature has an importance of 0, the fourth feature an importance of 0.1, and so on.

Let's sort features by their importance (most important first):

import numpy as np
rank = np.argsort(classifierUsed2.feature_importances_)[::-1]

Now rank contains the indices of the features, starting with the most important one, e.g. [1, 3, 0, ...] (each index appears exactly once).

Want to see the five most important features?

print(rank[:5])

This prints the indices. Which index corresponds to which feature? That's something you should know yourself, because you presumably constructed the feature matrix. Chances are that this works:

print(dv.get_feature_names()[rank[:5]])

Or maybe this:

print('\n'.join(dv.get_feature_names()[i] for i in rank[:5]))
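
If indexing the plain Python list returned by dv.get_feature_names() with an index array fails, a simple workaround (a sketch using the same dv and rank objects as above) is to convert the names to a NumPy array first:

feature_names = np.array(dv.get_feature_names())  # arrays support indexing with an index array
print(feature_names[rank[:5]])  # names of the five most important features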
MB-F
  • I added (print tree.feature_importances_) but it says 'module' object has no attribute 'feature_importances_' – KubiK888 Jan 19 '16 at 09:19
  • Sorry, I misread your code. I thought `tree` was the classifier object. So in your case `classifierUsed2.feature_importances_` should work. – MB-F Jan 19 '16 at 09:26
  • I tried, but it just added lists like these [ 0. 0. 0. ..., 0. 0. 0.] or [ 0.00924365 0. 0. ..., 0. 0. 0. ] – KubiK888 Jan 19 '16 at 09:41
  • What's wrong with these? A value of 0 means that this feature was not useful (probably not used at all in the tree). – MB-F Jan 19 '16 at 09:49
  • am I supposed to replace this "tree.export_graphviz(classifierUsed2, feature_names=dv.get_feature_names(), out_file=treeFileName)" with this "tree.export_graphviz(classifierUsed2.feature_importances_, feature_names=dv.get_feature_names(), out_file=treeFileName)"? – KubiK888 Jan 19 '16 at 09:51
  • The problem is I don't know what these numbers are corresponding to which features. – KubiK888 Jan 19 '16 at 09:52
  • Why are the results (model.feature_importances_) different each time I run? The indices are different! What is the mathematical reason for that? In this case, how can I trust this feature ranking? I have 205 features from 27 observations for a binary classification. Every time I run, I get a different ranking. Is this method reliable? – Saman Jan 17 '17 at 23:52
  • @saman I think this question deserves to be asked separately. What is your model and what exactly do you mean by running? If you are talking about new fits each time, there may be reasons. You probably won't get much useful information out of such data anyway... – MB-F Jan 18 '17 at 06:27

As kazemakase already pointed out, you can get the most important features using classifier.feature_importances_:

print(sorted(list(zip(classifierUsed2.feature_importances_, dv.get_feature_names()))))
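
As a small addition tied to the original question (a sketch using the same classifierUsed2 and dv objects), the five most important features can be printed like this:

# Pair each importance with its feature name, sort descending, keep the top 5
top5 = sorted(zip(classifierUsed2.feature_importances_, dv.get_feature_names()),
              reverse=True)[:5]
for importance, name in top5:
    print("%s: %.4f" % (name, importance))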

Just as an addendum, I personally prefer the following printing structure (modified from this question/answer):

# Print Decision rules:
def print_decision_tree(tree, feature_names):
    left      = tree.tree_.children_left    # left child index of each node (-1 for leaves)
    right     = tree.tree_.children_right   # right child index of each node (-1 for leaves)
    threshold = tree.tree_.threshold        # split threshold of each node (-2 for leaves)
    features  = [feature_names[i] for i in tree.tree_.feature]  # feature split on at each node
    value = tree.tree_.value                # class counts stored in each node

    def recurse(left, right, threshold, features, node, indent=""):
        if threshold[node] != -2:  # -2 marks a leaf in sklearn's tree structure
            print (indent + "if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
            if left[node] != -1:
                recurse(left, right, threshold, features, left[node], indent + "   ")
            print (indent + "} else {")
            if right[node] != -1:
                recurse(left, right, threshold, features, right[node], indent + "   ")
            print (indent + "}")
        else:
            print (indent + "return " + str(value[node]))

    recurse(left, right, threshold, features, 0)

# Use it like this:
print_decision_tree(classifierUsed2, dv.get_feature_names())
Robin Spiess
  • Interesting, I wonder how I can interpret this zig-zag tree. Are the top 5 features always the 5 most informative? Or does the next most informative one appear when there is an "else" turn? – KubiK888 Jan 19 '16 at 17:15
  • @KubiK888 Most likely but I don't know for sure as well. I think the feature_importances_ is pretty reliable, so you could compare the importance with the position of the feature in the tree. Unfortunately, the Berkeley university site, where the background of the procedure to compute the importances is described, is currently not available (http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm). – Robin Spiess Jan 20 '16 at 07:40