
I have a trained DecisionTreeClassifier instance, and I am actually interested in the predicates of the underlying decision tree itself. So I need a clean way to traverse this tree.

As far as I can tell, the only official way to obtain a traversable representation is to export to a graphviz/dot file using scikit-learn's export_graphviz function. After that I can parse and analyse the graph representation of the tree, e.g. with a combination of networkx and pydot.
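To give an idea of what that parsing step could look like, here is a minimal stdlib-only sketch (using `re` instead of the full networkx/pydot stack; the embedded dot snippet is a shortened stand-in for the real exported file):

```python
import re

# A shortened dot snippet in the shape export_graphviz produces
# (a stand-in for the real exported file; raw string keeps \n literal).
dot = r'''digraph Tree {
node [shape=box] ;
0 [label="X[0] <= 15.0\ngini = 0.75\nsamples = 8\nvalue = [2, 2, 2, 2]"] ;
1 [label="X[1] <= 3.0\ngini = 0.5\nsamples = 4\nvalue = [2, 0, 2, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="gini = 0.0\nsamples = 2\nvalue = [0, 0, 2, 0]"] ;
1 -> 2 ;
}'''

# Node definitions look like:  0 [label="..."] ;
node_re = re.compile(r'^(\d+) \[label="([^"]*)"\]', re.M)
# Edge definitions look like:  0 -> 1 [...] ;   or   1 -> 2 ;
edge_re = re.compile(r'^(\d+) -> (\d+)', re.M)

nodes = {int(n): label for n, label in node_re.findall(dot)}
edges = [(int(a), int(b)) for a, b in edge_re.findall(dot)]

# The first label line of an internal node is its split predicate.
print(nodes[0].split(r'\n')[0])  # X[0] <= 15.0
print(edges)                     # [(0, 1), (1, 2)]
```

This only recovers nodes and edges, of course; the boolean branch information is exactly what the question below is about.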

But...

the content of my particular dot file is as follows:

```
digraph Tree {
node [shape=box] ;
0 [label="X[0] <= 15.0\ngini = 0.75\nsamples = 8\nvalue = [2, 2, 2, 2]"] ;
1 [label="X[1] <= 3.0\ngini = 0.5\nsamples = 4\nvalue = [2, 0, 2, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="gini = 0.0\nsamples = 2\nvalue = [0, 0, 2, 0]"] ;
1 -> 2 ;
3 [label="gini = 0.0\nsamples = 2\nvalue = [2, 0, 0, 0]"] ;
1 -> 3 ;
4 [label="X[1] <= 3.0\ngini = 0.5\nsamples = 4\nvalue = [0, 2, 0, 2]"] ;
0 -> 4 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
5 [label="gini = 0.0\nsamples = 2\nvalue = [0, 0, 0, 2]"] ;
4 -> 5 ;
6 [label="gini = 0.0\nsamples = 2\nvalue = [0, 2, 0, 0]"] ;
4 -> 6 ;
}
```

So this looks all fine and dandy, but why are only the edges connected to the root node properly labelled with a boolean value? Shouldn't all edges in this graph have a proper boolean label/attribute attached to them?

Or, if there is some weird graphviz/dot convention going on that helps me tell apart subsequent sibling edges, what is the rule?

I have noticed from scikit-learn's documentation on the decision tree classifier that the example rendered graphviz decision tree is also missing the boolean labels. As far as my insight into decision trees goes, this leaves out important information about the tree. Again, are there any conventions that I am missing here? E.g. is a left edge always implicitly True? And how can I tell that from the dot file, since it is organized vertically?

Cœur
Yunus King
  • Can I suggest that you remove the `graphviz` and `dot` labels? `graphviz` only does what it is being told, and as long as the source code does not contain edge labels, it will not display anything, just as one would expect. – vaettchen Jun 08 '18 at 14:04
  • But why would I want to remove even more labels? The labels contain the actual relevant metadata for my tree. If anything, I want more labels, not less :) – Yunus King Jun 09 '18 at 06:10
  • Add or remove, the point is that you will have to do it manually if your code producing app doesn't do it for you. `graphviz` follows the instructions it is getting, your problem is on the level before. – vaettchen Jun 09 '18 at 06:47
  • Ah, ok. I see your point now. But I was only interested in the dot file because I thought it was the only official way to get a(n albeit serialized) representation of my tree. I didn't care about eventually rendering my tree with graphviz. I now understand there is a different pythonic way to get the structure out of the DecisionTreeClassifier. And yes, if I really want to, I can now add those extra boolean labels myself to the dot file. – Yunus King Jun 09 '18 at 07:19

1 Answer


After accidentally stumbling upon an example on the scikit-learn website, I realized that I do not have to parse the exported dot file to get a Python tree structure representing my constructed decision tree. I can use the tree_ attribute of the DecisionTreeClassifier instance, which is an exposed attribute according to the official API reference (all the way at the bottom), and there is a documented example of how to use this tree_ object here.
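For completeness, a rough sketch of such a traversal, along the lines of that documented example (the toy data set below is made up; the children_left, children_right, feature, threshold and value attributes are the ones the example relies on):

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data with four separable classes.
X = [[10, 1], [10, 5], [20, 1], [20, 5],
     [12, 2], [12, 4], [25, 2], [25, 4]]
y = [2, 0, 3, 1, 2, 0, 3, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
t = clf.tree_  # the low-level sklearn.tree._tree.Tree object

def walk(node=0, depth=0):
    indent = "  " * depth
    # At a leaf, children_left and children_right are both -1.
    if t.children_left[node] == t.children_right[node]:
        print(f"{indent}leaf: value={t.value[node]}")
        return
    # Internal node: the predicate is X[feature] <= threshold.
    print(f"{indent}X[{t.feature[node]}] <= {t.threshold[node]}")
    walk(t.children_left[node], depth + 1)   # the "True" branch
    walk(t.children_right[node], depth + 1)  # the "False" branch

walk()
```

Crucially, the left child is the branch taken when the predicate holds, which answers the boolean-label question from above.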

However, it is pretty confusing (to me at least) that this tree object is exposed as part of the DecisionTreeClassifier API and has a documented usage example, while there is no officially published documentation of its underlying class sklearn.tree._tree.Tree. You just have to look into the source code.

Concerning the dot file, I am pretty sure now that its only purpose is to render the decision tree. This conclusion is reaffirmed by the source code of export_graphviz, which is hard-coded to attach edge labels only to the edges connected to the root. export_graphviz reads the tree_ attribute of the DecisionTreeClassifier, and from the way that attribute is used you can safely deduce that for any node the 'True' (left) edge is always written out before the 'False' (right) edge. IMHO this warrants a feature request for a parameter flag that labels all edges.
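Until such a flag exists, that ordering can be exploited to post-process the dot output yourself, as discussed in the comments above. A stdlib-only sketch (the label_edges helper is hypothetical, and the dot snippet abbreviates the file from the question):

```python
import re

# Edge lines from the exported dot file (node-definition lines omitted
# for brevity; they would not match the edge pattern anyway).
dot = '''digraph Tree {
node [shape=box] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
1 -> 2 ;
1 -> 3 ;
0 -> 4 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
4 -> 5 ;
4 -> 6 ;
}'''

edge_re = re.compile(r'^(\d+) -> (\d+)(.*);\s*$', re.M)

def label_edges(dot_src):
    """Label each unlabelled edge, relying on export_graphviz always
    writing a node's 'True' (left) edge before its 'False' (right) edge."""
    seen = {}  # parent id -> outgoing edges encountered so far
    def repl(m):
        parent, child, attrs = m.groups()
        first = seen.get(parent, 0) == 0
        seen[parent] = seen.get(parent, 0) + 1
        if 'headlabel' in attrs:  # root edges are already labelled
            return m.group(0)
        word = 'True' if first else 'False'
        return f'{parent} -> {child} [label="{word}"] ;'
    return edge_re.sub(repl, dot_src)

print(label_edges(dot))
```

This leaves the already-labelled root edges untouched and tags each remaining first sibling edge as "True" and each second one as "False".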

Yunus King
  • Did you ever figure out the solution to this? – bernando_vialli Sep 17 '18 at 15:29
  • So I basically followed the idea exemplified in the above ['here' link](http://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py). [This StackOverflow post](https://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree#22261053) also shows various ways of doing what I want. When I first stumbled upon that post, I thought the solutions were hacks, but apparently this IS how scikit-learn exposes the internals of its tree object. – Yunus King Sep 28 '18 at 14:06