I have a trained DecisionTreeClassifier instance, and I am interested in the predicates of the underlying decision tree itself, so I need a clean way to traverse that tree.
As far as I can tell, the only official way to obtain a traversable representation is to export the tree to a graphviz/dot file using scikit-learn's export_graphviz function. I can then parse and analyse the graph representation of the tree using, e.g., a combination of networkx and pydot.
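For reference, the export step I am using looks roughly like this. The toy data here is made up for illustration and is not the data behind the dot file below:

```python
# Minimal sketch of the export step, with made-up toy data (NOT the data
# that produced the dot file shown further down).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = np.array([[10, 1], [12, 5], [20, 2], [25, 6],
              [11, 2], [13, 6], [21, 1], [26, 5]])
y = np.array([0, 2, 1, 3, 0, 2, 1, 3])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the dot source as a string
# instead of writing it to a file on disk.
dot_source = export_graphviz(clf, out_file=None)
print(dot_source)
```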
But...
the content of my particular dot file is as follows:
digraph Tree {
node [shape=box] ;
0 [label="X[0] <= 15.0\ngini = 0.75\nsamples = 8\nvalue = [2, 2, 2, 2]"] ;
1 [label="X[1] <= 3.0\ngini = 0.5\nsamples = 4\nvalue = [2, 0, 2, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="gini = 0.0\nsamples = 2\nvalue = [0, 0, 2, 0]"] ;
1 -> 2 ;
3 [label="gini = 0.0\nsamples = 2\nvalue = [2, 0, 0, 0]"] ;
1 -> 3 ;
4 [label="X[1] <= 3.0\ngini = 0.5\nsamples = 4\nvalue = [0, 2, 0, 2]"] ;
0 -> 4 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
5 [label="gini = 0.0\nsamples = 2\nvalue = [0, 0, 0, 2]"] ;
4 -> 5 ;
6 [label="gini = 0.0\nsamples = 2\nvalue = [0, 2, 0, 0]"] ;
4 -> 6 ;
}
So this looks all fine and dandy, but why are only the edges connected to the root node properly labelled with a boolean value? Should not every edge in this graph have a boolean label/attribute attached to it?
Or, if there is some graphviz/dot convention going on that lets me tell sibling edges apart, what is the rule?
I have noticed in scikit-learn's documentation on the decision tree classifier that the example rendered graphviz decision tree is also missing the boolean labels. As far as my understanding of decision trees goes, this leaves out important information about the tree. So again: are there any conventions that I am missing here? E.g., is the left edge always implicitly True? And if so, how can I tell which edge is the left one from the dot file, given that the file is written out line by line rather than laid out spatially?
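For the moment I work around it like this: I parse the edges out of the dot text in file order and assume that the first child edge listed per parent is the True branch; this matches the explicit labels on the root's edges, but it is exactly the assumption I would like to have confirmed. A sketch using only the standard library:

```python
import re
from collections import defaultdict

# The dot output shown above, embedded verbatim (raw string so the \n
# escapes inside the node labels stay literal).
dot = r"""digraph Tree {
node [shape=box] ;
0 [label="X[0] <= 15.0\ngini = 0.75\nsamples = 8\nvalue = [2, 2, 2, 2]"] ;
1 [label="X[1] <= 3.0\ngini = 0.5\nsamples = 4\nvalue = [2, 0, 2, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="gini = 0.0\nsamples = 2\nvalue = [0, 0, 2, 0]"] ;
1 -> 2 ;
3 [label="gini = 0.0\nsamples = 2\nvalue = [2, 0, 0, 0]"] ;
1 -> 3 ;
4 [label="X[1] <= 3.0\ngini = 0.5\nsamples = 4\nvalue = [0, 2, 0, 2]"] ;
0 -> 4 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
5 [label="gini = 0.0\nsamples = 2\nvalue = [0, 0, 0, 2]"] ;
4 -> 5 ;
6 [label="gini = 0.0\nsamples = 2\nvalue = [0, 2, 0, 0]"] ;
4 -> 6 ;
}"""

# Collect the edges in file order; only the root's edges carry a headlabel.
edge_re = re.compile(r'^(\d+) -> (\d+)(?: \[.*headlabel="(True|False)")?')
children = defaultdict(list)
for line in dot.splitlines():
    m = edge_re.match(line.strip())
    if m:
        children[int(m.group(1))].append((int(m.group(2)), m.group(3)))

# Assumption I'd like confirmed: per parent, the first edge listed is the
# True branch and the second is the False branch. Explicit labels (present
# only on the root's edges) take precedence over the guess.
branch = {}
for src, kids in children.items():
    for guess, (dst, label) in zip(("True", "False"), kids):
        branch[(src, dst)] = label or guess

print(branch)
```

With this file the result is consistent (the root's labelled edges agree with the first-listed-is-True guess), but I have no idea whether export_graphviz guarantees that ordering in general.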