There is a well-known problem in Tom Mitchell's Machine Learning book: build a decision tree from the following data, where Play ball is the target variable.

[image: table of the training data]

The resulting tree is the following:

[image: the resulting decision tree]

I wonder whether it's possible to build this tree with scikit-learn. I found several examples where a fitted decision tree is rendered with

from graphviz import Source
from sklearn.tree import export_graphviz

Source(export_graphviz(clf, out_file=None))  # clf is a fitted DecisionTreeClassifier

However, it looks like scikit-learn doesn't work well with categorical data: each categorical feature has to be binarized into several columns. As a result, it seems impossible to build the tree exactly as in the picture. Is that correct?

com
  • Text form of the data available? – Bharath M Shetty Dec 01 '17 at 04:20
  • Label encoding? – Adorn Dec 01 '17 at 04:57
  • I have not used it myself, but a quick search turned up this pull request; from the comments, it looks promising. https://github.com/scikit-learn/scikit-learn/pull/4899 – Adorn Dec 01 '17 at 04:58
  • As @Adorn has said, you can one-hot encode the categorical variables, then run scikit-learn and check the results. You just need to interpret the results the right way. – Vivek Kumar Dec 01 '17 at 05:55
  • Possible duplicate of [Passing categorical data to Sklearn Decision Tree](https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree) – Imran Dec 01 '17 at 06:55
  • @Adorn, the problem is that only OneHotEncoder can be used: any numerical value will be treated as a numerical variable, not a categorical one. OneHotEncoder creates dummy binarized columns. So basically it's impossible to get a split into three values (like in the picture) with scikit-learn. – com Dec 01 '17 at 07:25

1 Answer

Yes, it is correct that it is impossible to build such a tree with scikit-learn.

The primary reason is that this is a ternary tree (nodes with up to three children) but scikit-learn implements only binary trees - nodes have exactly two or no children:

cdef class Tree:
    """Array-based representation of a binary decision tree.
...

However, it is possible to get an equivalent binary tree of the form

Outlook == Sunny
    true  => Humidity == High
        true  => no
        false => yes      
    false => Outlook == Overcast
        true  => yes
        false => Wind == Strong
            true  => no
            false => yes 
MB-F