
I am trying to code a two-class classification decision tree (DT) problem that I previously built in SAS EM, and I now want to do it in sklearn. The target variable is a two-class categorical variable, and there are a few continuous independent variables. In SAS I could specify the "Maximum Number of Branches" for each split, so when it is set to 4, some nodes split into 2 branches and some into 4 (especially for continuous variables). I could not find an equivalent parameter in sklearn. I looked at "max_leaf_nodes", but that controls the total number of leaf nodes in the entire tree. I am sure some of you have faced the same situation and already found a solution. Please help/share. I would really appreciate it.

ArinB

1 Answer


I don't think this option is available in sklearn. You will find this post very useful for your classification DT, as it lists all the options you have available.

I would recommend creating bins for your continuous variables; this way the number of bins limits the number of branches a variable can produce.

Example: if the continuous variable COl1 has values between 1 and 100, you can create 4 bins: 1-25, 26-50, 51-75, and 76-100. Alternatively, you can create the bins based on the median.
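The binning idea can be sketched roughly as follows. This is a minimal illustration, not a drop-in equivalent of the SAS setting: the data is synthetic, the target rule is made up for the demo, and the variable names (`col1`, `fixed_bins`, `quantile_bins`) are assumptions. Note that sklearn's `DecisionTreeClassifier` still makes binary splits, so a 4-bin variable is separated by up to three successive two-way splits rather than one four-way split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for COl1: integers from 1 to 100 (assumed data)
col1 = rng.integers(1, 101, size=500)
# Hypothetical two-class target: class 1 only in the middle of the range
y = ((col1 > 25) & (col1 <= 75)).astype(int)

# Fixed-width bins 1-25, 26-50, 51-75, 76-100 -> codes 0, 1, 2, 3
fixed_bins = np.digitize(col1, bins=[25, 50, 75], right=True)

# Alternative: quantile-based edges (the middle edge is the median)
quantile_edges = np.quantile(col1, [0.25, 0.5, 0.75])
quantile_bins = np.digitize(col1, bins=quantile_edges, right=True)

# Fit on the binned column; the tree can only split between bin codes
tree = DecisionTreeClassifier(random_state=0)
tree.fit(fixed_bins.reshape(-1, 1), y)

n_leaves = tree.get_n_leaves()
train_accuracy = tree.score(fixed_bins.reshape(-1, 1), y)
```

Because the binned column has only four distinct values, the tree cannot produce more than four leaves from it. The trade-off is that the split points are fixed in advance by the bin edges instead of being chosen by the algorithm.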

momo1644
  • Thank you so much, momo1644, for taking the time to answer my question. The DataAspirant link is certainly one of the best notes on building DTs in Python. The binning method will definitely be a kind of solution to this, but we lose the ability to split optimally on values determined by the algorithm. It is interesting that sklearn does not have a way to control the maximum number of branches per split, because this is certainly an important split control for a DT. Well, this is probably an opportunity for a future open-source contribution to the sklearn package by Python gurus. – ArinB May 09 '18 at 01:17