
Problem setup: I have an imbalanced dataset where 98% of the data belongs to class A and 2% belongs to class B. I trained a DecisionTreeClassifier (from sklearn) with class_weight set to 'balanced' and the following settings:

dtc_settings = {
    'criterion': 'entropy',
    'min_samples_split': 100,
    'min_samples_leaf': 100,
    'max_features': 'auto',
    'max_depth': 5,
    'class_weight': 'balanced'
}
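For reference, a minimal, self-contained way to train with these settings — the data here is synthetic (the original dataset isn't shown), and max_features uses 'sqrt', the classifier equivalent of 'auto', which is deprecated/removed in recent scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# synthetic imbalanced data: roughly 98% class A (0), 2% class B (1)
rng = np.random.RandomState(0)
X = rng.randn(5000, 4)
y = (rng.rand(5000) < 0.02).astype(int)

dtc_settings = {
    'criterion': 'entropy',
    'min_samples_split': 100,
    'min_samples_leaf': 100,
    'max_features': 'sqrt',   # 'auto' is deprecated in recent scikit-learn
    'max_depth': 5,
    'class_weight': 'balanced',
}
dtc = DecisionTreeClassifier(**dtc_settings)
dtc.fit(X, y)
```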

I have no particular reason for setting the criterion to entropy rather than gini; I was just experimenting with the settings.

I used tree.export_graphviz to generate the decision tree diagram below. Here's the code I used:

dot_data = tree.export_graphviz(dtc, out_file=None, feature_names=feature_col, proportion=False)
graph = pydot.graph_from_dot_data(dot_data)[0]  # pydot >= 1.2 returns a list of graphs
graph.write_pdf("test.pdf")

I'm confused about the value list output in the following diagram:

[decision tree diagram exported with proportion=False]

Does the value list variable mean that both classes have equal weight? If so, how is the value list computed for the subsequent nodes in the tree?

Here's another example where I set proportion to True in export_graphviz:

[decision tree diagram exported with proportion=True]

I don't know how to interpret the value list here either. Are the entries class weights? Does this mean the classifier applies those weights to each class when determining the threshold to use at the next node?


1 Answer


The list represents the count of records of each class that have reached that node. Depending on how your target variable is encoded, the first value is the number of class-A records that reached the node and the second value is the number of class-B records (or vice versa).

When proportion is set to True, the list instead shows each class's share of the (weighted) records at that node, so the entries sum to 1.
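One wrinkle worth noting for the question above: because class_weight='balanced' was used, the displayed values are weighted counts, not raw counts. A small sketch (the data here is made up) reading them straight from tree_.value shows why the root's two entries come out equal despite a 250-vs-50 class split:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# made-up imbalanced data: 250 records of class 0, 50 of class 1
X = np.array([[0.], [1.], [2.], [3.], [4.], [5.]] * 50)
y = np.array([0, 0, 0, 0, 0, 1] * 50)

clf = DecisionTreeClassifier(max_depth=1, class_weight='balanced')
clf.fit(X, y)

# tree_.value[0] holds the root node's per-class values; with
# class_weight='balanced' the two entries are equal even though
# the raw class counts (250 vs 50) are not
root = clf.tree_.value[0][0]
print(root)
```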

A decision tree works by searching for the split that best segregates the classes, so it prefers splits that produce something like [0, 100] over splits that produce [50, 50].
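For intuition, the entropy criterion chosen in the question quantifies exactly this preference: it is maximal for an evenly mixed node and zero for a pure one. A quick check with a hand-rolled helper (not part of scikit-learn's public API):

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]          # 0 * log(0) is taken as 0
    return -(p * np.log2(p)).sum()

print(entropy([50, 50]))   # maximal: perfectly mixed classes
print(entropy([0, 100]))   # zero: pure node
```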

  • Thanks for answering. Counts of records would only make sense for a balanced dataset; in this case 98% of the samples belong to class A and 2% to class B. I set class_weight to 'balanced', which makes every entry in the value list the same, so I don't see how the fraction of records can be equal when the data inherently doesn't have that distribution. – OfLettersAndNumbers Jan 20 '17 at 23:17
  • The entropy = 1 shown in that node suggests the same thing: entropy is 1 when the classes are balanced and 0 when they are all one class. Setting class_weight to 'balanced' effectively upweights the minority class until the two classes have equal total weight: http://stackoverflow.com/questions/30972029/how-does-the-class-weight-parameter-in-scikit-learn-work – Metropolis Jan 21 '17 at 00:08
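The 'balanced' formula scikit-learn uses, n_samples / (n_classes * np.bincount(y)), can be checked directly with compute_class_weight; the 98/2 numbers below mirror the question's split, and the variable names are mine:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 98 + [1] * 2)  # 98% class A, 2% class B

weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
# balanced weight per class = n_samples / (n_classes * count):
#   class 0: 100 / (2 * 98) ~ 0.51,  class 1: 100 / (2 * 2) = 25.0

# after weighting, both classes contribute equal total weight, which is
# why the root node's value list shows equal entries
print(weights[0] * 98, weights[1] * 2)  # both ~ 50
```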