I am using sklearn DecisionTreeClassifier to predict between two classes.

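from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# SEED, SKF, scorer, pipe and df are defined elsewhere in my script.
clf = DecisionTreeClassifier(class_weight='balanced', random_state=SEED)
params = {'criterion': ['gini', 'entropy'],
          'max_leaf_nodes': [100, 1000]}
grid = GridSearchCV(estimator=clf, param_grid=params, cv=SKF,
                    scoring=scorer,
                    n_jobs=-1, verbose=5)
trans_df = pipe.fit_transform(df.drop(["out"], axis=1))
grid.fit(trans_df, df['out'].fillna(0))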

I need to output the tree for analysis. No problem so far: I walk through all the nodes and extract the rules, more or less following this answer.

from collections import OrderedDict
from sklearn.tree import _tree

def tree_to_flat(tree, feature_names):
    tree_ = tree.tree_
    # Map each node's feature index to a readable name (leaves have no feature).
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    positions = []

    def recurse(node, position=None):
        # Avoid a mutable default argument: start with a fresh dict on each call.
        if position is None:
            position = OrderedDict()
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            # Copy the path so the left and right branches do not share state.
            # A feature tested twice on one path keeps only its last condition.
            ldict = OrderedDict(position)
            ldict[name] = '<=' + str(threshold)
            rdict = OrderedDict(position)
            rdict[name] = '>' + str(threshold)
            recurse(tree_.children_left[node], ldict)
            recurse(tree_.children_right[node], rdict)
        else:
            # Leaf: record the per-class values stored for this node.
            position['value'] = tree_.value[node]
            positions.append(position)

    recurse(0)
    return positions

When I look at the values, they are all non-integer, for example [[296.727705967, 104.03070761]]. The 104.03 is close to the total number of instances in the node (104).

My understanding was that tree_.value[node] gives the number of instances of each of the two classes in the node. How can I end up with non-integer numbers?
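
One way to investigate is to compare the raw and the weighted sample counts that the fitted tree stores per node. The sketch below is only an illustration on made-up data (the arrays X and y and the small max_leaf_nodes are assumptions, not my real setup); it relies on the tree_.n_node_samples and tree_.weighted_n_node_samples attributes. What exactly tree_.value holds may differ between scikit-learn releases, so it is worth checking the documentation for the installed version.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy, imbalanced two-class data (hypothetical, for illustration only).
rng = np.random.RandomState(0)
X = rng.rand(400, 3)
y = np.array([0] * 300 + [1] * 100)

clf = DecisionTreeClassifier(class_weight="balanced",
                             max_leaf_nodes=4, random_state=0)
clf.fit(X, y)

tree_ = clf.tree_
raw_counts = tree_.n_node_samples                # unweighted samples per node
weighted_counts = tree_.weighted_n_node_samples  # sum of sample weights per node

# Print both counts side by side for every node of the tree.
for node in range(tree_.node_count):
    print(node, raw_counts[node], weighted_counts[node])
```

If the two columns differ for the child nodes, the per-node numbers are being computed from sample weights rather than from plain instance counts.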

Thanks in advance

Extratoro
  • Maybe this explains what you observe: https://github.com/scikit-learn/scikit-learn/issues/2703 "the value vector counts the weighted number of samples of each class". – James King Nov 15 '17 at 02:57
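
To make the arithmetic behind a weighted count concrete, here is a minimal pure-Python sketch. It assumes the documented 'balanced' heuristic, where each class gets the weight n_samples / (n_classes * class_count). The helper names and the toy label lists are hypothetical and are not part of scikit-learn.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weight: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}

def weighted_class_counts(node_labels, weights):
    """Sum the class weights of the samples that fall into one node."""
    totals = {cls: 0.0 for cls in weights}
    for label in node_labels:
        totals[label] += weights[label]
    return totals

# Hypothetical training labels: 300 samples of class 0, 100 of class 1.
labels = [0] * 300 + [1] * 100
weights = balanced_class_weights(labels)

# Hypothetical node holding 10 samples of class 0 and 3 of class 1.
node_labels = [0] * 10 + [1] * 3
node_value = weighted_class_counts(node_labels, weights)

print(weights)
print(node_value)
```

With a weight of 2/3 for the majority class, 10 instances contribute about 6.67 to the node total, so a weighted count is generally not an integer even though the underlying instance counts are.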
