I am using sklearn's DecisionTreeClassifier for a two-class prediction problem:
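from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# SEED, SKF, scorer, pipe and df are defined earlier (not shown).
clf = DecisionTreeClassifier(class_weight='balanced', random_state=SEED)
params = {
    'criterion': ['gini', 'entropy'],
    'max_leaf_nodes': [100, 1000],
}
grid = GridSearchCV(estimator=clf, param_grid=params, cv=SKF,
                    scoring=scorer,
                    n_jobs=-1, verbose=5)
trans_df = pipe.fit_transform(df.drop(["out"], axis=1))
grid.fit(trans_df, df['out'].fillna(0))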
I need to output the tree for analysis. No problem so far: I walk through all the nodes and extract the rules, more or less following this answer.
from collections import OrderedDict
from sklearn.tree import _tree

def tree_to_flat(tree, feature_names):
    tree_ = tree.tree_
    # Map each node to the name of the feature it splits on (leaves have none).
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    positions = []

    def recurse(node, depth, position=None):
        # Avoid sharing one mutable default dict across calls.
        if position is None:
            position = OrderedDict()
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            # Copy the path so far, then add this node's split for each branch.
            ldict = dict(position)
            ldict[name] = '<=' + str(threshold)
            rdict = dict(position)
            rdict[name] = '>' + str(threshold)
            recurse(tree_.children_left[node], depth + 1, ldict)
            recurse(tree_.children_right[node], depth + 1, rdict)
        else:
            # Leaf: store the per-class values recorded for this node.
            position['value'] = tree_.value[node]
            positions.append(position)

    recurse(0, 1)
    return positions
When I look at the values, they are all non-integer, like [[296.727705967, 104.03070761]]. The 104.03 is close to the total number of instances in the node (104).
My understanding was that tree_.value[node] gives the number of instances in each of the two classes. How can I end up with non-integer numbers?
Thanks in advance
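
One possible explanation, sketched in plain Python rather than with scikit-learn itself: with class_weight='balanced', each sample is weighted by n_samples / (n_classes * number of samples in its class), and tree_.value is computed from those sample weights rather than from raw instance counts, so its entries need not be integers. The helper names and the toy data below are hypothetical and only illustrate the weighting. Depending on the scikit-learn version, tree_.value may hold weighted counts or weighted fractions, so this is an illustration of the mechanism, not a reproduction of the library's internals.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weight: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_class_counts(node_labels, weights):
    """Sum the sample weights per class for the samples that reach one node."""
    totals = {cls: 0.0 for cls in weights}
    for label in node_labels:
        totals[label] += weights[label]
    return totals


# Hypothetical toy data: 7 samples of class 0 and 3 samples of class 1.
y = [0] * 7 + [1] * 3
weights = balanced_class_weights(y)

# Hypothetical node that receives 5 samples of class 0 and 1 of class 1.
node = [0] * 5 + [1]
node_counts = weighted_class_counts(node, weights)

print("class weights:", weights)
print("raw samples in node:", len(node))
print("weighted per-class values in node:", node_counts)
```

If the raw number of instances per node is what is needed, tree_.n_node_samples holds the unweighted sample count of each node, and fitting without class_weight should give unweighted values in tree_.value.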