I am trying to extract decision rules to predict terminal nodes and to print code that would use pandas numpy arrays to predict the terminal node numbers. I found a solution that can pull the rules at (How to extract the decision rules from scikit-learn decision-tree?), but I am not sure how to expand it to produce what I need. The link to the solution has a lot of answers. Here is the one I am referring to and description of the question.
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
# dummy data:
df = pd.DataFrame({'col1':[0,1,2,3],'col2':[3,4,5,6],'dv':[0,1,0,1]})
df
# create decision tree
dt = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_leaf=1)
dt.fit(df.loc[:,('col1','col2')], df.dv)
#This function first starts with the nodes (identified by -1 in the child arrays) and then recursively finds the parents.
#I call this a node's 'lineage'. Along the way, I grab the values I need to create if/then/else SAS logic:
def get_lineage(tree, feature_names):
left = tree.tree_.children_left
right = tree.tree_.children_right
threshold = tree.tree_.threshold
features = [feature_names[i] for i in tree.tree_.feature]
# get ids of child nodes
idx = np.argwhere(left == -1)[:,0]
def recurse(left, right, child, lineage=None):
if lineage is None:
lineage = [child]
if child in left:
parent = np.where(left == child)[0].item()
split = 'l'
else:
parent = np.where(right == child)[0].item()
split = 'r'
lineage.append((parent, split, threshold[parent], features[parent]))
if parent == 0:
lineage.reverse()
return lineage
else:
return recurse(left, right, parent, lineage)
for child in idx:
for node in recurse(left, right, child):
print (node)
get_lineage(dt, df.columns)
when you run the code, it will provide this:
(0, 'l', 3.5, 'col2')
1
(0, 'r', 3.5, 'col2')
(2, 'l', 1.5, 'col1')
3
(0, 'r', 3.5, 'col2')
(2, 'r', 1.5, 'col1')
(4, 'l', 2.5, 'col1')
5
(0, 'r', 3.5, 'col2')
(2, 'r', 1.5, 'col1')
(4, 'r', 2.5, 'col1')
6
How can I expand it to print something like this:
df['Terminal_Node_Num']=np.where(df.loc[:,'col2']<=3.5,1,0)
df['Terminal_Node_Num']=np.where(((df.loc[:,'col2']>3.5) & (df.loc[:,'col1']
<=1.5)), 3, df['Terminal_Node_Num'])
df['Terminal_Node_Num']=np.where(((df.loc[:,'col2']>3.5) &
(df.loc[:,'col1']>1.5) & (df.loc[:,'col1']<=2.5)), 5,
df['Terminal_Node_Num'])
df['Terminal_Node_Num']=np.where(((df.loc[:,'col2']>3.5)`enter code here`(df.loc[:,'col1']>1.5) & (df.loc[:,'col1']>2.5)), 6, df['Terminal_Node_Num'])