2

This question is similar to (and might be a simple extension of) the question linked here:

How to extract sklearn decision tree rules to pandas boolean conditions?

Solution from the above link is synthesized below:

First of all let's use the scikit documentation on decision tree structure to get information about the tree that was constructed:

n_nodes = clf.tree_.node_count
children_left = clf.tree_.children_left
children_right = clf.tree_.children_right
feature = clf.tree_.feature
threshold = clf.tree_.threshold

We then define two recursive functions. The first one will find the path from the tree's root to create a specific node (all the leaves in our case). The second one will write the specific rules used to create a node using its creation path:

def find_path(node_numb, path, x):
        path.append(node_numb)
        if node_numb == x:
            return True
        left = False
        right = False
        if (children_left[node_numb] !=-1):
            left = find_path(children_left[node_numb], path, x)
        if (children_right[node_numb] !=-1):
            right = find_path(children_right[node_numb], path, x)
        if left or right :
            return True
        path.remove(node_numb)
        return False


def get_rule(path, column_names):
    mask = ''
    for index, node in enumerate(path):
        #We check if we are not in the leaf
        if index!=len(path)-1:
            # Do we go under or over the threshold ?
            if (children_left[node] == path[index+1]):
                mask += "(df['{}']<= {}) \t ".format(column_names[feature[node]], threshold[node])
            else:
                mask += "(df['{}']> {}) \t ".format(column_names[feature[node]], threshold[node])
    # We insert the & at the right places
    mask = mask.replace("\t", "&", mask.count("\t") - 1)
    mask = mask.replace("\t", "")
    return mask

Finally, we use those two functions to first store the path of creation of each leaf. And then to store the rules used to create each leaf :

Leaves leave_id = clf.apply(X_test)

paths ={} for leaf in np.unique(leave_id):
    path_leaf = []
    find_path(0, path_leaf, leaf)
    paths[leaf] = np.unique(np.sort(path_leaf))

rules = {} for key in paths:
    rules[key] = get_rule(paths[key], pima.columns)

With the data you gave the output is:

rules = {3: "(df['insulin']<= 127.5) & (df['bp']<= 26.450000762939453) & (df['bp']<= 9.100000381469727)  ",  
4: "(df['insulin']<= 127.5) & (df['bp']<= 26.450000762939453) & (df['bp']> 9.100000381469`727)",  
6: "(df['insulin']<= 127.5) & (df['bp']> 26.450000762939453) & (df['skin']<= 27.5)  ",  
7: "(df['insulin']<= 127.5) & (df['bp']> 26.450000762939453 & (df['skin']> 27.5)  ",  
10: "(df['insulin']> 127.5) & (df['bp']<= 28.149999618530273) &(df['insulin']<= 145.5)  ",  
11: "(df['insulin']> 127.5) & (df['bp']<= 28.149999618530273) & (df['insulin']> 145.5)  ",  
13: "(df['insulin']> 127.5) & (df['bp']> 28.149999618530273) & (df['insulin']<= 158.5)  ",  
14: "(df['insulin']> 127.5) & (df['bp']> 28.149999618530273) & (df['insulin']> 158.5)  "}

Since the rules are strings, you can't directly call them using df[rules[3]], you have to use the eval function like so df[eval(rules[3])]

The solution posted above works great at finding the path for each termination node. I am wondering if its possible to store the path for every node (leaves, and termination nodes) in the exact same output format as in the above link (dictionary/list format).

Thanks!

1 Answers1

0

Ok so I figured out a solution to my question (although I don't believe its the best/most efficient way to do this), It also isn't the direct answer to my question (I am not storing the path for each individual node - simply creating a function to be able to parse through the stored information). It is the second part to the solution above and allows you to pull the subsetted data for the specific node you are looking for.

node_id = 3

def datatree_path_summarystats(node_id):
    for k, v in paths.items():
        if node_id in v:
            d = k,v

    ruleskey = d[0]
    numberofsteps = sum(map(lambda x : x<node_id, d[1]))

    for k, v in rules.items():
        if k == ruleskey:
            b = k,v

    stringsubset = b[1]

    datasubset = "&".join(stringsubset.split('&')[:numberofsteps])
    return datasubset

datasubset = datatree_path_summarystats(node_id)

df[eval(datasubset)]

This function runs through the paths that contain the node id you are looking for. It will then split the rule based on that number of nodes creating the logic to subset the dataframe based on that one specific node.