You can use the code below: it traverses all the nodes and collects or computes the pieces of information you are interested in. I took the liberty of adjusting the columns a little and changing the enumeration of the nodes:
          1
        /   \
       /     \
    1.1       1.2
    / \       / \
   /   \     /   \
1.1.1 1.1.2 1.2.1 1.2.2
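(For reference: scikit-learn itself numbers the nodes 0 to clf.tree_.node_count - 1 in depth-first order; the function below merely maps those internal ids to the hierarchical labels shown above. For example, with the max_depth=2 tree fitted in the Test section below:

print(clf.tree_.node_count)         # 7 nodes in total
print(clf.tree_.children_left[0])   # 1, the internal id of node "1.1"
print(clf.tree_.children_right[0])  # 4 with the default depth-first builder, i.e. node "1.2"
)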
Function
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def decision_tree_to_tabular(clf, feature_names):
    """Return the fitted tree as a pandas DataFrame, one row per node."""
    total_samples = clf.tree_.n_node_samples[0]  # total number of samples at the root node
    tabular_tree = {
        "Node": [],
        "Depth": [],
        "Type": [],
        "Splitting Feature": [],
        "Splitting Threshold": [],
        "Prediction": [],
        "MSE": [],
        "Number of Samples": [],
        "Proportion of Total Samples": [],
        "Proportion of Parent Samples": [],
    }

    def traverse_nodes(node_id=0, parent_node_id=None, parent_samples=None, current_node_id='1', depth=1):
        samples = clf.tree_.n_node_samples[node_id]
        prop_total_samples = samples / total_samples
        prop_parent_samples = samples / parent_samples if parent_samples else None
        # internal node: leaves have children_left == children_right == -1
        if clf.tree_.children_left[node_id] != clf.tree_.children_right[node_id]:
            tabular_tree["Node"].append(current_node_id)
            tabular_tree["Depth"].append(depth)
            tabular_tree["Type"].append("Node")
            tabular_tree["Splitting Feature"].append(feature_names[clf.tree_.feature[node_id]])
            tabular_tree["Splitting Threshold"].append(clf.tree_.threshold[node_id])
            tabular_tree["Prediction"].append(None)
            tabular_tree["MSE"].append(clf.tree_.impurity[node_id])
            tabular_tree["Number of Samples"].append(samples)
            tabular_tree["Proportion of Total Samples"].append(prop_total_samples)
            tabular_tree["Proportion of Parent Samples"].append(prop_parent_samples)
            traverse_nodes(clf.tree_.children_left[node_id], current_node_id, samples, current_node_id + ".1", depth + 1)   # left child
            traverse_nodes(clf.tree_.children_right[node_id], current_node_id, samples, current_node_id + ".2", depth + 1)  # right child
        else:  # leaf
            tabular_tree["Node"].append(current_node_id)
            tabular_tree["Depth"].append(depth)
            tabular_tree["Type"].append("Leaf")
            tabular_tree["Splitting Feature"].append(None)
            tabular_tree["Splitting Threshold"].append(None)
            tabular_tree["Prediction"].append(clf.tree_.value[node_id].mean())
            tabular_tree["MSE"].append(clf.tree_.impurity[node_id])
            tabular_tree["Number of Samples"].append(samples)
            tabular_tree["Proportion of Total Samples"].append(prop_total_samples)
            tabular_tree["Proportion of Parent Samples"].append(prop_parent_samples)

    traverse_nodes()
    return pd.DataFrame(tabular_tree)
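One thing to be aware of: traverse_nodes recurses once per level, so a tree deeper than Python's default recursion limit (about 1000) would raise a RecursionError. If that ever matters for you, a stack-based traversal visits the same nodes without recursion; a minimal sketch (the iter_nodes helper name is my own):

def iter_nodes(tree):
    # Yields (node_id, parent_samples, label, depth) in depth-first order.
    stack = [(0, None, "1", 1)]
    while stack:
        node_id, parent_samples, label, depth = stack.pop()
        yield node_id, parent_samples, label, depth
        left = tree.children_left[node_id]
        right = tree.children_right[node_id]
        if left != right:  # internal node; leaves have left == right == -1
            samples = tree.n_node_samples[node_id]
            stack.append((right, samples, label + ".2", depth + 1))  # pushed first,
            stack.append((left, samples, label + ".1", depth + 1))   # so left pops first

You would then build the same row dictionary inside a plain for loop over iter_nodes(clf.tree_).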
Test
from sklearn.datasets import fetch_california_housing
# Load the dataset
california = fetch_california_housing()
X = california.data
y = california.target
feature_names = california.feature_names
# Train a DecisionTreeRegressor
clf = DecisionTreeRegressor(random_state=0, max_depth=2).fit(X, y)
# Get the tree as a DataFrame
tabular_tree = decision_tree_to_tabular(clf, feature_names)
# display(tabular_tree)
display(tabular_tree.sort_values("Depth"))
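Note that display only exists in Jupyter/IPython; in a plain Python script you can print the frame instead:

print(tabular_tree.sort_values("Depth").to_string(index=False))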
Output
Node  | Depth | Type | Splitting Feature | Splitting Threshold | Prediction | MSE      | Number of Samples | Proportion of Total Samples | Proportion of Parent Samples
------|-------|------|-------------------|---------------------|------------|----------|-------------------|-----------------------------|-----------------------------
1     | 1     | Node | MedInc            | 5.03515             | NaN        | 1.331550 | 20640             | 1.000000                    | NaN
1.1   | 2     | Node | MedInc            | 3.07430             | NaN        | 0.837354 | 16255             | 0.787548                    | 0.787548
1.2   | 2     | Node | MedInc            | 6.81955             | NaN        | 1.220713 | 4385              | 0.212452                    | 0.212452
1.1.1 | 3     | Leaf | None              | NaN                 | 1.356930   | 0.561155 | 7860              | 0.380814                    | 0.483544
1.1.2 | 3     | Leaf | None              | NaN                 | 2.088733   | 0.836995 | 8395              | 0.406734                    | 0.516456
1.2.1 | 3     | Leaf | None              | NaN                 | 2.905507   | 0.890550 | 3047              | 0.147626                    | 0.694869
1.2.2 | 3     | Leaf | None              | NaN                 | 4.216431   | 0.778440 | 1338              | 0.064826                    | 0.305131
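As a sanity check, the Proportion of Total Samples column of the four leaves sums to 1 (0.380814 + 0.406734 + 0.147626 + 0.064826), as it should, since the leaves partition the dataset. And if you only need a quick textual dump of the splits rather than a DataFrame, scikit-learn's built-in export_text produces one (without the sample proportions, though):

from sklearn.tree import export_text

print(export_text(clf, feature_names=feature_names))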