
How can I modify the code to print the decision path with feature names rather than just feature numbers?

import pandas as pd
import pyspark.sql.functions as F
from pyspark.ml import Pipeline, Transformer
from pyspark.sql import DataFrame
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler

data = pd.DataFrame({
    'ball': [0, 1, 2, 3],
    'keep': [4, 5, 6, 7],
    'hall': [8, 9, 10, 11],
    'fall': [12, 13, 14, 15],
    'mall': [16, 17, 18, 10],
    'label': [21, 31, 41, 51]
})

df = spark.createDataFrame(data)

assembler = VectorAssembler(
    inputCols=['ball', 'keep', 'hall', 'fall'], outputCol='features')
dtc = DecisionTreeClassifier(featuresCol='features', labelCol='label')

pipeline = Pipeline(stages=[assembler, dtc]).fit(df)
transformed_pipeline = pipeline.transform(df)

ml_pipeline = pipeline.stages[1]
print(ml_pipeline.toDebugString)

Output:

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_48b3a34f6fb1f1338624) of depth 3 with 7 nodes
  If (feature 0 <= 0.5)
   Predict: 21.0
  Else (feature 0 > 0.5)
   If (feature 0 <= 1.5)
    Predict: 31.0
   Else (feature 0 > 1.5)
    If (feature 0 <= 2.5)
     Predict: 41.0
    Else (feature 0 > 2.5)
     Predict: 51.0
Florian
PolarBear10

2 Answers


One option would be to replace the text in the string manually. We can do this by storing the values we pass as inputCols in a list input_cols, and then replacing each occurrence of the pattern feature i with the ith element of input_cols.

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler

data = pd.DataFrame({
    'ball': [0, 1, 2, 3],
    'keep': [4, 5, 6, 7],
    'hall': [8, 9, 10, 11],
    'fall': [12, 13, 14, 15],
    'mall': [16, 17, 18, 10],
    'label': [21, 31, 41, 51]
})

df = spark.createDataFrame(data)

input_cols = ['ball', 'keep', 'hall', 'fall']
assembler = VectorAssembler(
    inputCols=input_cols, outputCol='features')
dtc = DecisionTreeClassifier(featuresCol='features', labelCol='label')

pipeline = Pipeline(stages=[assembler, dtc]).fit(df)
transformed_pipeline = pipeline.transform(df)

ml_pipeline = pipeline.stages[1]

string = ml_pipeline.toDebugString
for i, feat in enumerate(input_cols):
    string = string.replace('feature ' + str(i), feat)
print(string)

Output:

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4eb084167f2ed4b671e8) of depth 3 with 7 nodes
  If (ball <= 0.0)
   Predict: 21.0
  Else (ball > 0.0)
   If (ball <= 1.0)
    Predict: 31.0
   Else (ball > 1.0)
    If (ball <= 2.0)
     Predict: 41.0
    Else (ball > 2.0)
     Predict: 51.0

Hope this helps!

Florian
  • You are a star! Thank you. I posted a similar question here if you would like to have a look: https://stackoverflow.com/questions/51614077/how-to-print-the-decision-path-of-a-specific-row-in-pyspark – PolarBear10 Aug 01 '18 at 14:12
  • 1
    @Matthew glad I could help. I sadly do not know of an easy way to solve the problem stated there. Only thing I consist of is to make some regex to extract the rules, and then evaluate the conditions. But that sounds quite complicated. – Florian Aug 02 '18 at 09:20

@Florian: the above code will not work when the number of features is large (more than 9), because a plain string replacement of feature 1 also rewrites the prefix of feature 10, feature 11, and so on. Instead, use a regular expression that matches the whole condition:
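To see the failure mode concretely, here is a minimal sketch with a hand-written debug-string fragment and hypothetical feature names, showing how plain str.replace corrupts two-digit indices:

```python
# Plain str.replace matches substrings, so replacing 'feature 1'
# also rewrites the prefix of 'feature 10'.
debug_string = "If (feature 1 <= 0.5) ... If (feature 10 <= 2.5)"
print(debug_string.replace('feature 1', 'keep'))
# → If (keep <= 0.5) ... If (keep0 <= 2.5)
```

Note how `feature 10` has been mangled into `keep0`.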

import re

tree_to_json = mod.stages[-1].toDebugString
for index, feat in index_feature_name_tuple:
    # Match the whole '(feature <index> ...)' condition, so that
    # 'feature 1' does not also match 'feature 10'.
    pattern = r'\((?P<index>feature ' + str(index) + r') (?P<rest>.*)\)'
    tree_to_json = re.sub(pattern, rf'({feat} \g<rest>)', tree_to_json)

print(tree_to_json)
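As a quick sanity check, here is a self-contained sketch of the same regex rewrite applied to a hand-written debug-string fragment (feature names are hypothetical), confirming that two-digit indices stay intact:

```python
import re

# Fragment mimicking toDebugString output, including a two-digit
# feature index that would break plain str.replace.
tree = "If (feature 1 <= 0.5)\n Else (feature 10 > 2.5)"
index_feature_name_tuple = [(1, 'keep'), (10, 'mall')]

for index, feat in index_feature_name_tuple:
    # The trailing space after the index prevents 'feature 1'
    # from matching inside 'feature 10'.
    pattern = r'\((?P<index>feature ' + str(index) + r') (?P<rest>.*)\)'
    tree = re.sub(pattern, rf'({feat} \g<rest>)', tree)

print(tree)
# → If (keep <= 0.5)
#    Else (mall > 2.5)
```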

Here tree_to_json is the raw rule string, which the loop rewrites with the feature names. index_feature_name_tuple is a list of tuples where the first element of each tuple is the index of the feature and the second is the name of the feature. You can obtain it from the metadata of the assembled features column:

df_fitted.schema['features'].metadata["ml_attr"]["attrs"]

where df_fitted is the DataFrame obtained by transforming your data with the fitted pipeline.
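For reference, that metadata is a dict whose groups (numeric, binary, nominal) each hold entries with 'idx' and 'name' keys, so the tuples can be assembled as below. This is a sketch: the attrs dict is hard-coded here to mimic what df_fitted.schema['features'].metadata['ml_attr']['attrs'] returns for the example data, rather than read from a live DataFrame.

```python
# Mimics df_fitted.schema['features'].metadata['ml_attr']['attrs']
# for a VectorAssembler over four numeric columns.
attrs = {
    'numeric': [
        {'idx': 0, 'name': 'ball'},
        {'idx': 1, 'name': 'keep'},
        {'idx': 2, 'name': 'hall'},
        {'idx': 3, 'name': 'fall'},
    ]
}

# Flatten all attribute groups into (index, name) tuples,
# sorted by feature index.
index_feature_name_tuple = sorted(
    (item['idx'], item['name'])
    for group in attrs.values()
    for item in group
)
print(index_feature_name_tuple)
# → [(0, 'ball'), (1, 'keep'), (2, 'hall'), (3, 'fall')]
```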