Name of variables in sklearn pipeline

Question

I need to use DecisionTreeClassifier from the sklearn library. My dataset has multiple columns that I have to dummy-encode. My problem is that the variables in the resulting model get uninformative names such as feature_1, feature_2, ..., feature_n. How do I give them real names? The dataset has about 400 columns, so manual renaming is not practical. Thank you.

import pandas as pd

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, cross_val_score
from yellowbrick.model_selection import RFECV


raw_data = {'sum': [2345, 256,  43, 643, 34 , 23, 95], 
        'department': ['a1', 'a1', 'a3', 'a3', 'a1', 'a2', 'a2'],
        'sex': ['m', 'neudane', 'f', '', 'f', 'f', 'f']}
df = pd.DataFrame(raw_data, columns = ['sum', 'department', 'sex'])

y = {'y': ['cat_a', 'cat_a', 'cat_b', 'cat_c', 'cat_b', 'cat_a', 'cat_a']}

y = pd.DataFrame(y, columns = ['y'])


categorical = ['department', 'sex']

numerical = ['sum']


X = df[categorical + numerical]


categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(sparse=True, handle_unknown="ignore"))
])

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])



basic_preprocessor = ColumnTransformer([
    #("nominal_preprocessor", nominal_pipeline, nominal),
    ("categorical_preprocessor", categorical_pipeline, categorical),
    ("numerical_preprocessor", numerical_pipeline, numerical)
])


preprocessed = basic_preprocessor.fit_transform(X)


X = preprocessed


from sklearn.model_selection import train_test_split
train, test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn import tree
from sklearn.tree import export_text
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train, y_train)


r = export_text(clf)
print(r)



>>>r = export_text(clf)
>>>print(r)
|--- feature_1 <= 0.50
|   |--- feature_7 <= -0.19
|   |   |--- class: cat_b
|   |--- feature_7 >  -0.19
|   |   |--- class: cat_c
|--- feature_1 >  0.50
|   |--- class: cat_a

MrDrFenner · Answer 1 · 2022-01-20T18:28:42.593

Two key components can help make this work. The first gets the encoding names from the OneHotEncoder: OneHotEncoder.get_feature_names_out. Specifically, you use that on your encoder as encoder.get_feature_names_out(). The second component is that sklearn.tree.export_text takes a feature_names argument. So, you can pass those extracted names right into the display system. Other sklearn tree displayers also take that parameter (plot_tree, export_graphviz).

sklearn docs here:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree (follow those links for the tree export/plot functions).

The following should work for you (Edit: I forgot the pipeline part in my example. You can use my_pipe.named_steps[step_name] to extract the OneHotEncoder. You may have to nest that lookup, since you have nested pipelines. I added that example below.):

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

import sklearn
print(sklearn.__version__)  # ---> 1.0.2 for me

ftrs = pd.DataFrame({'Sex'     : ['male', 'female']*3, 
                     'AgeGroup': ['0-20', '0-20', 
                                  '20-60', '20-60',
                                  '80+', '80+']})
tgt  = np.array([1, 1, 1, 1, 0, 1])
encoder = OneHotEncoder()
enc_ftrs = encoder.fit_transform(ftrs)
dtc = DecisionTreeClassifier().fit(enc_ftrs, tgt)

encoder_names = encoder.get_feature_names_out()
print(export_text(dtc, feature_names = list(encoder_names)))

Which for me gives the following output:

|--- AgeGroup_80+ <= 0.50
|   |--- class: 1
|--- AgeGroup_80+ >  0.50
|   |--- Sex_female <= 0.50
|   |   |--- class: 0
|   |--- Sex_female >  0.50
|   |   |--- class: 1

Including the pipeline, it looks like this:

from sklearn.pipeline import Pipeline
pipe = Pipeline([('enc', OneHotEncoder()),
                 ('dtc', DecisionTreeClassifier())])
pipe.fit(ftrs, tgt)
feature_names = list(pipe.named_steps['enc'].get_feature_names_out())
print(export_text(pipe.named_steps['dtc'],
                  feature_names = feature_names))

with the same output.
