
I have a classification problem in Python and I want to find out which features are the most important for the classification. My data is mixed: some columns are categorical and some are not. I'm applying transformations with OneHotEncoder and Normalizer:

columns_for_vectorization = ['A', 'B', 'C', 'D', 'E']
columns_for_normalization = ['F', 'G', 'H']

transformerVectoriser = ColumnTransformer(transformers=[('Vector Cat', OneHotEncoder(handle_unknown = "ignore"), columns_for_vectorization),
                                                        ('Normalizer', Normalizer(), columns_for_normalization)],
                                          remainder='passthrough') # Default is to drop untransformed columns

After that I'm splitting my data and applying the transformation:

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size = 0.25, random_state=0)

x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

Then I'm training my model:

clf = RandomForestClassifier(max_depth = 5, n_estimators = 50, random_state = 0)
model = clf.fit(x_train, y_train)

And I'm printing the feature importances:

print(model.feature_importances_)

I'm getting results like this:

[1.40910562e-03 1.46133832e-03 4.05058130e-03 3.92205197e-03
 2.13243521e-03 5.78555893e-03 1.51927254e-03 1.14987114e-03
 ...
 6.37840204e-04 7.21061812e-04 5.77726129e-04 5.32382587e-04]

The problem is that in the beginning I had 8 features, but because of the transformation of the categorical data I now have more than 20 features. How can I handle this? How can I know which of the original features is the most important?

taga
  • Have you tried this answer? https://stackoverflow.com/questions/54646709/sklearn-pipeline-get-feature-names-after-onehotencode-in-columntransformer – user3252344 Jun 09 '21 at 03:11
  • That answer shows how to access it if I have the transformer and classifier in a pipeline. How can I do it in my case? – taga Jun 09 '21 at 07:01

2 Answers


Try the following to get the names of the features produced by the 'Vector Cat' transformer:

# named_transformers_ returns the fitted transformer by name
# (in scikit-learn >= 1.0, use get_feature_names_out instead of get_feature_names)
VectorCatNames = list(transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names(columns_for_vectorization))

Then the names of your final (transformed) features can be built as:

feature_names = VectorCatNames + columns_for_normalization
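
If you want importances at the level of your original 8 columns, one option is to sum the importances of the one-hot dummy columns back onto the categorical feature they came from. A minimal sketch, assuming get_feature_names was called with columns_for_vectorization so that every dummy name starts with the original column name followed by an underscore:

import pandas as pd

importances = pd.Series(model.feature_importances_, index=feature_names)

# Map each transformed column back to its original column:
# 'A_cat1', 'A_cat2', ... -> 'A'; the normalized columns keep their own names
original = [name if name in columns_for_normalization else name.split('_')[0]
            for name in feature_names]

print(importances.groupby(original).sum().sort_values(ascending=False))

Summing the dummy importances is a common heuristic; permutation importance on the original columns (see the other answer) is an alternative.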
Rafa

This GitHub gist seems to say you can get the columns after fit/transform with:

import numpy as np
import pandas as pd

numeric_features = features.select_dtypes(np.number).columns  # here these should be your normalized columns ['F', 'G', 'H']

enc_cat_features = transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names()
# Transformed columns come out in transformer order: the one-hot columns first, then the normalized ones
labels = np.concatenate([enc_cat_features, numeric_features])
# x_train was already transformed above; drop .toarray() if the output is not sparse
transformed_df_X = pd.DataFrame(x_train.toarray(), columns=labels)
# To access your data - transformed_df_X
# To access your columns - transformed_df_X.columns

If you can't seem to get it working through the ColumnTransformer because of 'not subscriptable' errors, you can definitely do this directly on the OneHotEncoder object.

Typically I also mess with the names afterwards because OneHotEncoder gives ugly automatic names.
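
For example, something along these lines against the fitted ColumnTransformer from the question (a sketch, assuming an older scikit-learn where get_feature_names is still available; newer versions call it get_feature_names_out):

# Pull the fitted OneHotEncoder out of the ColumnTransformer
ohe = transformerVectoriser.named_transformers_['Vector Cat']

# Passing the original column names gives 'A_<category>' instead of 'x0_<category>'
enc_cat_features = ohe.get_feature_names(columns_for_vectorization)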

Anyway, once you can access the feature names, you can do whatever you like with the feature importances. My sample code for plotting them with the feature names uses permutation_importance instead, but feature_importances_ has the same structure, so you might have some luck and this will work for you too.

from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt
import numpy as np

def plot_feature_importance(model, X_train, y_train, feature_names):
    # Permutation importance works on the fitted model and the (transformed) training data
    result = permutation_importance(model, X_train, y_train, n_repeats=10)
    perm_sorted_idx = result.importances_mean.argsort()

    # feature_names must be an array so it can be indexed with perm_sorted_idx
    feature_names = np.asarray(feature_names)

    fig, ax = plt.subplots(1, 1, figsize=(5, 15))
    ax.boxplot(result.importances[perm_sorted_idx].T, vert=False,
               labels=feature_names[perm_sorted_idx])
    fig.tight_layout()
    plt.show()
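
Called with the objects from the question (and a feature_names list built as in the other answer), that would be something like:

plot_feature_importance(model, x_train, y_train, feature_names)

Note that x_train here is the already-transformed training matrix, i.e. the one the model was fitted on.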

On the UCI ML horse colic set with a random forest, this gave me an output image with the categorical & numerical names like this:

[Box plot showing AbdomenClass4, RectTemp, AbdominalDistension3, CapillaryRefillTime1 and PeripheralPulse3 as the most important features]

user3252344