
I have a classification problem in Python and I want to find out which features are the most important for the classification. My data is mixed: some columns are categorical and some are not. I'm applying transformations with OneHotEncoder and Normalizer:

columns_for_vectorization = ['A', 'B', 'C', 'D', 'E']
columns_for_normalization = ['F', 'G', 'H']

transformerVectoriser = ColumnTransformer(transformers=[('Vector Cat', OneHotEncoder(handle_unknown = "ignore"), columns_for_vectorization),
                                                        ('Normalizer', Normalizer(), columns_for_normalization)],
                                          remainder='passthrough') # Default is to drop untransformed columns

After that I'm splitting my data and applying the transformation:

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size = 0.25, random_state=0)

x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

Then I'm training my model:

clf = RandomForestClassifier(max_depth = 5, n_estimators = 50, random_state = 0)
model = clf.fit(x_train, y_train)

And I'm printing the feature importances:

print(model.feature_importances_)

I'm getting results like this:

[1.40910562e-03 1.46133832e-03 4.05058130e-03 3.92205197e-03
 2.13243521e-03 5.78555893e-03 1.51927254e-03 1.14987114e-03
 ...
 6.37840204e-04 7.21061812e-04 5.77726129e-04 5.32382587e-04]

The problem is that in the beginning I had 8 features, but because of the transformation of the categorical data I now have more than 20 features. How can I handle this? How can I know which of the original features is the most important?

taga
  • Have you tried this answer? https://stackoverflow.com/questions/54646709/sklearn-pipeline-get-feature-names-after-onehotencode-in-columntransformer – user3252344 Jun 09 '21 at 03:11
  • That answer shows how to access it if I have the transformer and classifier in a pipeline. How can I do it in my case? – taga Jun 09 '21 at 07:01

2 Answers


Try the following to get the names of the features produced by the 'Vector Cat' transformer:

# named_transformers_ returns the fitted transformer by name
# (in scikit-learn >= 1.0, use get_feature_names_out instead of get_feature_names)
VectorCatNames = list(transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names(columns_for_vectorization))

Then the names of your final (transformed) features can be built as:

feature_names = VectorCatNames + columns_for_normalization
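
If you want importances at the level of your original 8 columns, one option is to sum the importances of the one-hot dummy columns back onto the categorical feature they came from. A minimal sketch, assuming get_feature_names was called with columns_for_vectorization so that every dummy name starts with the original column name followed by an underscore:

import pandas as pd

importances = pd.Series(model.feature_importances_, index=feature_names)

# Map each transformed column back to its original column:
# 'A_cat1', 'A_cat2', ... -> 'A'; the normalized columns keep their own names
original = [name if name in columns_for_normalization else name.split('_')[0]
            for name in feature_names]

print(importances.groupby(original).sum().sort_values(ascending=False))

Summing the dummy importances is a common heuristic; permutation importance on the original columns (see the other answer) is an alternative.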
Rafa

This GitHub gist seems to say you can get the columns after fit/transform with:

import numpy as np
import pandas as pd

numeric_features = features.select_dtypes(np.number).columns  # here these should be your normalized columns ['F', 'G', 'H']

enc_cat_features = transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names()
# Transformed columns come out in transformer order: the one-hot columns first, then the normalized ones
labels = np.concatenate([enc_cat_features, numeric_features])
# x_train was already transformed above; drop .toarray() if the output is not sparse
transformed_df_X = pd.DataFrame(x_train.toarray(), columns=labels)
# To access your data - transformed_df_X
# To access your columns - transformed_df_X.columns

If you can't seem to get it working through the ColumnTransformer because of 'not subscriptable' errors, you can definitely do this directly on the OneHotEncoder object.

Typically I also mess with the names afterwards because OneHotEncoder gives ugly automatic names.
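
For example, something along these lines against the fitted ColumnTransformer from the question (a sketch, assuming an older scikit-learn where get_feature_names is still available; newer versions call it get_feature_names_out):

# Pull the fitted OneHotEncoder out of the ColumnTransformer
ohe = transformerVectoriser.named_transformers_['Vector Cat']

# Passing the original column names gives 'A_<category>' instead of 'x0_<category>'
enc_cat_features = ohe.get_feature_names(columns_for_vectorization)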

Anyway, once you can access the feature names, you can do whatever you like with the feature importances. My sample code for plotting them with the feature names uses permutation_importance instead, but feature_importances_ has the same structure, so you might have some luck and this will work for you too.

from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt
import numpy as np

def plot_feature_importance(model, X_train, y_train, feature_names):
    # Permutation importance works on the fitted model and the (transformed) training data
    result = permutation_importance(model, X_train, y_train, n_repeats=10)
    perm_sorted_idx = result.importances_mean.argsort()

    # feature_names must be an array so it can be indexed with perm_sorted_idx
    feature_names = np.asarray(feature_names)

    fig, ax = plt.subplots(1, 1, figsize=(5, 15))
    ax.boxplot(result.importances[perm_sorted_idx].T, vert=False,
               labels=feature_names[perm_sorted_idx])
    fig.tight_layout()
    plt.show()
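
Called with the objects from the question (and a feature_names list built as in the other answer), that would be something like:

plot_feature_importance(model, x_train, y_train, feature_names)

Note that x_train here is the already-transformed training matrix, i.e. the one the model was fitted on.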

On the UCI ML horse colic set with a random forest, this gave me an output image with the categorical & numerical names like this:

[Box plot showing AbdomenClass4, RectTemp, AbdominalDistension3, CapillaryRefillTime1 and PeripheralPulse3 as the most important features]

user3252344