
I'm wondering how to extract feature importances from a Random Forest in scikit-learn, together with the feature names, when the classifier is used in a pipeline with preprocessing.

This related question deals with extracting only the importance values, not the names: How to extract feature importances from an Sklearn pipeline

From the brief research I've done, this doesn't seem to be possible in scikit-learn, but I hope I'm wrong.

I also found a package called ELI5 (https://eli5.readthedocs.io/en/latest/overview.html) that is supposed to fix this issue with scikit-learn, but it didn't solve my problem: the feature names it output for me were x1, x2, etc., not the actual feature names.
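For illustration, this is roughly the call I made (a minimal sketch; pipeline and its 'classifier' step stand in for my actual setup):

import eli5

# Explaining the classifier step directly: ELI5 has no access to the
# original column names after the preprocessing step, so it falls back
# to placeholder names like x0, x1, x2, ...
eli5.explain_weights(pipeline.named_steps['classifier'])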

As a workaround, I did all my preprocessing outside the pipeline, but I would love to know how to do it inside the pipeline.
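The workaround looks roughly like this (a minimal sketch, assuming raw features X and target y, with pandas get_dummies standing in for my actual encoding):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Preprocess outside the pipeline so the DataFrame keeps its column names
X_encoded = pd.get_dummies(X)  # one-hot encodes the categoricals, preserving names

clf = RandomForestClassifier()
clf.fit(X_encoded, y)

# The importances line up with X_encoded.columns
importances = pd.Series(clf.feature_importances_, index=X_encoded.columns)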

If I can provide any helpful code, let me know in the comments.

  • I guess this really depends what preprocessing you are talking about... Could you specify? – MaximeKan Mar 20 '19 at 01:21
  • From the documentation, the feature_names option is available for some functions. Hope it helps: https://eli5.readthedocs.io/en/latest/_modules/eli5/explain.html?highlight=feature%20names – TavoGLC Mar 20 '19 at 05:16
  • Show the code that you are using and want to transform it to pipeline. – Vivek Kumar Mar 20 '19 at 07:39

1 Answer


Here is an example with XGBoost that shows how to get the feature importances together with the feature names:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn import preprocessing

# Numerical features: impute with the median, then scale robustly
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', preprocessing.RobustScaler())])

# Categorical features: impute with the most frequent value, then one-hot encode
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', preprocessing.OneHotEncoder(categories='auto',
                                           sparse=False,
                                           handle_unknown='ignore'))])

from sklearn.compose import ColumnTransformer

numerical_columns = X.columns[X.dtypes != 'category'].tolist()
categorical_columns = X.columns[X.dtypes == 'category'].tolist()

# Apply each preprocessing pipeline to its own set of columns
pipeline_procesado = ColumnTransformer(transformers=[
    ('numerical_preprocessing', num_transformer, numerical_columns),
    ('categorical_preprocessing', cat_transformer, categorical_columns)],
    remainder='passthrough',
    verbose=True)

from xgboost import XGBClassifier

# Create the classifier
classifier = XGBClassifier()

# Create the overall model as a single pipeline
pipeline = Pipeline([("transform_inputs", pipeline_procesado),
                     ("classifier", classifier)])

pipeline.fit(X_train, y_train)

# Expanded column names produced by the one-hot encoder
onehot_columns = (pipeline
                  .named_steps['transform_inputs']
                  .named_transformers_['categorical_preprocessing']
                  .named_steps['onehot']
                  .get_feature_names(input_features=categorical_columns))
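As the comment below points out, get_feature_names was deprecated in scikit-learn 1.0 and removed in 1.2; on recent versions the equivalent call is get_feature_names_out (same pipeline, only the method name changes):

# scikit-learn >= 1.0: get_feature_names_out replaces get_feature_names
onehot_columns = (pipeline
                  .named_steps['transform_inputs']
                  .named_transformers_['categorical_preprocessing']
                  .named_steps['onehot']
                  .get_feature_names_out(input_features=categorical_columns))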


# You can also get the values transformed by the (already fitted) preprocessing step
X_values = pipeline.named_steps['transform_inputs'].transform(X_train)

# ColumnTransformer outputs the numerical columns first, then the one-hot
# columns, matching the order of the transformers above
df_from_array_pipeline = pd.DataFrame(X_values, columns=numerical_columns + list(onehot_columns))

# Map each importance value back to its feature name
feature_importance = pd.Series(
    data=pipeline.named_steps['classifier'].feature_importances_,
    index=numerical_columns + list(onehot_columns))
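From there you can, for example, sort the series to see the most important features by name (a small usage sketch):

# Ten most important features, by name
print(feature_importance.sort_values(ascending=False).head(10))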
  • Thanks. One comment: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead. – ranemak May 10 '23 at 14:44