I want to match the output NumPy array with its feature names so that I can build a new pandas DataFrame.

Here is my pipeline:

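import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler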
# Categorical pipeline
categorical_preprocessing = Pipeline(
[
    ('Imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
]
)
# Continuous pipeline
continuous_preprocessing = Pipeline(
[
     ('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
     ('Scaling', StandardScaler())
]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
     (continuous_preprocessing, continuous_cols),
     (categorical_preprocessing, categorical_cols),
)
# Final pipeline
pipeline = Pipeline(
[('Preprocessing', preprocessing)]
)

Here is how I call it:

X_train = pipeline.fit_transform(X_train)
X_val = pipeline.transform(X_val)
X_test = pipeline.transform(X_test)

Here is what I get when trying to get the feature names:

pipeline['Preprocessing'].transformers_[1][1]['Ordinal encoding'].get_feature_names()

OUT:

AttributeError: 'OrdinalEncoder' object has no attribute 'get_feature_names'

Here is a similar SO question: Sklearn Pipeline: Get feature names after OneHotEncode In ColumnTransformer

Kevin

1 Answer

The point is that, as of today, some transformers expose a .get_feature_names_out() method and others do not. This causes problems whenever you want to create a well-formatted DataFrame from the np.array returned by a Pipeline or ColumnTransformer instance. (As far as I know, .get_feature_names() was deprecated in recent versions in favor of .get_feature_names_out().)

As for the transformers you are using, StandardScaler belongs to the first category (it exposes the method), while both SimpleImputer and OrdinalEncoder belong to the second. The docs list the exposed methods in the Methods section of each class. As said, this causes problems when doing something like pd.DataFrame(pipeline.fit_transform(X_train), columns=pipeline.get_feature_names_out()) on your pipeline. It would also cause problems on your categorical_preprocessing and continuous_preprocessing pipelines (in both cases at least one transformer lacks the method) and on the preprocessing ColumnTransformer instance.
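A quick way to see which category each transformer falls into on your installed version is to check for the method with hasattr. This is only a minimal sketch; the result depends on the scikit-learn release you have, so no particular output is assumed here:

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# The three transformers used in the question's pipeline.
transformers = [SimpleImputer(), OrdinalEncoder(), StandardScaler()]

# Map each class name to whether it exposes get_feature_names_out
# in the scikit-learn version that is currently installed.
support = {
    type(t).__name__: hasattr(t, "get_feature_names_out")
    for t in transformers
}

for name, has_method in support.items():
    print(f"{name}: get_feature_names_out available = {has_method}")
```

If any entry prints False, calling get_feature_names_out() on a Pipeline or ColumnTransformer that contains that transformer will fail as well.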

There is an ongoing effort in sklearn to add the .get_feature_names_out() method to all estimators. It is tracked in GitHub issue #21308, which, as you can see, branches into many PRs (each one dealing with a specific module): for instance, #21079 for the preprocessing module, which will add the method to OrdinalEncoder among others, and #21078 for the impute module, which will add it to SimpleImputer. I guess they will be available in a new release once all the referenced PRs are merged.

In the meantime, imo, you should go with a custom solution that fits your needs. Here is a simple example, which does not necessarily match your use case but is meant to show one possible way of proceeding:

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector

X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw', ''],
                  'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath', 'The Jungle'],
                  'expert_rating': [5, 3, 4, 5, np.nan],  # np.NaN alias was removed in NumPy 2.0
                  'user_rating': [4, 5, 4, np.nan, 3]})
X


num_cols = X.select_dtypes(include=np.number).columns.tolist()
cat_cols = X.select_dtypes(exclude=np.number).columns.tolist()

# Categorical pipeline
categorical_preprocessing = Pipeline(
[
    ('Imputation', SimpleImputer(missing_values='', strategy='most_frequent')),
    ('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
]
)
# Continuous pipeline
continuous_preprocessing = Pipeline(
[
    ('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('Scaling', StandardScaler())
]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (continuous_preprocessing, num_cols),
    (categorical_preprocessing, cat_cols),
)

# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)

X_trans = pipeline.fit_transform(X)

pd.DataFrame(X_trans, columns=num_cols + cat_cols)

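Writing num_cols + cat_cols by hand works because the ColumnTransformer concatenates its outputs in the order the transformers were passed. A slightly more defensive variant reads that order back from the fitted transformers_ attribute. The helper name one_to_one_feature_names is my own, not part of sklearn, and it is only valid under two assumptions: every transformer keeps one output column per input column (true for imputing, scaling and ordinal encoding, not for one-hot encoding), and the columns were given as lists of names:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler


def one_to_one_feature_names(column_transformer):
    """Collect input column names in the output order of a fitted ColumnTransformer.

    Hypothetical helper: assumes each transformer is one-to-one and that
    columns were specified as lists of names.
    """
    names = []
    for _, transformer, columns in column_transformer.transformers_:
        # Dropped columns do not appear in the output, so skip them.
        if isinstance(transformer, str) and transformer == "drop":
            continue
        names.extend(columns)
    return names


# Hypothetical toy data, only for illustration.
frame = pd.DataFrame({
    "city": ["London", "Paris", "London"],
    "expert_rating": [5.0, 3.0, 4.0],
    "user_rating": [4.0, 5.0, 3.0],
})

ct = make_column_transformer(
    (StandardScaler(), ["expert_rating", "user_rating"]),
    (OrdinalEncoder(), ["city"]),
)
values = ct.fit_transform(frame)

names = one_to_one_feature_names(ct)
out = pd.DataFrame(values, columns=names, index=frame.index)
print(out)
```

Passing index=frame.index keeps the original row labels, which helps if the result is later joined back to the source data.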

amiola
Thank you, I did something similar where I just assigned the column names to a newly created df. – Kevin, Feb 09 '22 at 20:07