Point is that, as of today, some transformers do expose a method .get_feature_names_out()
and some others do not, which generates some problems - for instance - whenever you want to create a well-formatted DataFrame
from the np.array outputted by a Pipeline
or ColumnTransformer
instance. (Instead, afaik, .get_feature_names()
was deprecated in latest versions in favor of .get_feature_names_out()
).
For what concerns the transformers that you are using, StandardScaler
belongs to the first category of transformers exposing the method, while both SimpleImputer
and OrdinalEncoder
do belong to the second. The docs show the exposed methods within the Methods paragraphs. As said, this causes problems when doing something like pd.DataFrame(pipeline.fit_transform(X_train), columns=pipeline.get_feature_names_out())
on your pipeline
, but it would cause problems as well on your categorical_preprocessing
and continuous_preprocessing
pipelines (as in both cases at least one transformer lacks of the method) and on the preprocessing
ColumnTransformer
instance.
There's an ongoing attempt in sklearn
to enrich all estimators with the .get_feature_names_out()
method. It is tracked within github issue #21308, which, as you might see, branches in many PRs (each one dealing with a specific module). For instance, issue #21079 for the preprocessing module, which will enrich the OrdinalEncoder
among the others, issue #21078 for the impute module, which will enrich the SimpleImputer
. I guess that they'll be available in a new release as soon as all the referenced PR will be merged.
In the meanwhile, imo, you should go with a custom solution that might fit your needs. Here's a simple example, which do not necessarily resemble your need, but which is meant to give a (possible) way of proceeding:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw', ''],
'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath', 'The Jungle'],
'expert_rating': [5, 3, 4, 5, np.NaN],
'user_rating': [4, 5, 4, np.NaN, 3]})
X

num_cols = X.select_dtypes(include=np.number).columns.tolist()
cat_cols = X.select_dtypes(exclude=np.number).columns.tolist()
# Categorical pipeline
categorical_preprocessing = Pipeline(
[
('Imputation', SimpleImputer(missing_values='', strategy='most_frequent')),
('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
]
)
# Continuous pipeline
continuous_preprocessing = Pipeline(
[
('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
('Scaling', StandardScaler())
]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
(continuous_preprocessing, num_cols),
(categorical_preprocessing, cat_cols),
)
# Final pipeline
pipeline = Pipeline(
[('Preprocessing', preprocessing)]
)
X_trans = pipeline.fit_transform(X)
pd.DataFrame(X_trans, columns= num_cols + cat_cols)
