I'm trying to create an sklearn.compose.ColumnTransformer pipeline for transforming both categorical and continuous input data:
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer
df = pd.DataFrame(
    {
        'a': [1, 'a', 1, np.nan, 'b'],
        'b': [1, 2, 3, 4, 5],
        'c': list('abcde'),
        'd': list('aaabb'),
        'e': [0, 1, 1, 0, 1],
    }
)
for col in df.select_dtypes('object'):
    df[col] = df[col].astype(str)
categorical_columns = list('acd')
continuous_columns = list('be')
categorical_transformer = OneHotEncoder(sparse=False, handle_unknown='ignore')
continuous_transformer = 'passthrough'
column_transformer = ColumnTransformer(
    [
        ('categorical', categorical_transformer, categorical_columns),
        ('continuous', continuous_transformer, continuous_columns),
    ],
    sparse_threshold=0.,
    n_jobs=-1
)
X = column_transformer.fit_transform(df)
I want to access the feature names created by this transformation pipeline, so I try this:
column_transformer.get_feature_names()
Which raises:
NotImplementedError: get_feature_names is not yet supported when using a 'passthrough' transformer.
Since I'm not actually doing anything with columns b and e, I could just append them onto X after one-hot encoding all other features. But is there some way I can use one of the scikit-learn base classes (e.g. TransformerMixin, BaseEstimator, or FunctionTransformer) to add a step to this pipeline so I can grab the continuous feature names in a pipeline-friendly way?
Something like this, perhaps:
class PassthroughTransformer(FunctionTransformer, BaseEstimator):
    def fit(self):
        return self

    def transform(self, X):
        self.X = X
        return X

    def get_feature_names(self):
        return self.X.values.tolist()

continuous_transformer = PassthroughTransformer()
column_transformer = ColumnTransformer(
    [
        ('categorical', categorical_transformer, categorical_columns),
        ('continuous', continuous_transformer, continuous_columns),
    ],
    sparse_threshold=0.,
    n_jobs=-1
)
X = column_transformer.fit_transform(df)
But this raises this exception:
TypeError: Cannot clone object '<__main__.PassthroughTransformer object at 0x1132ddf60>' (type <class '__main__.PassthroughTransformer'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
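For reference, here is a minimal sketch of the kind of clone-compatible transformer I have in mind. It is an assumption on my part rather than a confirmed fix: it inherits from TransformerMixin and BaseEstimator only, gives fit the usual (X, y=None) signature, and records the column names at fit time instead of storing the data. The class name and the feature_names_ attribute are hypothetical, it assumes the input is a pandas DataFrame, and I have only checked it on its own, not inside the ColumnTransformer above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class PassthroughTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical identity transformer that remembers its input column names."""

    def fit(self, X, y=None):
        # Record the column names seen at fit time (assumes X is a DataFrame).
        self.feature_names_ = list(X.columns)
        return self

    def transform(self, X):
        # Pass the data through unchanged.
        return X

    def get_feature_names(self):
        # Return the names recorded in fit, not the data values.
        return self.feature_names_


df = pd.DataFrame({'b': [1, 2, 3, 4, 5], 'e': [0, 1, 1, 0, 1]})

# BaseEstimator supplies get_params/set_params, which clone() relies on.
passthrough = clone(PassthroughTransformer())
passthrough.fit(df)
names = passthrough.get_feature_names()
print(names)
```

Storing only the column names keeps the fitted object small, and returning names rather than self.X.values avoids handing back the data itself from get_feature_names.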