I have seen this topic come up many times on the Internet, but I have never seen a complete, comprehensive solution that works across the board for all use cases with the current versions of sklearn. Could somebody please explain how this should be achieved, using the following example?
In this example I'm using the following dataset
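import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv('heart.csv')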
# Preparing individual pipelines for numerical and categorical features
pipe_numeric = Pipeline(steps=[
    ('impute_num', SimpleImputer(
        missing_values=np.nan,
        strategy='median',
        copy=False,
        add_indicator=True)
    )
])
pipe_categorical = Pipeline(steps=[
    ('impute_cat', SimpleImputer(
        missing_values=np.nan,
        strategy='constant',
        fill_value=99999,
        copy=False)
    ),
    ('one_hot', OneHotEncoder(handle_unknown='ignore'))
])
# Combining them into a transformer
transformer_union = ColumnTransformer([
    ('feat_numeric', pipe_numeric, ['age']),
    ('feat_categorical', pipe_categorical, ['cp']),
], remainder='passthrough')
# Fitting the transformer
transformer_union.fit(data)
# We can then apply and get the data in the following way
transformer_union.transform(data)
# And it has the following shape
transformer_union.transform(data).shape
Now comes the main question: how do I efficiently combine the output numpy array with the new column names that result from all the transformations? This example would already require quite some work, yet it is still relatively simple; with bigger pipelines this gets considerably more complicated.
# Transformers object
transformers = transformer_union.named_transformers_
# Categorical features (from transformer)
transformers['feat_categorical'].named_steps['one_hot'].get_feature_names_out()
# Numerical features (from transformer) - no names are available?
transformers['feat_numeric'].named_steps['impute_num']
# All the other columns that were not transformed - no names are available?
transformers['remainder']
I've checked all kinds of examples and there doesn't seem to be any silver bullet for this:
sklearn doesn't seem to support this natively: I can't find a way to get an aligned vector of column names that could easily be combined with the array into a new DataFrame. Perhaps I'm mistaken; could anyone point me to a resource if that's the case?
Some people implement their own custom transformers/pipelines, but this gets unwieldy when you want to build large pipelines.
Are there any other sklearn-related packages that alleviate that issue?
I'm a little bit surprised by how sklearn manages that - in R in the tidymodels ecosystem (it's still under development, but nevertheless), this is handled very easily with the prep and bake methods. I would imagine it could somehow be done similarly.
Inspecting the final output in its entirety is vital to data science work - could anyone advise on the best path?