I have seen this topic come up many times on the Internet, but I have never seen a complete, comprehensive solution that works across the board for all use cases with the current library versions of sklearn. Could somebody please explain how this should be achieved, using the following example?

In this example I'm using the following dataset:

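import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv('heart.csv')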

# Preparing individual pipelines for numerical and categorical features
pipe_numeric = Pipeline(steps=[
    ('impute_num', SimpleImputer(
        missing_values = np.nan, 
        strategy = 'median', 
        copy = False, 
        add_indicator = True)
    )
])

pipe_categorical = Pipeline(steps=[
    ('impute_cat', SimpleImputer(
        missing_values = np.nan, 
        strategy = 'constant', 
        fill_value = 99999,
        copy = False)
    ),
    ('one_hot', OneHotEncoder(handle_unknown='ignore'))
])

# Combining them into a transformer
transformer_union = ColumnTransformer([
    ('feat_numeric', pipe_numeric, ['age']),
    ('feat_categorical', pipe_categorical, ['cp']),
], remainder = 'passthrough')

# Fitting the transformer
transformer_union.fit(data)

# We can then apply and get the data in the following way
transformer_union.transform(data)

# And it has the following shape
transformer_union.transform(data).shape

Now comes the main question: how can the output numpy array be efficiently combined with the new column names that result from all the transformations? This example, even though it would require quite some work, is still relatively simple, but things can get far more complicated with bigger pipelines.

# Transformers object
transformers = transformer_union.named_transformers_

# Categorical features (from transformer)
transformers['feat_categorical'].named_steps['one_hot'].get_feature_names()

# Numerical features (from transformer) - no names are available? 
transformers['feat_numeric'].named_steps['impute_num']

# All the other columns that were not transformed - no names are available?
transformers['remainder']

I've checked all kinds of different examples, and there doesn't seem to be any silver bullet for this:

  1. sklearn doesn't support this natively - there's no way to get an aligned vector of column names that could be easily combined with the array into a new DF, but perhaps I'm mistaken - could anyone point me to a resource if that's the case?

  2. Some people have implemented their own custom transformers/pipelines, but this gets a bit hectic when you want to build large pipelines

  3. Are there any other sklearn-related packages that alleviate that issue?

I'm a little bit surprised by how sklearn manages that - in R in the tidymodels ecosystem (it's still under development, but nevertheless), this is handled very easily with the prep and bake methods. I would imagine it could somehow be done similarly.

Inspecting the final output in its entirety is vital to data science work - could anyone advise on the best path?

Dulaj Kulathunga

1 Answer

The sklearn devs are working on this; discussion spans several SLEPs and many Issues. There is already some progress, with some transformers implementing get_feature_names and others having internal attributes tracking column names when the input was a pandas dataframe. ColumnTransformer does have a get_feature_names, but Pipeline does not, so it would fail on your example.

The most complete current solution seems to be sklearn-pandas:
https://github.com/scikit-learn-contrib/sklearn-pandas

Another interesting approach is hidden away inside eli5. In their explain_weights, they have a generic function transform_feature_names. It has a few specialized dispatches, but otherwise tries to call get_feature_names; most notably, there is a dispatch for Pipeline. Unfortunately, currently this will fail on a ColumnTransformer with a Pipeline as a transformer; see https://stackoverflow.com/a/62124484/10495893 for an example and a potential workaround.

Ben Reiniger