How to preserve column names in scikit-learn ColumnTransformer?

Question

I'm creating some pipelines using scikit-learn, but I'm having trouble keeping the original variable names instead of names in the transformer_name__feature_name format.

This is the scenario:

I have a set of transformers, some custom and some from scikit-learn itself.
The transformers used in each step, and the columns each one uses, are defined in an external file, so I don't know beforehand which transformers I'm going to apply or to which columns. For example, in a Python dictionary named data, the list stored under the key "steps" would look like this:

[{'transformer': MinMaxScaler(), 'columns': ['column1', 'column2'], 'name': 'MinMaxScaler'},
 {'transformer': CustomTransformer(), 'columns': ['column2', 'column5'], 'name': 'CustomTransformer'}]

Now I create the pipeline from this definition like this.

transformers = [(step["name"],
                 step["transformer"], step["columns"])
                for step in data["steps"]]

preprocessor = ColumnTransformer(transformers=transformers,
                                 remainder='passthrough',
                                 verbose_feature_names_out=False)

pipe = Pipeline([('preprocessor', preprocessor)])

I tried setting verbose_feature_names_out=False to prevent the default prefix naming, but I get an error saying that the column names are not unique.

If I set verbose_feature_names_out=True, the problem in this example is that column2 goes through the first transformation step but its result is not passed on to the second one. I end up with columns named MinMaxScaler__column2 and CustomTransformer__column2, where each transformation was applied individually, not one after the other.

In this example, how can I apply both transformers to the specified columns and, in the end, be left with the original number of columns and the original names column1,...,column5?

score 2 · Answer 1 · answered Nov 22 '22 at 09:28

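A ColumnTransformer applies each of its transformers to the original input columns independently and concatenates the results. It never chains transformers, so a column listed under two transformers comes out as two separate columns.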

If you want to apply two transformations to column2, you should define a pipeline that applies the MinMaxScaler first and then your CustomTransformer.

I would modify your code as follows:

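from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# CustomTransformer is your own class.
# Each column now appears in exactly one entry.
data = [
    {'transformer': MinMaxScaler(), 'columns': ['column1'], 'name': 'MinMaxScaler'},
    {'transformer': CustomTransformer(), 'columns': ['column5'], 'name': 'CustomTransformer'},
    {
     # Chain both transformers, in order, for the column that needs both.
     'transformer': make_pipeline(MinMaxScaler(), CustomTransformer()),
     'columns': ['column2'],
     'name': 'pipeline'
    }
]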

This defines a new transformer that performs both operations in sequence.
