I'm creating some pipelines using scikit-learn, but I'm having trouble keeping the original variable names instead of names in the transformer_name__feature_name format.
This is the scenario:
- I have a set of transformers, both custom and some from scikit-learn itself
- The transformers used in each step, and the columns each one applies to, are defined in an external file, so I don't know beforehand which transformers will be applied to which columns. For example, in a Python dictionary named data, the definition would look like this:
data = {'steps': [
    {'transformer': MinMaxScaler(), 'columns': ['column1', 'column2'], 'name': 'MinMaxScaler'},
    {'transformer': CustomTransformer(), 'columns': ['column2', 'column5'], 'name': 'CustomTransformer'},
]}
Now I create the pipeline from this definition like this.
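from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# One (name, transformer, columns) tuple per step in the definition
transformers = [(step["name"], step["transformer"], step["columns"])
                for step in data["steps"]]
preprocessor = ColumnTransformer(transformers=transformers,
                                 remainder='passthrough',
                                 verbose_feature_names_out=False)
pipe = Pipeline([('preprocessor', preprocessor)])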
I tried setting verbose_feature_names_out=False to prevent the default prefix naming, but I get an error saying that the column names are not unique.
If I set verbose_feature_names_out=True, then the problem in this example is that column2 is transformed by each step separately rather than in sequence. I end up with columns named MinMaxScaler__column2 and CustomTransformer__column2, so both transformations were applied individually, not one after the other.
In this example, how can I apply both transformers to the specified columns and end up with the original number of columns and the original names, column1, ..., column5?