
I want to use sklearn.preprocessing.StandardScaler on a subset of pandas dataframe columns. Outside a pipeline this is trivial:

df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])

But now assume I have a column 'C' in df of type string, and the following pipeline definition:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('standard', StandardScaler())
])

df_scaled = pipeline.fit_transform(df)

How can I tell StandardScaler to only scale columns A and B?

I'm used to SparkML pipelines where the features to be scaled can be passed to the constructor of the scaler component:

normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)

Note: the features column contains a sparse vector holding all the numerical feature columns, as created by Spark's VectorAssembler.
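
For reference, a minimal PySpark sketch of that pattern (the column names "A" and "B" are assumptions here):

from pyspark.ml.feature import VectorAssembler, Normalizer

# assemble the numeric columns into a single vector column
assembler = VectorAssembler(inputCols=["A", "B"], outputCol="features")
# the normalizer then targets that vector column explicitly
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)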


2 Answers


You could check out sklearn-pandas, which offers an integration of pandas DataFrames and sklearn, e.g. with the DataFrameMapper:

from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    (list_of_columnnames, StandardScaler())
])
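
A usage sketch (the sample df is an assumption; per sklearn-pandas, default=None forwards the unselected columns untransformed):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [10.0, 20.0, 30.0],
                   'C': ['x', 'y', 'z']})

# scale only 'A' and 'B'; default=None forwards 'C' unchanged
mapper = DataFrameMapper([(['A', 'B'], StandardScaler())], default=None)
df_scaled = mapper.fit_transform(df)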

If you don't want external dependencies, you could use a simple custom transformer, as I answered here:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

class Columns(BaseEstimator, TransformerMixin):
    """Stateless transformer that selects the given DataFrame columns."""
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        # nothing to learn; column selection is stateless
        return self

    def transform(self, X):
        return X[self.names]

pipe = make_pipeline(Columns(names=list_of_columnnames), StandardScaler())
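
For instance (a sketch reusing the assumed sample df from above), the scaler then only ever sees the selected columns:

pipe = make_pipeline(Columns(names=['A', 'B']), StandardScaler())
scaled = pipe.fit_transform(df)  # numpy array containing scaled 'A' and 'B' only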
Marcus V.
  • 6,323
  • 1
  • 18
  • 33
  • It was really hard for me to decide which answer to finally accept. I went for @Ami Tavory's answer because when using Ibex the semantics are exactly as needed (I can specify the columns the transformer applies to and the columns returned at the pipeline element level). – Romeo Kienzler May 14 '18 at 11:05
  • Fair enough :). In the end, they're all viable answers and it's a matter of taste and/or the specific case I guess! – Marcus V. May 14 '18 at 11:49
  • @MarcusV. After standard scaling, only the columns that were sent to `StandardScaler` are returned to the next step in the pipeline. How do I send only some columns to `StandardScaler`, but pass both the remaining columns and the scaled columns on to the next step? – Naveen Reddy Marthala Oct 08 '20 at 09:39
  • For the `DataFrameMapper` you can use the parameter `default` and set it to `None`; then the remaining columns will be forwarded without transformation. For the Pipeline you could also use `sklearn.compose.ColumnTransformer` (a sketch follows below) or write your own simple transformer as above and simply `return X` in `transform`. – Marcus V. Oct 12 '20 at 12:01
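
A minimal sketch of the ColumnTransformer route mentioned in the comment above (the column names are assumptions; remainder='passthrough' forwards every unlisted column, e.g. the string column 'C', unchanged):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# scale 'A' and 'B'; all other columns are forwarded as-is
ct = ColumnTransformer(
    [('standard', StandardScaler(), ['A', 'B'])],
    remainder='passthrough'
)
df_scaled = ct.fit_transform(df)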

In plain sklearn, you'll need to use FunctionTransformer together with FeatureUnion. That is, your pipeline will look something like:

pipeline = Pipeline([
    ('scale_sum', FeatureUnion(...))
])

where, within the feature union, one branch applies the standard scaler to some of the columns, and the other passes the remaining columns through untouched.
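
A minimal sketch of that layout, assuming 'A' and 'B' are to be scaled and the string column 'C' passed through (the selector lambdas are illustrative helpers, not a library API):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

pipeline = Pipeline([
    ('scale_sum', FeatureUnion([
        # branch 1: select 'A' and 'B', then scale them
        ('scaled', Pipeline([
            ('select', FunctionTransformer(lambda X: X[['A', 'B']], validate=False)),
            ('standard', StandardScaler()),
        ])),
        # branch 2: pass 'C' through untouched
        ('untouched', FunctionTransformer(lambda X: X[['C']], validate=False)),
    ])),
])

df_scaled = pipeline.fit_transform(df)  # branch outputs are concatenated column-wise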


Using Ibex (which I co-wrote precisely to make sklearn and pandas work better together), you could write it as follows:

from ibex.sklearn.preprocessing import StandardScaler
from ibex import trans

pipeline = (trans(StandardScaler(), in_cols=['A', 'B']) + trans(None, ['c', 'd'])) | <other pipeline steps>
  • I can confirm that with a single stage Ibex works as designed and preserves the untouched columns, but when concatenating stages it doesn't work. Can you please have a look at the following issue? https://stackoverflow.com/questions/50329184/pipelining-transformer-stages-in-ibex-column-access-problems-in-scikit-learn-an – Romeo Kienzler May 14 '18 at 11:28
  • @RomeoKienzler Thanks, I answered your question there. – Ami Tavory May 14 '18 at 21:24