Avoid scaling binary columns in scikit-learn StandardScaler

Question

I'm building a linear regression model in scikit-learn and scaling the inputs as a preprocessing step in a scikit-learn Pipeline. Is there any way to avoid scaling the binary columns? They are being scaled along with every other column, so their values end up centered around 0 instead of being 0 or 1. I get values like [-0.6, 0.3], which means an input value of 0 now influences the predictions of my linear model.

Basic code to illustrate:

>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import Ridge
>>> X = np.hstack( (np.random.random((1000, 2)),
                np.random.randint(2, size=(1000, 2))) )
>>> X
array([[ 0.30314072,  0.22981496,  1.        ,  1.        ],
       [ 0.08373292,  0.66170678,  1.        ,  0.        ],
       [ 0.76279599,  0.36658793,  1.        ,  0.        ],
       ...,
       [ 0.81517519,  0.40227095,  0.        ,  0.        ],
       [ 0.21244587,  0.34141014,  0.        ,  0.        ],
       [ 0.2328417 ,  0.14119217,  0.        ,  0.        ]])
>>> scaler = StandardScaler()
>>> scaler.fit_transform(X)
array([[-0.67768374, -0.95108883,  1.00803226,  1.03667198],
       [-1.43378124,  0.53576375,  1.00803226, -0.96462528],
       [ 0.90632643, -0.48022732,  1.00803226, -0.96462528],
       ...,
       [ 1.08682952, -0.35738315, -0.99203175, -0.96462528],
       [-0.99022572, -0.56690563, -0.99203175, -0.96462528],
       [-0.91994001, -1.25618613, -0.99203175, -0.96462528]])

I'd love for the output of the last line to be:

>>> scaler.fit_transform(X, dont_scale_binary_or_something=True)
array([[-0.67768374, -0.95108883,  1.        ,  1.        ],
       [-1.43378124,  0.53576375,  1.        ,  0.        ],
       [ 0.90632643, -0.48022732,  1.        ,  0.        ],
       ...,
       [ 1.08682952, -0.35738315,  0.        ,  0.        ],
       [-0.99022572, -0.56690563,  0.        ,  0.        ],
       [-0.91994001, -1.25618613,  0.        ,  0.        ]])

Is there any way to accomplish this? I suppose I could select the non-binary columns, transform only those, and write the transformed values back into the array, but I'd like it to play nicely with the scikit-learn Pipeline workflow, so that I can do something like:

clf = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])
clf.set_params(scaler__dont_scale_binary_features=True, ridge__alpha=0.04).fit(X, y)

miindlek · Answer 1 · 2016-06-08T08:10:57.470

You should create a custom scaler which ignores the last two columns while scaling.

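import numpy as np
from sklearn.base import TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(TransformerMixin):
    def __init__(self):
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        # Fit only on the continuous columns (everything except the last two).
        self.scaler.fit(X[:, :-2], y)
        return self

    def transform(self, X):
        X_head = self.scaler.transform(X[:, :-2])
        # np.concatenate expects the arrays as a single sequence argument.
        return np.concatenate((X_head, X[:, -2:]), axis=1)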

score 7 · Answer 2 · answered Dec 29 '16 at 00:42

I'm posting code that I adapted from @miindlek's response in case it is helpful to others. I encountered an error when I didn't include BaseEstimator. Thank you again @miindlek. Below, bin_vars_index is an array of column indexes for the binary variables and cont_vars_index is the same for the continuous variables that you want to scale.

from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class CustomScaler(BaseEstimator, TransformerMixin):
    # note: returns the feature matrix with the binary columns ordered first
    def __init__(self, bin_vars_index, cont_vars_index, copy=True, with_mean=True, with_std=True):
        # Store every constructor argument so get_params() and clone() work.
        self.bin_vars_index = bin_vars_index
        self.cont_vars_index = cont_vars_index
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # StandardScaler takes keyword-only arguments in current scikit-learn releases.
        self.scaler = StandardScaler(copy=self.copy, with_mean=self.with_mean, with_std=self.with_std)
        self.scaler.fit(X[:, self.cont_vars_index], y)
        return self

    def transform(self, X, y=None, copy=None):
        # StandardScaler.transform no longer accepts y, so only X and copy are passed.
        X_tail = self.scaler.transform(X[:, self.cont_vars_index], copy=copy)
        return np.concatenate((X[:, self.bin_vars_index], X_tail), axis=1)
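
If the binary columns are not known in advance, the two index arrays could be derived from the data. The following is a minimal sketch, not part of the original answer: split_binary_columns is a hypothetical helper that treats a column as binary when every value in it is 0 or 1, and the sample X mirrors the array from the question.

```python
import numpy as np

def split_binary_columns(X):
    """Return (bin_vars_index, cont_vars_index) for a 2-D array.

    Hypothetical helper: a column counts as binary when all of its
    values are 0 or 1.
    """
    X = np.asarray(X)
    is_binary = np.isin(X, (0, 1)).all(axis=0)   # one boolean per column
    bin_vars_index = np.flatnonzero(is_binary)
    cont_vars_index = np.flatnonzero(~is_binary)
    return bin_vars_index, cont_vars_index

# Two continuous columns followed by two 0/1 columns, as in the question.
rng = np.random.default_rng(0)
X = np.hstack((rng.random((1000, 2)), rng.integers(2, size=(1000, 2))))

bin_vars_index, cont_vars_index = split_binary_columns(X)
print(bin_vars_index, cont_vars_index)
```

The resulting arrays can then be passed as bin_vars_index and cont_vars_index to the CustomScaler above. Note that a constant column of all zeros or all ones would also be classified as binary by this check.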

score 5 · Answer 3 · answered Mar 12 '18 at 09:52

Your pipeline should change into:

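from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion

# cat_indices / num_indices: positions of the binary and continuous columns,
# e.g. [2, 3] and [0, 1] for the X in the question.
pipeline = Pipeline(steps=[
    ('feature_processing', FeatureUnion(transformer_list=[
        # binary columns: selected and passed through unchanged
        ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),
        # numeric columns: selected, then scaled
        ('numeric', Pipeline(steps=[
            ('select', FunctionTransformer(lambda data: data[:, num_indices])),
            ('scale', StandardScaler()),
        ])),
    ])),
    ('clf', Ridge()),
])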

score 4 · Answer 4 · answered Jan 04 '17 at 10:42

I have adapted @J_C's code a bit to work with a pandas DataFrame. You pass the names of the columns you want to scale, and the result keeps the initial column order.

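from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # Store every constructor argument so get_params() and clone() work.
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # StandardScaler takes keyword-only arguments in current scikit-learn releases.
        self.scaler = StandardScaler(copy=self.copy, with_mean=self.with_mean, with_std=self.with_std)
        self.scaler.fit(X[self.columns], y)
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # As written, this assumes X has the default 0..n-1 index, because pd.concat aligns rows on index labels.
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        # DataFrame.ix was removed from pandas; .loc with a boolean column mask selects the same columns.
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]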

Usage:

scale = CustomScaler(columns=['duration', 'num_operations'])
scaled = scale.fit_transform(churn_d)

score 3 · Answer 5 · edited Feb 09 '17 at 16:16

I found that the concatenation in @Vitaliy Grabovets's DataFrame version doesn't work properly unless you specify the index for X_scaled. So the relevant line now reads:

X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
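
To see why the index matters, here is a small sketch with made-up data (the column names and values are assumptions for illustration only). pd.concat with axis=1 aligns rows on index labels, so a frame built without index=X.index only lines up when X happens to have the default 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Toy frame with a non-default index, e.g. what is left after filtering rows.
X = pd.DataFrame({"duration": [10.0, 20.0, 30.0], "flag": [0, 1, 0]},
                 index=[5, 6, 7])
scaled_values = np.array([[-1.0], [0.0], [1.0]])  # stand-in for scaler output

# Without index=..., the new frame is labelled 0, 1, 2, so concat cannot
# match its rows to labels 5, 6, 7 and pads both sides with NaN.
misaligned = pd.concat(
    [X[["flag"]], pd.DataFrame(scaled_values, columns=["duration"])],
    axis=1)

# With index=X.index, the rows line up again.
aligned = pd.concat(
    [X[["flag"]],
     pd.DataFrame(scaled_values, columns=["duration"], index=X.index)],
    axis=1)

print(misaligned.shape)  # (6, 2)
print(aligned.shape)     # (3, 2)
```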

answered Feb 09 '17 at 15:25 by jbn · edited Feb 09 '17 at 16:16 by Donald Duck

score 2 · Answer 6 · answered Aug 29 '20 at 08:00

This probably makes it easier for you:

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X = np.hstack((np.random.random((1000, 2)), np.random.randint(2, size=(1000, 2))))
    df = pd.DataFrame(X, columns=["num_1", "num_2", "binary_1", "binary_2"])

    num_attribs = ["num_1", "num_2"]
    binary_attribs = ["binary_1", "binary_2"]

    # scale only the numeric columns
    num_pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])

    full_pipeline = ColumnTransformer([
        ("num_cols", num_pipeline, num_attribs),
        # drop="first" keeps a single 0/1 column per binary feature
        ("binary_cols", OneHotEncoder(drop="first"), binary_attribs),
    ])

    full_pipeline.fit_transform(df)
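
Since the binary columns are already 0/1, a variant of the same idea is to leave them out of the transformer list and let ColumnTransformer copy them through with remainder="passthrough". The sketch below is not from the original answer: it assumes the same column names as above and uses a made-up random target y only so that the pipeline with Ridge from the question can be fitted.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.hstack((rng.random((1000, 2)), rng.integers(2, size=(1000, 2))))
df = pd.DataFrame(X, columns=["num_1", "num_2", "binary_1", "binary_2"])
y = rng.random(1000)  # made-up target, only so the pipeline can be fitted

preprocess = ColumnTransformer(
    [("num_cols", StandardScaler(), ["num_1", "num_2"])],
    remainder="passthrough",  # columns not listed above are copied unchanged
)
clf = Pipeline([("preprocess", preprocess), ("ridge", Ridge())])
clf.set_params(ridge__alpha=0.04).fit(df, y)

# Scaled numeric columns come first, passed-through binary columns after them.
Xt = clf.named_steps["preprocess"].transform(df)
print(Xt[:3])
```

This keeps the set_params workflow from the question; the binary columns appear after the scaled ones in the transformed array.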
