I have a numpy array X that has 3 columns and looks like the following:

array([[    3791,     2629,        0],
       [ 1198760,   113989,        0],
       [ 4120665,        0,        1],
       ...

The first 2 columns are continuous values and the last column is binary (0,1). I would like to apply the StandardScaler class only to the first 2 columns. I am currently doing this the following way:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_subset = scaler.fit_transform(X[:, [0, 1]])
X_last_column = X[:, 2]
X_std = np.concatenate((X_subset, X_last_column[:, np.newaxis]), axis=1)

The output of X_std is then:

array([[-0.34141308, -0.18316715,  0.        ],
       [-0.22171671, -0.17606473,  0.        ],
       [ 0.07096154, -0.18333483,  1.        ],
       ...,

Is there a way to perform this all in one step? I would like to include this as part of a pipeline where it will scale the first 2 columns and leave the last binary column as is.

billypilgrim

4 Answers

I ended up using a class to select columns like this:

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Selects a subset of columns from a 2-D array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, data_array):
        return data_array[:, self.columns]

I then used FeatureUnion in my pipeline as follows to fit StandardScaler only to continuous variables:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

union = FeatureUnion(
    transformer_list=[
        ('continuous', Pipeline([  # Scale the first 2 numeric columns
            ('selector', ItemSelector(columns=[0, 1])),
            ('scaler', StandardScaler())
        ])),
        ('categorical', Pipeline([  # Leave the last binary column as is
            ('selector', ItemSelector(columns=[2]))
        ]))
    ]
)

This worked well for me.
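
For completeness, here is a sketch of how the union above could be embedded in a full pipeline; the LogisticRegression at the end is just a placeholder for whatever estimator you actually want to fit:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# 'union' is the FeatureUnion defined above; the classifier is a placeholder
model = Pipeline([
    ('features', union),
    ('clf', LogisticRegression())
])
model.fit(X, y)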

billypilgrim

Since scikit-learn version 0.20 you can use the sklearn.compose.ColumnTransformer class for exactly this purpose.
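
For example, a minimal sketch using the question's column layout (scale columns 0 and 1, pass column 2 through unchanged):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer(
    transformers=[('scale', StandardScaler(), [0, 1])],
    remainder='passthrough'  # the binary column is left as is
)
X_std = ct.fit_transform(X)

Note that ColumnTransformer places the transformed columns first in its output, which here happens to preserve the original column order.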

00schneider

I can't think of a way to compact your code further, but you can definitely use your transformation in a Pipeline. You would have to define a class extending StandardScaler that performs the transformation only on the columns passed as arguments, keeping the others intact. See the code in this example; you would have to program something similar to ItemSelector.

skd

Inspired by skd's recommendation to extend StandardScaler, I came up with the class below. It's not super efficient or robust (e.g., you would also need to update the inverse_transform method), but hopefully it's a helpful starting point:

import numpy as np
from sklearn.preprocessing import StandardScaler

class StandardScalerSelect(StandardScaler):
    """Scales only the columns named in `cols`; expects a pandas DataFrame."""

    def __init__(self, copy=True, with_mean=True, with_std=True, cols=None):
        self.cols = cols
        super().__init__(copy=copy, with_mean=with_mean, with_std=with_std)

    def transform(self, X):
        # Boolean mask of the columns that should NOT be scaled
        not_transformed_ix = np.isin(np.array(X.columns), np.array(self.cols), invert=True)

        # Still transforms all columns, just for convenience. For larger
        # datasets you'd want to modify self.mean_ and self.scale_ so the
        # dimensions match, and then transform only the subset
        trans = super().transform(X)

        # Restore the original values of the unselected columns
        trans[:, not_transformed_ix] = np.array(X.iloc[:, not_transformed_ix])

        return trans
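
A quick usage sketch, with a hypothetical DataFrame whose 'a' and 'b' columns should be scaled and whose 'flag' column should be left alone:

import pandas as pd

# Hypothetical data mirroring the question's layout
df = pd.DataFrame({'a': [3791, 1198760, 4120665],
                   'b': [2629, 113989, 0],
                   'flag': [0, 0, 1]})

scaler = StandardScalerSelect(cols=['a', 'b'])
X_std = scaler.fit_transform(df)  # 'flag' comes back unchanged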
J. Blauvelt