I have a numpy array X that has 3 columns and looks like the following:

array([[    3791,     2629,        0],
       [ 1198760,   113989,        0],
       [ 4120665,        0,        1],
       ...

The first 2 columns are continuous values and the last column is binary (0,1). I would like to apply the StandardScaler class only to the first 2 columns. I am currently doing this the following way:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_subset = scaler.fit_transform(X[:, [0, 1]])
X_last_column = X[:, 2]
X_std = np.concatenate((X_subset, X_last_column[:, np.newaxis]), axis=1)

The output of X_std is then:

array([[-0.34141308, -0.18316715,  0.        ],
       [-0.22171671, -0.17606473,  0.        ],
       [ 0.07096154, -0.18333483,  1.        ],
       ...,

Is there a way to perform this all in one step? I would like to include this as part of a pipeline where it will scale the first 2 columns and leave the last binary column as is.

billypilgrim

4 Answers

I ended up using a class to select columns like this:

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Selects a subset of columns from a 2-D array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, data_array):
        return data_array[:, self.columns]

I then used FeatureUnion in my pipeline as follows to fit StandardScaler only to continuous variables:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

union = FeatureUnion(
    transformer_list=[
        ('continuous', Pipeline([  # Scale the first 2 numeric columns
            ('selector', ItemSelector(columns=[0, 1])),
            ('scaler', StandardScaler())
        ])),
        ('categorical', Pipeline([  # Leave the last binary column as is
            ('selector', ItemSelector(columns=[2]))
        ]))
    ]
)

This worked well for me.
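
For completeness, here is a sketch of how the union above could be embedded in a full pipeline; the LogisticRegression at the end is just a placeholder for whatever estimator you actually want to fit:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# 'union' is the FeatureUnion defined above; the classifier is a placeholder
model = Pipeline([
    ('features', union),
    ('clf', LogisticRegression())
])
model.fit(X, y)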

billypilgrim

Since scikit-learn version 0.20 you can use the sklearn.compose.ColumnTransformer class for exactly this purpose.
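
For example, a minimal sketch using the question's column layout (scale columns 0 and 1, pass column 2 through unchanged):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer(
    transformers=[('scale', StandardScaler(), [0, 1])],
    remainder='passthrough'  # the binary column is left as is
)
X_std = ct.fit_transform(X)

Note that ColumnTransformer places the transformed columns first in its output, which here happens to preserve the original column order.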

00schneider

I can't think of a way to compact your code further, but you can definitely use your transformation in a Pipeline. You would have to define a class extending StandardScaler that performs the transformation only on the columns passed as arguments, keeping the others intact. See the code in this example; you would have to program something similar to ItemSelector.

skd

Inspired by skd's recommendation to extend StandardScaler, I came up with the class below. It's not super efficient or robust (e.g., you would also need to update the inverse_transform method), but hopefully it's a helpful starting point:

import numpy as np
from sklearn.preprocessing import StandardScaler

class StandardScalerSelect(StandardScaler):
    """Scales only the columns named in `cols`; expects a pandas DataFrame."""

    def __init__(self, copy=True, with_mean=True, with_std=True, cols=None):
        self.cols = cols
        super().__init__(copy=copy, with_mean=with_mean, with_std=with_std)

    def transform(self, X):
        # Boolean mask of the columns that should NOT be scaled
        not_transformed_ix = np.isin(np.array(X.columns), np.array(self.cols), invert=True)

        # Still transforms all columns, just for convenience. For larger
        # datasets you'd want to modify self.mean_ and self.scale_ so the
        # dimensions match, and then transform only the subset
        trans = super().transform(X)

        # Restore the original values of the unselected columns
        trans[:, not_transformed_ix] = np.array(X.iloc[:, not_transformed_ix])

        return trans
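
A quick usage sketch, with a hypothetical DataFrame whose 'a' and 'b' columns should be scaled and whose 'flag' column should be left alone:

import pandas as pd

# Hypothetical data mirroring the question's layout
df = pd.DataFrame({'a': [3791, 1198760, 4120665],
                   'b': [2629, 113989, 0],
                   'flag': [0, 0, 1]})

scaler = StandardScalerSelect(cols=['a', 'b'])
X_std = scaler.fit_transform(df)  # 'flag' comes back unchanged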
J. Blauvelt