I have a problem. I want to use StandardScaler(), but my dataset contains one-hot-encoded columns and other values that should not be scaled. When I run StandardScaler(), all the values are scaled. Is there a way to apply it only to certain columns inside a pipeline?

I found this question: One-Hot-Encode categorical variables and scale continuous ones simultaneously, which contains the code below:

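import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

columns = ['rank']
columns_to_scale  = ['gre', 'gpa']

scaler = StandardScaler()
ohe    = OneHotEncoder(sparse=False)  # sparse_output=False in scikit-learn >= 1.2

# Scale and encode the separate column groups (df is assumed to be the DataFrame holding them)
scaled_columns  = scaler.fit_transform(df[columns_to_scale])
encoded_columns = ohe.fit_transform(df[columns])

# Concatenate (Column-Bind) Processed Columns Back Together
processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)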

So is there a way to run StandardScaler() inside a pipeline on only certain columns, with the remaining columns merged back with the scaled ones? The pipeline should apply StandardScaler only to the columns 'xy' and 'xyz'.

StandardScaler Class

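from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class StandardScaler_with_certain_features(BaseEstimator, TransformerMixin):
    def __init__(self, columns_to_scale):
        # Store the argument unchanged so get_params() and clone() work in a grid search
        self.columns_to_scale = columns_to_scale

    def fit(self, X, y = None):
        # Fit only on the selected columns of the data the pipeline passes in
        self.scaler_ = StandardScaler()
        self.scaler_.fit(X[self.columns_to_scale])
        return self

    def transform(self, X, y = None):
        # Scale the selected columns of a copy of X and leave the other columns untouched
        X = X.copy()
        X[self.columns_to_scale] = self.scaler_.transform(X[self.columns_to_scale])
        return X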

Pipeline

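from sklearn import metrics
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

columns_to_scale  = ['xy', 'xyz']
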
steps = [('standard_scaler', StandardScaler_with_certain_features(columns_to_scale)),
         ('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=100))),
         ('lasso', Lasso(alpha=0.03))]

pipeline = Pipeline(steps) 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

parameters = { }

grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
grid.fit(X_train, y_train)
                    
print("score = %3.2f" %(grid.score(X_test,y_test)))
print('Training set score: ' + str(grid.score(X_train,y_train)))
print('Test set score: ' + str(grid.score(X_test,y_test)))

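# Prediction
y_pred = grid.predict(X_test)
# Take the square root explicitly; the squared=False argument is not available in every scikit-learn version
print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred) ** 0.5)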
Flavia Giammarino
Test
  • I'm not sure I get your point, since the post you mention already seems to describe common techniques for achieving what you want, but perhaps I'm misinterpreting the question... – amiola Dec 18 '21 at 12:56
  • Another option I see (besides the one given in the answer) might be to create a class that, instead of applying the scaling as you're doing, selects the columns you want to scale; you could then use it in a pipeline that first selects the columns and then applies the scaling. – amiola Dec 18 '21 at 13:02

1 Answer

You can include a ColumnTransformer in the Pipeline in order to apply the StandardScaler only to certain columns. You need to set remainder='passthrough' to make sure that the columns that are not scaled are concatenated with the ones that are scaled.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso

df = pd.DataFrame({
    'y': np.random.normal(0, 1, 100),
    'x': np.random.normal(0, 1, 100),
    'z': np.random.normal(0, 1, 100),
    'xy': np.random.normal(2, 3, 100),
    'xyz': np.random.normal(4, 5, 100),
})

X = df.drop(labels=['y'], axis=1)
y = df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

preprocessor = ColumnTransformer(
    transformers=[('scaler', StandardScaler(), ['xy', 'xyz'])],
    remainder='passthrough'
)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('lasso', Lasso(alpha=0.03))
])

pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
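
One detail worth checking is what the preprocessor actually returns. The sketch below is only an illustration on made-up data (the names `X_demo`, `demo_preprocessor`, `scaled_part` and `passthrough_part` are assumptions, not part of the code above). It suggests that the scaled columns come first in the output and the `remainder='passthrough'` columns are appended after them unchanged, so the column order differs from the input.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: two columns to scale, two to leave alone
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'x': rng.normal(0, 1, 100),
    'z': rng.normal(0, 1, 100),
    'xy': rng.normal(2, 3, 100),
    'xyz': rng.normal(4, 5, 100),
})

demo_preprocessor = ColumnTransformer(
    transformers=[('scaler', StandardScaler(), ['xy', 'xyz'])],
    remainder='passthrough'
)
out = demo_preprocessor.fit_transform(X_demo)

# Output order is: transformed columns first, then the passthrough columns,
# i.e. ['xy', 'xyz', 'x', 'z'] instead of the input order ['x', 'z', 'xy', 'xyz']
scaled_part = out[:, :2]
passthrough_part = out[:, 2:]

print(scaled_part.mean(axis=0))  # close to 0 for both scaled columns
print(scaled_part.std(axis=0))   # close to 1 for both scaled columns
print(np.allclose(passthrough_part, X_demo[['x', 'z']].to_numpy()))
```

If a later step relies on column positions (for example when reading the Lasso coefficients), the reordering shown here is something to keep in mind.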
Flavia Giammarino
  • Thank you very much. But if I run your code I get `ValueError: A given column is not a column of the dataframe KeyError: 'xy'` – Test Dec 18 '21 at 15:01
  • This means that there is no column named `'xy'` in the data frame that holds the feature values. You would need to double-check the column names of `X_train` and `X_test` and make sure to pass the column transformer the correct names of the columns that need to be scaled. – Flavia Giammarino Dec 18 '21 at 15:07
  • Thank you for your quick answer. I really appreciate it. Is there an option to get the same result with X_train and X_test? – Test Dec 18 '21 at 15:13
  • What I mean is that when I run your code with `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)` I run into the above error. So is there a way to use your code with a train/test split? – Test Dec 18 '21 at 15:19
  • I updated the code in my answer, but when you define the column transformer you need to make sure to replace `['xy', 'xyz']` with a list containing the actual names of the columns that need to be scaled. – Flavia Giammarino Dec 18 '21 at 15:19
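
The error discussed in these comments can be caught before fitting. The following is a minimal sketch, assuming the features are held in a pandas DataFrame (string column names are only supported by ColumnTransformer for DataFrames, not for plain NumPy arrays). The helper `missing_columns` and the frame `X_demo` are hypothetical names introduced for illustration.

```python
import pandas as pd

def missing_columns(X, columns):
    """Hypothetical helper: return the requested column names that are absent from X."""
    return [c for c in columns if c not in X.columns]

# Made-up frame whose column names differ from the ones used in the answer
X_demo = pd.DataFrame({'gre': [520, 660], 'gpa': [2.9, 3.7], 'rank_1': [0, 1]})

print(missing_columns(X_demo, ['xy', 'xyz']))   # → ['xy', 'xyz']
print(missing_columns(X_demo, ['gre', 'gpa']))  # → []
```

An empty list means every name passed to the column transformer exists in the frame; a non-empty list names exactly the columns that would trigger the `KeyError`.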