I have a problem. I want to use StandardScaler(), but my dataset contains some one-hot-encoded values and other values that should not be scaled. If I run StandardScaler() on the whole dataset, every value gets scaled. Is there a way to run it only on certain columns inside a pipeline?
I found this question: One-Hot-Encode categorical variables and scale continuous ones simultaneously, which uses the code below:
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
columns = ['rank']
columns_to_scale = ['gre', 'gpa']
scaler = StandardScaler()
ohe = OneHotEncoder(sparse=False)
# Scale / encode the respective columns ('data' is the input DataFrame from the linked question)
scaled_columns = scaler.fit_transform(data[columns_to_scale])
encoded_columns = ohe.fit_transform(data[columns])
# Concatenate (Column-Bind) Processed Columns Back Together
processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)
So, is there a way to run StandardScaler() inside a pipeline on only certain columns, and have the remaining columns merged back with the scaled ones? Concretely, the pipeline should apply StandardScaler only to the columns 'xy' and 'xyz'.
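From the sklearn docs, ColumnTransformer looks like it could do this (apply StandardScaler to the selected columns and pass the rest through unchanged), but I'm not sure it is the right fit for my pipeline. A minimal sketch, assuming the data is a pandas DataFrame and the column names are placeholders for my real ones:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# placeholder data: 'xy' and 'xyz' should be scaled, 'onehot_0' should stay untouched
X = pd.DataFrame({'xy': [1.0, 2.0, 3.0],
                  'xyz': [10.0, 20.0, 30.0],
                  'onehot_0': [0, 1, 0]})

preprocess = ColumnTransformer(
    transformers=[('scale', StandardScaler(), ['xy', 'xyz'])],
    remainder='passthrough')  # all other columns are passed through unscaled

print(preprocess.fit_transform(X))  # scaled 'xy', 'xyz' first, then the passthrough column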
StandardScaler Class
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class StandardScaler_with_certain_features(BaseEstimator, TransformerMixin):
    def __init__(self, columns_to_scale):
        self.columns_to_scale = columns_to_scale
    def fit(self, X, y=None):
        # fit a StandardScaler on the selected columns of the training data only
        self.scaler_ = StandardScaler().fit(X[self.columns_to_scale])
        return self
    def transform(self, X, y=None):
        # X is expected to be a pandas DataFrame;
        # scale only the selected columns and leave the rest untouched
        X = X.copy()
        X[self.columns_to_scale] = self.scaler_.transform(X[self.columns_to_scale])
        return X
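A quick sanity check of that transformer on a toy DataFrame (the column names and values here are made up):
import pandas as pd

# toy data: only 'xy' and 'xyz' should be standardised, 'flag' is a one-hot column
df = pd.DataFrame({'xy': [1.0, 2.0, 3.0],
                   'xyz': [10.0, 20.0, 30.0],
                   'flag': [0, 1, 0]})

sc = StandardScaler_with_certain_features(columns_to_scale=['xy', 'xyz'])
print(sc.fit_transform(df))  # 'xy' and 'xyz' are scaled, 'flag' is unchanged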
Pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

columns_to_scale = ['xy', 'xyz']
steps = [('standard_scaler', StandardScaler_with_certain_features(columns_to_scale)),
         ('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=100))),
         ('lasso', Lasso(alpha=0.03))]
pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

parameters = {}  # empty grid: GridSearchCV just cross-validates the pipeline as-is
grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
grid.fit(X_train, y_train)

print("score = %3.2f" % (grid.score(X_test, y_test)))
print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))

# Prediction
y_pred = grid.predict(X_test)
print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))