
In the example below,

import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

pipe = Pipeline([
        ('scale', StandardScaler()),
        ('reduce_dims', PCA(n_components=4)),
        ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))

I am using StandardScaler(); is this the correct way to apply it to the test set as well?

user308827

2 Answers


Yes, this is the right way to do it, but there is a small mistake in your code. Let me break this down for you.

When you use StandardScaler as a step inside a Pipeline, scikit-learn will internally do the fitting and transforming for you.


What happens inside each cross-validation split can be described as follows (a hand-written sketch of these steps follows the list):

  • Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
  • Step 1: The scaler is fitted on the TRAINING data.
  • Step 2: The scaler transforms the TRAINING data.
  • Step 3: The models are fitted/trained using the transformed TRAINING data.
  • Step 4: The scaler is used to transform the TEST data.
  • Step 5: The trained models predict using the transformed TEST data.
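
For illustration only, here is roughly what one of those cross-validation splits looks like if you do it by hand. This is just a sketch of the steps above: GridSearchCV and the Pipeline do all of this for you, X_tr/y_tr and X_te/y_te are placeholder names for one train/test split produced by the cv splitter, and the PCA step would be handled in exactly the same way as the scaler.

scaler = StandardScaler().fit(X_tr)                       # Step 1: fit the scaler on the TRAINING fold only
X_tr_scaled = scaler.transform(X_tr)                      # Step 2: transform the TRAINING fold
clf = SVC(kernel='linear', C=1).fit(X_tr_scaled, y_tr)    # Step 3: train on the transformed TRAINING fold
X_te_scaled = scaler.transform(X_te)                      # Step 4: transform the TEST fold with the SAME fitted scaler
preds = clf.predict(X_te_scaled)                          # Step 5: predict on the transformed TEST fold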

Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train) because GridSearchCV will automatically split the data into training and testing data (this happens internally).


Use something like this:

import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

pipe = Pipeline([
        ('scale', StandardScaler()),
        ('reduce_dims', PCA(n_components=4)),
        ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)

Once you run this code (when you call grid.fit(X, y)), you can access the outcome of the grid search on the fitted grid object. The best_score_ attribute gives the best score observed during the optimization procedure, and best_params_ describes the combination of parameters that achieved it.
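
For example (a minimal sketch; X_new is just a placeholder for any new, unscaled samples you want predictions for):

print(grid.best_params_)      # the parameter combination that won the search
print(grid.best_estimator_)   # the whole Pipeline, refitted on X, y with the best parameters

y_new = grid.predict(X_new)   # runs the full pipeline: the scaler and PCA fitted during
                              # training are applied to X_new before the SVC predicts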


IMPORTANT EDIT 1: if you want to keep a validation set held out from the original dataset, use this:

from sklearn.model_selection import train_test_split

X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = train_test_split(
    X, y, test_size=0.15, random_state=1)

Then use:

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
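
After the search finishes, the refitted best pipeline can then be scored once on the held-out data to get an estimate of future performance (a small sketch using the variables defined above):

print(grid.best_params_)
print(grid.score(X_future_validation, y_future_validation))   # data the grid search never saw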
seralouk
  • this is a great answer @seralouk! thanks! does the input to the trained model need to be Scaled as well? I know that this is not part of the original question, but would be great if you could add it to your answer. – user308827 Jul 22 '18 at 13:24
  • Hello. When you say "input to the trained model need to be Scaled as well" what exactly do you mean? – seralouk Jul 22 '18 at 13:28
  • No. If you `GridSearchCV` on the entire dataset then you have trained your hyperparameters on the entire dataset and no longer have a test set to validate your model. – Him Jul 22 '18 at 13:29
  • @seralouk, I mean can you add a line for `.predict()` as well? If I do something like `grid.predict(X_test)` will it work without needing to specify StandardScaler again? – user308827 Jul 22 '18 at 13:31
  • See step 5. Using `GridSearchCV` on the entire dataset as I said will predict using the test data that will be generated internally. The `print(grid.best_score_)` will return the best cross-validated accuracy score. – seralouk Jul 22 '18 at 13:32
  • You do not need to call `grid.predict(X_test)`. This is done internally when you call `grid.fit(X, y)`. See again step 5 and the code that I posted. `print(grid.cv_results_)` will return the cross-validated accuracy scores! – seralouk Jul 22 '18 at 13:33
  • @Scott I would also leave a test set out as you suggested. However, it's not used to validate anything about the model, only to estimate future performance. – Bert Kellerman Jul 22 '18 at 15:50
  • I think the original poster was wondering whether, if grid.predict(X_test) is called after the whole cross-validation, the scaling is still applied to X_test? And if so, what type of scaling: using the mean/std from X_train, or is a new mean/std computed based on X_test? I was wondering the very same thing. Appreciate any help. – Tartaglia Jan 04 '19 at 08:00
  • As already pointed out, you should still separate a test set and keep it "in a vault", only passing training data to `GridSearchCV`: you do not want to tune the hyperparameters on the whole dataset because by doing so you have no real test set to calculate the actual empirical error on. – gented Jan 06 '19 at 13:33
  • @serafeim This is indeed a great answer. When I usually use a StandardScaler, I use two different instances of StandardScaler to scale my data. I always use one scaler to fit on X_train and one to fit on y_train, then I use each instance to transform X_test and y_test respectively, to avoid any outliers etc. being "leaked" into the training data. Is there a way to do that using a pipeline, and is that extra scaler even necessary? – novawaly Jun 24 '19 at 12:44
  • Just a small comment: there is a typo: StandardScalar should be StandardScaler. I tried to suggest an edit, but it says the suggested edit queue is full. – Amber Elferink Oct 20 '20 at 19:32
  • Do you really need to init the inner components of the Pipeline? – Antonio Sesto Jun 29 '23 at 14:04

Quick answer: Your methodology is correct.


Although the above answer is very good, I would just like to point out some subtleties:

best_score_ [1] is the best cross-validation metric, not the generalization performance of the model [2]. To evaluate how well the best-found parameters generalize, you should call score on the test set, as you've done. You therefore need to start by splitting the data into a training and a test set, fit the grid search only on X_train, y_train, and then score it with X_test, y_test [2].
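
In code, that workflow looks roughly like this (a sketch only; pipe and param_grid are the ones from the accepted answer, and the split size and random_state are arbitrary choices):

from sklearn.model_selection import GridSearchCV, train_test_split

# hold out a test set that the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='accuracy')
grid.fit(X_train, y_train)           # hyperparameters are tuned by cross-validation inside the training set

print(grid.best_score_)              # best cross-validation score (model selection metric)
print(grid.score(X_test, y_test))    # generalization estimate on the untouched test set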


Deep Dive:

A threefold split of the data into training, validation and test sets is one way to prevent overfitting of the parameters during the grid search. GridSearchCV, on the other hand, uses cross-validation within the training set instead of a separate validation set, but this does not replace the test set. This can be verified in [2] and [3].


References:

[1] GridSearchCV

[2] Introduction to Machine Learning with Python

[3] 3.1 Cross-validation: evaluating estimator performance

vcmorini