Using scaler in Sklearn Pipeline and Cross validation

Question

I previously saw a post with code like this:

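from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X, y: feature matrix and labels, defined elsewhere
scalar = StandardScaler()
clf = svm.LinearSVC()

pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])

cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv=cv)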

My understanding is that when we apply a scaler, we should use 3 of the 4 folds to calculate the mean and standard deviation, and then apply that mean and standard deviation to all 4 folds.

In the above code, how can I know that sklearn follows this strategy? If it does not, meaning that sklearn calculates the mean/std from all 4 folds, should I avoid using the above code?

I do like the above code because it saves a lot of time.

Dave Bowman · Answer 1 · 2020-05-28T23:01:01.550

In the example you gave, I would add an additional step using sklearn.model_selection.train_test_split:

from sklearn.model_selection import train_test_split

folds = 4

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1/folds), random_state=0, stratify=y)

scalar = StandardScaler()
clf = svm.LinearSVC()

pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])

cv = KFold(n_splits=(folds - 1))
scores = cross_val_score(pipeline, X_train, y_train, cv = cv)

I think best practice is to only use the training data set (i.e., X_train, y_train) when tuning the hyperparameters of your model, and the test data set (i.e., X_test, y_test) should be used as a final check, to make sure your model isn't biased towards the validation folds. At that point you would apply the same scaler that you fit on your training data set to your testing data set.
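As a minimal sketch of that final check: after cross-validating on the training set, refit the whole pipeline on X_train and score it on X_test. Because the scaler lives inside the pipeline, it is fitted on the training data and only applied to the test data. The synthetic data from make_classification and the random_state values are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data, assumed here only so the example runs on its own.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

folds = 4
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / folds, random_state=0, stratify=y
)

pipeline = Pipeline([
    ("transformer", StandardScaler()),
    ("estimator", LinearSVC(random_state=0)),
])

# Cross-validate on the training set only.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=KFold(n_splits=folds - 1))

# Final check: fit on all training data, then score on the held-out test set.
pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

# The fitted scaler's mean comes from X_train, not from X_test or the full X.
scaler_uses_train_mean = np.allclose(
    pipeline.named_steps["transformer"].mean_, X_train.mean(axis=0)
)
print(cv_scores, test_score, scaler_uses_train_mean)
```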

Thank you for your help. Sorry that I can't upvote yet because of my reputation. I will once I have enough. — LuLULULU W, Jun 02 '20 at 11:07

Ben Reiniger · Answer 2 · 2020-06-01T01:48:45.487

Yes, this is done properly; this is one of the reasons for using pipelines: all the preprocessing is fitted only on training folds.

Some references.

Section 6.1.1 of the User Guide:

Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

The note at the end of section 3.1.1 of the User Guide:

Data transformation with held out data
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction:
...code sample...
A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:
...

Finally, you can look into the source for cross_val_score. It calls cross_validate, which clones and fits the estimator (in this case, the entire pipeline) on each training split. GitHub link.
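One way to check this empirically, as a rough sketch rather than a description of the library internals: clone and fit the pipeline by hand on each training split, confirm that the fitted scaler's mean equals the mean of that training split, and compare the per-fold scores with what cross_val_score returns. The synthetic data and random_state=0 are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data, assumed here only so the example runs on its own.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline([
    ("transformer", StandardScaler()),
    ("estimator", LinearSVC(random_state=0)),
])
cv = KFold(n_splits=4)

manual_scores = []
scaler_matches_train_fold = []
for train_idx, test_idx in cv.split(X):
    # Fresh, unfitted copy of the whole pipeline for this split.
    fold_pipeline = clone(pipeline)
    fold_pipeline.fit(X[train_idx], y[train_idx])

    # The scaler inside the pipeline saw only the training rows.
    fitted_scaler = fold_pipeline.named_steps["transformer"]
    scaler_matches_train_fold.append(
        np.allclose(fitted_scaler.mean_, X[train_idx].mean(axis=0))
    )

    # Scoring applies the training-fold scaling to the held-out rows.
    manual_scores.append(fold_pipeline.score(X[test_idx], y[test_idx]))

library_scores = cross_val_score(pipeline, X, y, cv=cv)
print(manual_scores, library_scores, scaler_matches_train_fold)
```

If the manual scores match the library scores and every entry of scaler_matches_train_fold is True, the scaler statistics were computed from the training folds only.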

Thank you for your answer. Where can I see the logic of this implementation then? — LuLULULU W, May 31 '20 at 10:39
Thank you very much for your help. Sorry that I can't upvote yet. I will once I have more reputation. — LuLULULU W, Jun 02 '20 at 11:08
