StandardScaler to whole training dataset or to individual folds for Cross Validation

Question

I'm using cross_val_score and KFold to assess the impact of applying StandardScaler at different points in pre-processing. Specifically, I want to know whether scaling the entire training dataset before cross validation introduces data leakage, and how the result compares to scaling inside a Pipeline (where the scaler is fit only on the training folds).

My current process is as follows:

Experiment A

Import the Boston housing dataset from sklearn.datasets and split it into data (X) and target (y)
Create a Pipeline (sklearn.pipeline) that applies StandardScaler before applying linear regression
Specify the cross validation method as KFold with 5 folds
Perform cross validation (cross_val_score) using the above Pipeline and KFold method and observe the score

Experiment B

Use the same Boston housing data as above
Call fit_transform with StandardScaler on the entire dataset
Use cross_val_score to perform cross validation, again with 5 folds, but this time pass LinearRegression directly rather than a Pipeline
Compare the scores here to Experiment A

The scores obtained are identical (to around 13 decimal places), which surprises me, since Experiment B should introduce data leakage during cross validation.

I've seen posts stating that it doesn't matter whether scaling is done on the entire training set before cross validation. If this is true, I'd like to understand why. If it isn't true, I'd like to understand why the scores can still be so similar despite the data leakage.

See my test code below:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

np.set_printoptions(precision=15)

# Note: load_boston was removed in scikit-learn 1.2, so this requires an older release.
boston = datasets.load_boston()
X = boston["data"]
y = boston["target"]


class StScaler(StandardScaler):
    """StandardScaler that reports how many rows it is fit on."""

    def fit_transform(self, X, y=None):
        print('Length of data on which scaler is fit =', len(X))
        super().fit(X, y)
        # print('mean of scaler =', self.mean_)
        return super().transform(X)


# Experiment A: scaler inside the pipeline, so it is fit on the training folds only
pipeline = Pipeline([('sc', StScaler()), ('estimator', LinearRegression())])

# random_state only has an effect with shuffle=True; recent scikit-learn
# releases reject it when shuffle=False, so it is omitted here.
cv = KFold(n_splits=5)

print(cross_val_score(pipeline, X, y, cv=cv))

# Experiment B: scaler fit on the whole dataset before cross validation
scaler_2 = StandardScaler()
clf_2 = LinearRegression()

X_ss = scaler_2.fit_transform(X)
print(cross_val_score(clf_2, X_ss, y, cv=cv))
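
The same comparison can also be run on synthetic data, which avoids depending on the Boston dataset (load_boston was removed in scikit-learn 1.2). This is only a minimal sketch: the make_regression call and its parameters are stand-in assumptions and not the data used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (an assumption, not the original dataset)
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=0)

# Unshuffled KFold gives the same splits to both experiments
cv = KFold(n_splits=5)

# Experiment A: scaler is refit on the training folds of every split
pipeline = Pipeline([("sc", StandardScaler()), ("estimator", LinearRegression())])
scores_a = cross_val_score(pipeline, X, y, cv=cv)

# Experiment B: scaler is fit once on all rows before cross validation
X_ss = StandardScaler().fit_transform(X)
scores_b = cross_val_score(LinearRegression(), X_ss, y, cv=cv)

print(scores_a)
print(scores_b)
print("max abs difference:", np.max(np.abs(scores_a - scores_b)))
```

The printed difference is a quick way to check whether the two procedures diverge for a given dataset or estimator.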

Thanks!

As it relates to theory and best practice more than coding how-to, this might be a better question for [stats.se] or [datascience.se] — G. Anderson, Feb 05 '20 at 18:27
