I'm currently using cross_val_score and KFold to assess the impact of using StandardScaler at different points within data pre-processing, specifically whether scaling the entire training dataset prior to performing cross validation introduces data leakage and what the effect of this is when compared to scaling the data from within a Pipeline (and therefore only applying it to the training folds).
my current process is as follows:
Experiment A
- Import the boston housing dataset from sklearn.datasets and split into Data (X) and target (y)
- create a Pipeline (sklearn.pipeline), that applies StandardScaler before applying linear regression
- Specify the cross validation method as KFold with 5 folds
- Perform cross validation (cross_val_score) using the above Pipeline and KFold method and observe the score
Experiment B
- Use the same boston housing data as above
- fit_transform StandardScaler on the entire dataset
- Use cross_val_Score to perform cross validation on again 5 folds but this time input LinearRegression directly rather than a pipeline
- Compare the scores here to Experiment A
The scores obtained are identical (to around 13 decimal places) which I question as surely Experiment B introduces Data Leakage during cross validation.
I've seen posts stating that it doesnt matter whether scaling is done on the entire training set before cross validation, if this is true I'm looking to understand why, if this isn't true I'd like to understand why the scores can still be so similar despite the data leakage?
See my test code below:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.linear_model import LinearRegression
np.set_printoptions(15)
boston = datasets.load_boston()
X = boston["data"]
y = boston["target"]
scalar = StandardScaler()
clf = LinearRegression()
class StScaler(StandardScaler):
def fit_transform(self,X,y=None):
print('Length of Data on which scaler is fit on =', len(X))
output = super().fit(X,y)
# print('mean of scalar =',output.mean_)
output = super().transform(X)
return output
pipeline = Pipeline([('sc', StScaler()), ('estimator', clf)])
cv = KFold(n_splits=5, random_state=42)
cross_val_score(pipeline, X, y, cv = cv)
# Now fitting Scaler on whole train data
scaler_2 = StandardScaler()
clf_2 = LinearRegression()
X_ss = scaler_2.fit_transform(X)
cross_val_score(clf_2, X_ss, y, cv=cv)
Thanks!