
I've seen some experiments that use two different StandardScaler objects, as follows:

scaler_1 = StandardScaler().fit(X_train)
train_sc = scaler_1.transform(X_train)

scaler_2 = StandardScaler().fit(X_test)
test_sc = scaler_2.transform(X_test)

I understand that one shouldn't bias the classifier by mixing train/test data, but I would like to know whether this other scenario is correct or not:

# X_all represents X feature vector before splitting (train + test)
X_scaled = StandardScaler().fit_transform(X_all)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_all)

Besides, I would like to know how this case extends to KFold cross-validation.

desertnaut
heresthebuzz

2 Answers


It is not correct to perform standardization before splitting the data. In general, you should not fit any preprocessing algorithm (PCA, StandardScaler...) on the whole dataset, but only on the training set, and use the fitted algorithm to transform the test set.

Thus, neither of the two approaches you propose is correct. What you should do is:

scaler = StandardScaler().fit(X_train)
train_sc = scaler.transform(X_train)

test_sc = scaler.transform(X_test)

It is easy to understand if you think of it this way: the test set is used to get an estimate of the performance of the model on unseen data. So you should behave as if you didn't have access to the test set while training the algorithm, and this is also valid for cross-validation.

When you fit the standard scaler on the whole dataset, information from the test set is used to normalize the training set. This is a common case of "data leakage", which means that information from the test set is used while training the model. This often results in overestimates of the model's performance.

Note that in scikit-learn you can use Pipelines to chain the preprocessing steps with the estimator, and use the pipeline directly in cross-validation. This ensures that the same steps are repeated for each fold of the cross-validation process.
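For example, a minimal sketch (the toy data and the LogisticRegression estimator are just placeholders, not part of the question):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, only for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inside each CV split, the scaler is (re)fitted on the training folds only
# and then applied to the held-out fold before the estimator scores it.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

This way you never have to split and scale by hand; the pipeline guarantees the scaler never sees the held-out fold while being fitted.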

A Co
  • Hello @A Co, what if you use something like `cross_val_score` or another algorithm where you don't know what will be test or train data (i.e. `KFold`)? – heresthebuzz Jul 28 '20 at 20:10
  • [Scikit-learn's Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) are the tool you need for that. You can read this [answer (and the associated post)](https://stackoverflow.com/a/44447786/7080911) that tackles the exact same issue and give an example of how to use pipelines in that case. – A Co Jul 29 '20 at 07:00
  • @ACo in my notebook I am doing this `std = StandardScaler() std.fit(X.values) X_tr = std.transform(X.values)` **after** the correlation matrix and **before** running the Lasso model. (I use the Lasso model's coef's to select predictors for the simple model). Going by your answer this is not good? If it's not good how can I get standardized regression coef's before running Lasso so I can use the Lasso model's coef's to select predictors? – Edison Jun 24 '22 at 12:10

The best practice is to imagine you have deployed your model and it is being used to make predictions. Suppose a single test case is provided to your model, or a single input arrives after deployment. In this scenario you only have one input, so it makes no sense to use it as fitting data for a Standard Scaler: a scaler fitted on a single instance would produce a different scale for every input it sees. Thus, you'd better fit the Standard Scaler on the training set (i.e. after splitting) and use the fitted scaler to transform the rest of the data (the validation set, the test set, and whatever data comes into your model after deployment).
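A minimal sketch of that workflow (the arrays here are made up purely to illustrate the idea):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and a single "production" input
X_train = np.random.rand(100, 3)
new_sample = np.random.rand(3)

scaler = StandardScaler().fit(X_train)   # fit once, on the training data only
X_train_sc = scaler.transform(X_train)

# At prediction time, reuse the already-fitted scaler on the single instance
new_sample_sc = scaler.transform(new_sample.reshape(1, -1))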

Moreover, in every stage of a machine learning project, you'd better use only the training data for fitting and training whatever you need (e.g. scalers, predictors, regressors, etc.) and leave the validation and test data only for validation and testing.

For the cross-validation case, you'd better fit the scaler and transform your data within each fold of the cross-validation, but it generally doesn't make much difference. You can test it yourself though.
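If you want to do it by hand rather than with a Pipeline, a rough sketch could look like this (the toy data and LogisticRegression are placeholders only):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit the scaler on the training fold only, then transform both folds with it
    scaler = StandardScaler().fit(X[train_idx])
    clf = LogisticRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
    scores.append(clf.score(scaler.transform(X[test_idx]), y[test_idx]))

print(np.mean(scores))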

kaavehh
  • great explanation, but what if you are using builtin `cross_val_score` or another scikit-learn functions where you don't have control over train/test datasets? – heresthebuzz Jul 28 '20 at 20:12
  • 2
    @heresthebuzz, these are the perfect scenarios where we should ideally using scikit-learn pipelines. They handle all these things internally and automatically. – learnToCode Jul 27 '22 at 14:03