
I have always learned that standardization or normalization should be fit only on the training set, and then be used to transform the test set. So what I'd do is:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Now, if I were to use this model on new data, I could just save 'scaler' and load it in any new script.

I'm having trouble, though, understanding how this works for K-fold CV. Is it best practice to re-fit and transform the scaler on every fold? I can see how this works while building the model, but what if I want to use the model later on? Which scaler should I save?

Further, I want to extend this to time-series data. I understand how k-fold works for time series, but again, how do I combine scaling with CV? In this case I would suggest saving the very last scaler, since it would be fit on 4/5 of the data (in the case of k=5) and on the most recent data. Would that be the correct approach?

Sievag
  • In theory, yes, you should retrain the scaler k times for validation. In practice, I've found that it doesn't matter as long as your sample in each fold is large. If you want to use the model later on, you should keep the model and the scaler trained on all available training data. – C8H10N4O2 Oct 15 '20 at 15:13
  • Thanks for your answer. I still have a question, however, about your last sentence: 'you should keep the model and the scaler trained on all available training data'. When re-fitting the scaler for every fold, I essentially create 5 different scalers, none of them fit on all the data, right? So I would not end up with a single scaler fit on all training data. – Sievag Oct 15 '20 at 15:16
  • use cross-validation to *estimate* your model's out-of-sample performance, but when you want to use the model to make out-of-sample predictions, it's advantageous to train it on all available data, not just (k-1)/k of it, and do the same thing with your scaler – C8H10N4O2 Oct 15 '20 at 15:37
  • So if I understand this correctly, I'll first use K-fold (let's say K=5) CV with 5 different instances of the scaler (each fit on 4/5 of the data, then transforming the remaining 1/5) in order to evaluate model performance. Then once I'm satisfied, I'll build the model on the entire set, which has been fit_transformed all at once by the scaler? (Sorry if these are newbie questions, I'm kinda new to this lol) – Sievag Oct 15 '20 at 18:20
  • Yes that's correct – C8H10N4O2 Oct 15 '20 at 18:37
  • Okay thank you, I understand now. One more question I have (My apologies if this is not the correct way of asking questions, I'm new here), is it the same when using a hold-out sample as opposed to CV? I always thought that you train the model on let's say 80%, and then test it on 20%. But essentially you are saying that after training on 80%, I should train again but then on the full 100%, and that model is my final model? – Sievag Oct 16 '20 at 07:25
  • I answered your question below, but this site is really more for programming questions. Questions on best practices are likely to be closed, as I have voted here. You should check out https://stats.stackexchange.com/ – C8H10N4O2 Oct 16 '20 at 10:38

1 Answer


Is it best practice to re-fit and transform the scaler on every fold?

Yes. You might want to read scikit-learn's doc on cross-validation:

Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction.
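To make that concrete, here is a minimal sketch (assuming scikit-learn; the random placeholder data and the LogisticRegression estimator are illustrative, not from the original post). Wrapping the scaler and the estimator in a Pipeline and passing it to cross_val_score means a fresh scaler is fit on the training portion of every fold, and the held-out fold is only transformed:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

# Placeholder data standing in for your own training set.
X_train = np.random.rand(100, 5)
y_train = np.random.randint(0, 2, 100)

# The pipeline is cloned and re-fit (scaler included) on the k-1 training
# folds of each split, so nothing from the held-out fold leaks into scaling.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_train, y_train,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())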

Which scaler should I save?

Save the scaler (and any other preprocessing, i.e. a pipeline) and the predictor trained on all of your training data, not just (k-1)/k of it from cross-validation or 70% from a single split.
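As a sketch of that (assuming joblib for persistence and the pipe, X_train, y_train names from the sketch above), fit one pipeline on all of your training data and persist that single object; loading it later gives you the scaler and the model together. The two cases below still apply.

import joblib

# Re-fit the scaler + model on ALL training data, then save that one object.
pipe.fit(X_train, y_train)
joblib.dump(pipe, "model_pipeline.joblib")

# In a new script: load it and predict; the saved scaler is applied automatically.
# loaded = joblib.load("model_pipeline.joblib")
# predictions = loaded.predict(X_new)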

  • If you're doing a regression model, it's that simple.

  • If your model training requires hyperparameter search using cross-validation (e.g., grid search for xgboost learning parameters), then you have already gathered information from across folds, so you need another test set to estimate true out-of-sample model performance. (Once you have made this estimation, you can retrain yet again on combined train+test data. This final step is not always done for neural networks that are parameterized for a particular sample size.)
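Here is a sketch of that second case (the Ridge model, parameter grid, split sizes, and placeholder data are illustrative assumptions, not part of the original answer). A pipeline inside GridSearchCV keeps the per-fold scaling, and a separate held-out test set gives the out-of-sample estimate. For the time-series part of the question, passing TimeSeriesSplit as cv means each fold fits on earlier data and validates on later data:

import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Placeholder data; shuffle=False keeps the temporal order of the split.
X = np.random.rand(120, 5)
y = np.random.rand(120)
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}

# Each fold fits the scaler and model on past data and validates on future data.
search = GridSearchCV(pipe, param_grid, cv=TimeSeriesSplit(n_splits=5))
search.fit(X_tr, y_tr)

# Estimate on data the search never saw; search.best_estimator_ has already been
# refit on all of X_tr and could now be refit on the full data and saved as above.
print(search.score(X_test, y_test))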

C8H10N4O2