
Usually, when using a NN, I do the normalization in this form:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_X = scaler.fit_transform(train_X)
test_X = scaler.transform(test_X)

That is, I normalize after the split, so that there are no leaks from the test set into the train set. But I am having doubts about this when using an LSTM.

Imagine that my last sequence in the train set for an LSTM is X = [x6, x7, x8], Y = [x9].

Then, my first sequence in the test set should be X = [x7, x8, x9], Y = [x10].

So, does it make sense to normalize the data after splitting if I end up mixing values from the two sets in the X of the test set? Or should I instead normalize the entire dataset beforehand with

scaler = StandardScaler()
data = scaler.fit_transform( data )

and then do the split?
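To make the setup concrete, here is a minimal sketch of what I mean (the `make_windows` helper, the toy series, and the window size of 3 are just for illustration): normalize after the split, then build the overlapping sequences.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def make_windows(series, window):
    # Build (X, y) pairs from a 1-D series: X = [x_t, ..., x_{t+w-1}], y = x_{t+w}
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

data = np.arange(100, dtype=float)   # toy series x0 .. x99
split = 80
train, test = data[:split], data[split:]

scaler = StandardScaler()
train_s = scaler.fit_transform(train.reshape(-1, 1)).ravel()  # fit on train only
test_s = scaler.transform(test.reshape(-1, 1)).ravel()        # transform test

X_train, y_train = make_windows(train_s, window=3)

# The first test window reuses the last train values (the overlap I am asking about):
context = np.concatenate([train_s[-3:], test_s])
X_test, y_test = make_windows(context, window=3)
```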

  • Does this answer your question? [Thoughts about train\_test\_split for machine learning](https://stackoverflow.com/questions/61243319/thoughts-about-train-test-split-for-machine-learning). Disclaimer: I wrote the current top answer. Short version: do the split as early as possible. – mcskinner Apr 17 '20 at 06:02

1 Answer


The normalization procedure as you show it is the only correct approach for every machine learning problem, and LSTM ones are by no means an exception.

When it comes to similar dilemmas, there is a general rule of thumb that can be useful to clear up the confusion:

During the whole model building process (including all necessary preprocessing), pretend that you have no access at all to any test set before it comes to using this test set to assess your model performance.

In other words, pretend that the test set only arrives after you have deployed your model, when it starts receiving data that is completely new and unseen until then.

So conceptually, it may be helpful to move the third line of your first code snippet here to the end, i.e.:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y)
### FORGET X_test from this point on...

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# further preprocessing, feature selection etc...

# model building & fitting...

model.fit(X_train, y_train)

# X_test just comes in:

X_test = scaler.transform(X_test)
model.predict(X_test)
  • 1
    Thanks! This is indeed a more intuitive way of thinking! – julix Apr 17 '20 at 15:00
  • @desertnaut Thank you! That is very informative! I now have a question. So splitting data into test, train and validation before scaling data allows values over 1 to exist doesn't it? For example, data for training set only contains values 1-10, and scaler is fitted to that. Then if validation and test data contain values over 10 like 15 and even 100, it results in scaled values in those being over 1 I assume? Is that ok? – Masa Feb 07 '23 at 09:53
  • 1
    @Masa you assume correctly. Now, if this is ok or not is another question, and actually not about programming but about ML theory/methodology; please address similar questions to [Data Science SE](https://datascience.stackexchange.com/help/on-topic) - they are actually off-topic here (something that was still unclear in my mind when I answered this one here, which is also off-topic). – desertnaut Feb 08 '23 at 11:23
  • @desertnaut Thank you! As for minmaxscaler, [this video](https://www.youtube.com/watch?v=Vfx1L2jh2Ng) says it is not suitable to time series data, so I would go for standardscaler. – Masa Feb 10 '23 at 04:37
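A quick sketch of the point raised in the comments above, using `MinMaxScaler` and the example values Masa mentions (the exact numbers are just illustrative): a scaler fitted on the train range 1–10 will happily map unseen larger test values to results above 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[1.0], [10.0]])    # train data spans 1..10
test = np.array([[15.0], [100.0]])   # unseen, larger test values

scaler = MinMaxScaler()
scaler.fit(train)                    # fitted on the train range only

scaled_test = scaler.transform(test).ravel()
print(scaled_test)                   # both values exceed 1
```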