
I am training an LSTM to forecast the next value of a timeseries. Let's say I have training data of shape (2345, 95) and a total of 15 files with this data; this means each file has 2345 windows with 50% overlap between them (the timeseries was divided into windows), and each window has 95 timesteps. I use the following model:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

input1 = Input(shape=(95, 1))
lstm1 = LSTM(units=100, return_sequences=False,
             activation="tanh")(input1)
outputs = Dense(1, activation="sigmoid")(lstm1)
model = Model(inputs=input1, outputs=outputs)
model.compile(loss=keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))

I am feeding this data using a generator that passes one whole file at a time, so one epoch has 15 steps. Now my question is: within a given epoch, for a given step, does the LSTM remember the previous window it saw, or is the LSTM's memory reset after each window? And if it does remember previous windows, is the memory reset only at the end of an epoch?
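For context, the generator is along these lines (a simplified sketch, not my exact code; `load_file` is just a placeholder for however each file is read from disk):

import numpy as np

def file_generator(file_paths):
    # Yields one whole file per training step:
    # x has shape (2345, 95, 1), y has shape (2345, 1).
    while True:
        for path in file_paths:
            windows, targets = load_file(path)  # placeholder loader
            x = windows.reshape((-1, 95, 1)).astype("float32")
            y = targets.reshape((-1, 1)).astype("float32")
            yield x, y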

I have seen similar questions, such as TensorFlow: Remember LSTM state for next batch (stateful LSTM) and https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm, but I either did not quite understand the explanation or was unsure whether it covered what I wanted. I'm looking for a more technical explanation of where in the LSTM architecture the memory/hidden state is reset.

EDIT:

  1. So from my understanding there are two concepts we could call "memory" here: the weights, which are updated through BPTT, and the hidden state of the LSTM cell. Within a given window of timesteps the LSTM can remember what the previous timesteps were; this is what the hidden state is for, I think. The weight updates do not directly reflect memory, if I'm understanding this correctly.
  2. The size of the hidden state, in other words how much the LSTM remembers, is determined by the batch size, which in this case is one whole file. But other questions/answers (https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm and https://stackoverflow.com/a/50235563/13469674) state that if we have two windows, for instance [1,2,3] and [4,5,6], the LSTM does not know that 4 comes after 3 because they are in different windows, even though they belong to the same batch. So I'm still unsure how exactly memory is maintained in the LSTM.
  3. It makes some sense that the hidden state is reset between windows when we look at the LSTM cell diagram. But then the weights are only updated after each step, so where does the hidden state come into play? (See the sketch below this list.)
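To make my confusion concrete, this is the kind of experiment I have in mind (only a rough sketch with random data, just to illustrate what I mean):

import numpy as np
import tensorflow as tf

lstm = tf.keras.layers.LSTM(units=100, return_state=True)

window_a = np.random.rand(1, 95, 1).astype("float32")
window_b = np.random.rand(1, 95, 1).astype("float32")

# The state evolves across the 95 timesteps *within* a window...
out_a, h_a, c_a = lstm(window_a)

# ...but a fresh call starts from zero states again, so window_b does not
# "know" that it follows window_a unless the states are passed in explicitly:
out_b_fresh, _, _ = lstm(window_b)
out_b_continued, _, _ = lstm(window_b, initial_state=[h_a, c_a])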
DPM
  • I believe it resets every time a new sequence is passed. So if you have 2345 sequences of 95 elements, then every time you pass a new window of 95 elements the state is reset. You can try passing the `stateful=True` parameter, which maintains the state between batches, but then you'll need to manually reset the state whenever you want it reset. – Sean May 19 '22 at 14:50

1 Answer


What you are describing is called "Back Propagation Through Time" (BPTT); you can search for tutorials that describe the process.

Your concern is justified in one respect and unjustified in another.

The LSTM is capable of learning across multiple training iterations (e.g. across multiple 15-step intervals). This is because the LSTM state is passed forward from one iteration to the next, feeding information forward across multiple training iterations.
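In Keras, that forwarding is something you set up explicitly with the `stateful=True` flag. A rough sketch of your model rewritten that way (the batch size of 1 is just for illustration, so that consecutive windows land in consecutive batches and the final state of one window becomes the initial state of the next):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# Stateful variant: a fixed batch size is required, and the final hidden/cell
# state of each batch becomes the initial state of the next batch.
input1 = Input(batch_size=1, shape=(95, 1))
lstm1 = LSTM(units=100, return_sequences=False, stateful=True,
             activation="tanh")(input1)
outputs = Dense(1, activation="sigmoid")(lstm1)
stateful_model = Model(inputs=input1, outputs=outputs)
stateful_model.compile(loss=keras.losses.BinaryCrossentropy(),
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))

With `stateful=True` the state then persists across batches until you reset it yourself.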

Your concern is justified in that the model's weights are only updated with respect to the 15 steps (and whatever batch size you use). As long as 15 steps is long enough for the model to catch valuable patterns, it will generally learn a good set of weights that generalize well beyond 15 steps. A good example of this is the character-level Shakespeare model described in Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks".

In summary, the model is learning to create a good hidden state for the next step, averaged over sets of 15 steps as you have defined them. It is common for an LSTM to produce a well-generalized solution by looking at data in these limited segments, akin to batch training, but sequentially over time.

I might note that 100 is a more typical upper limit for the number of steps in an LSTM. At ~100 steps you start to see a vanishing gradient problem in which the earlier steps contribute nearly nothing to the gradient.

Note that it is important to ensure you are passing the LSTM state forward from training step to training step over the course of an episode (any contiguous sequence). If this step were missed, the model would certainly suffer.
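With a stateful model like the sketch above, that bookkeeping could look roughly like this (only a sketch; `load_file`, `file_paths` and `num_epochs` are placeholders):

# Rough training loop: the state is carried across the windows of one file,
# in order, and reset to zeros at each file boundary.
for epoch in range(num_epochs):
    for path in file_paths:
        windows, targets = load_file(path)      # placeholder loader
        for x, y in zip(windows, targets):
            x = x.reshape((1, 95, 1))           # one window per batch
            y = y.reshape((1, 1))
            stateful_model.train_on_batch(x, y)
        stateful_model.reset_states()           # new file -> zero state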

David Parks
  • So from my understanding there are two concepts we can call "memory" here: the weights that are updated through BPTT and the hidden state of the LSTM cell. For a given window of timesteps the LSTM can remember what the previous timesteps were; this is what the hidden state is for, I think. Now the weight update does not directly reflect memory, if I'm understanding this correctly. I think this answer https://stackoverflow.com/a/50235563/13469674 slightly contradicts what you are saying. I'm not sure, hence my confusion surrounding the hidden states and the "memory" of the LSTM – DPM May 20 '22 at 08:53
  • The size of the hidden state, in other words how much the LSTM remembers, is determined by the batch size, which in this case is one whole file. But other questions/answers (https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm) state that if we have two windows, for instance [1,2,3] and [4,5,6], the LSTM does not know that 4 comes after 3 because they are in different windows, even though they belong to the same batch. So I'm still unsure how exactly memory is maintained in the LSTM – DPM May 20 '22 at 08:59
  • Don't think of BPTT as having any memory. Each step in the LSTM passes forward two vectors, a hidden state and a cell state (it's fine to conceptually think of them as one). Each step is independently calculated: the step takes an input and a state vector and produces an output and a new state vector. BPTT computes gradients with respect to the step's output, but also with respect to the hidden state it produced (which eventually reaches a loss function in later steps). BPTT is simply an averaging of multiple gradients to update the weights (both w.r.t. time and w.r.t. output). – David Parks May 20 '22 at 18:16
  • Be specific when you say "batch size". LSTMs have a sequence length (a kind of batch size) and a batch size in terms of number of samples (it's common to use a single sample in LSTM training). The size of the hidden state is simply a hyperparameter and is more commonly governed by the amount of training data you have (small dataset, small state). Make it too big and you overfit; make it too small and you underfit. Training batch size (not per-update sequence length) is also governed heavily by dataset size in my experience: large batches for small datasets, small batches for large datasets. – David Parks May 20 '22 at 18:21
  • Yes, the size of the hidden state is a hyperparameter, but that still does not answer my question: an epoch is divided into steps, and each step is a file. Now this file has several windows that are fed into the LSTM. If we have two windows, for instance [1,2,3] and [4,5,6], the LSTM does not know that 4 comes after 3 because they are in different windows, even though they belong to the same file (i.e. training step). If this is the case, then the LSTM's memory only lasts per window, meaning that each window contributes to the gradient and BPTT does the averaging you mention? – DPM May 23 '22 at 14:35
  • It's very important to pass the output hidden state from `[1,2,3]` to the input of `[4,5,6]`. When you start a new sequence the hidden state should be all zeros. – David Parks May 23 '22 at 16:52
  • So the LSTM layer should be stateful then? As in, `return_state` should be set to True in order to pass the hidden state from sequence to sequence – DPM May 24 '22 at 12:09
  • Yes, this is absolutely necessary for long sequences/episodes as you're describing. It's a detail you typically need to manage. – David Parks May 24 '22 at 16:31