I am training an LSTM to forecast the next value of a time series. Say my training data has shape (2345, 95), and I have a total of 15 files with data of this shape: the time series was divided into windows with 50% overlap, so each file contains 2345 windows of 95 timesteps each.
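Roughly, the windowing looks like this (just a sketch; the exact stride is a guess, I used a step of 47 timesteps so that consecutive windows overlap by about 50%):

import numpy as np

def make_windows(series, window=95, step=47):
    """Slice a 1-D series into overlapping windows (step of 47 ~ 50% overlap)."""
    starts = range(0, len(series) - window + 1, step)
    return np.stack([series[s:s + window] for s in starts])

series = np.random.randn(2344 * 47 + 95)  # stand-in for one file's time series
windows = make_windows(series)
print(windows.shape)                      # (2345, 95)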
If I use the following model:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

input1 = Input(shape=(95, 1))   # one window: 95 timesteps, 1 feature
lstm1 = LSTM(units=100, return_sequences=False,
             activation="tanh")(input1)
outputs = Dense(1, activation="sigmoid")(lstm1)
model = Model(inputs=input1, outputs=outputs)
model.compile(loss=keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))
I am feeding this data with a generator that passes one whole file per step, so one epoch has 15 steps. Now my question is: within a given epoch, at a given step, does the LSTM remember the previous window it saw, or is its memory reset after each window? And if it does remember previous windows, is the memory reset only at the end of an epoch?
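For reference, the generator is essentially doing this (a simplified sketch; the file paths, np.load calls, and targets are placeholders for my actual loading code):

import numpy as np

def file_generator(data_files, target_files):
    """One whole file per step, so with 15 files one epoch has 15 steps."""
    while True:  # loop indefinitely; Keras stops each epoch at steps_per_epoch
        for x_path, y_path in zip(data_files, target_files):
            x = np.load(x_path)[..., np.newaxis]  # (2345, 95) -> (2345, 95, 1)
            y = np.load(y_path)                   # one target per window (placeholder)
            yield x, y

# model.fit(file_generator(data_files, target_files),
#           steps_per_epoch=15, epochs=...)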
I have seen similar questions, such as TensorFlow: Remember LSTM state for next batch (stateful LSTM) and https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm, but I either did not quite understand the explanation or was unsure whether it addressed what I wanted. I'm looking for a more technical explanation of where in the LSTM architecture the whole memory/hidden state is reset.
EDIT:
- From my understanding there are two things we could call "memory" here: the weights, which are updated through BPTT, and the hidden state of the LSTM cell. Within a given window of timesteps the LSTM can remember what the previous timesteps were; I think that is what the hidden state is for. The weight updates, if I'm understanding this correctly, do not directly represent memory.
- The size of the hidden state, in other words how much the LSTM remembers, is determined by the batch size, which in this case is one whole file. But other questions/answers (https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm and https://stackoverflow.com/a/50235563/13469674) state that if we have two windows, for instance [1,2,3] and [4,5,6], the LSTM does not know that 4 comes after 3, because the windows are separate samples, even though they belong to the same batch. So I'm still unsure how exactly memory is maintained in the LSTM (I sketched the kind of check I have in mind after this list).
- Looking at the LSTM cell diagram, it makes some sense that the hidden state is reset between windows. But the weights are only updated after each step, so where does the hidden state come into play?
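To make the question concrete, this is the kind of check I have in mind (just a sketch with made-up numbers; as far as I understand, the default Keras LSTM is not stateful, and stateful=True with reset_states() is what the linked stateful question is about):

import numpy as np
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

# Two consecutive windows of the same series, shaped (samples, timesteps, features)
w1 = np.array([[[1.0], [2.0], [3.0]]])   # window [1, 2, 3]
w2 = np.array([[[4.0], [5.0], [6.0]]])   # window [4, 5, 6]

# Default (stateless) LSTM: every window/sample starts from a zero hidden and cell state
inp = Input(shape=(3, 1))
stateless = Model(inp, LSTM(4)(inp))
alone   = stateless.predict(w2, verbose=0)                        # w2 on its own
batched = stateless.predict(np.concatenate([w1, w2]), verbose=0)  # w1 and w2 in one batch
print(np.allclose(alone[0], batched[1]))  # True: w2's output is unaffected by w1

# stateful=True: the final state of one call becomes the initial state of the next
inp_s = Input(shape=(3, 1), batch_size=1)  # stateful layers need a fixed batch size
lstm_s = LSTM(4, stateful=True)
stateful = Model(inp_s, lstm_s(inp_s))

lstm_s.reset_states()                       # start from a zero state
first = stateful.predict(w2, verbose=0)     # w2 with no history
lstm_s.reset_states()
stateful.predict(w1, verbose=0)             # process w1 first...
carried = stateful.predict(w2, verbose=0)   # ...then w2, starting from w1's final state
print(np.allclose(first, carried))          # False: the state carried over between windows

If that picture is right, then with my model above (no stateful flag) nothing should be remembered between windows or between files, but I would still like to understand where in the architecture that reset happens.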