
I read that the internal state of LSTMs flows as follows:

  • it is always passed within a batch, so from the last timestep of the i-th sample to the first timestep of the (i+1)-st
  • if the LSTM is stateful, then the state is also passed between batches, so the memory at the last timestep of batch_k[i] is passed to the first timestep of batch_{k+1}[i], for all indices i.

For me, this raises several questions. (Please correct me if my understanding is wrong)

  1. Does this mean that the first timestep of the (i+1)-st sample needs to be the successor of the last timestep of sample i? (for all i)
  2. Along the same lines, does the first timestep of the i-th sample in batch k+1 have to be the successor of the last timestep of the i-th sample in batch k?
  3. If the first two conclusions are correct, then for stateful LSTMs we can NEVER shuffle anything, and for the non-stateful ones we can at most shuffle the batches, but not the samples within batches, correct?
  4. Why do we split the batch into samples of more than one timestep, anyway? If the above is correct, then the procedure 'within a sample' is the same as 'within a batch', so we might as well use samples of one timestep each.
karu
  • There is also contradicting information out there as to whether the state in non-stateful LSTMs is passed within a batch: https://stackoverflow.com/questions/45623480/stateful-lstm-when-to-reset-states?rq=1 claims it does not, while https://stackoverflow.com/questions/41695117/understanding-stateful-lstm claims it does. PLEASE HELP!! – karu Mar 14 '18 at 12:12

1 Answer


Question 1

Not true. Sample s is not related to sample s+1 in the same batch. They're independent.

This means that, with stateful=True, sample s of batch b+1 needs to be the successor of sample s of batch b.

The samples will be processed in parallel, and batches must keep the same order. (That's why the documentation says you need shuffle=False when training stateful=True layers).
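As a rough numeric sketch of this (a plain accumulator stands in for the LSTM cell; this is not Keras internals), each sample index owns its own state slot, and batch b+1 continues from where batch b left off at the same index:

```python
import numpy as np

# Toy sketch of stateful=True processing. Two batches, each with 2 samples of
# 3 timesteps. Sample i of batch 2 continues from the final state of sample i
# of batch 1 -- the samples within a batch never mix.

batch_size, timesteps = 2, 3
batch_1 = np.ones((batch_size, timesteps))        # each sample: [1, 1, 1]
batch_2 = np.ones((batch_size, timesteps)) * 2    # each sample: [2, 2, 2]

state = np.zeros(batch_size)                      # one state slot per sample index

def run_batch(batch, state):
    """Process each timestep; every sample updates only its own state slot."""
    for t in range(batch.shape[1]):
        state = state + batch[:, t]               # stand-in for the LSTM update
    return state

state = run_batch(batch_1, state)                 # state is now [3., 3.]
state = run_batch(batch_2, state)                 # continues: [3.+6., 3.+6.]
print(state)                                      # [9. 9.]
```

If the batches were shuffled, the wrong state would be carried into the wrong continuation, which is why the order must be preserved.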

Question 2

This is true :)

Question 3

Partially correct. With stateful=True we cannot shuffle the batches (if there is more than one batch).

But with stateful=False this really doesn't matter, because none of the samples will be related to each other. (Each sample in the batch is completely independent)
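Continuing the same toy accumulator sketch (again, an assumption standing in for a real LSTM cell), stateful=False simply means every batch starts from a fresh zero state, so batch order is irrelevant:

```python
import numpy as np

# Sketch of stateful=False behaviour: the state is reset to zero for every
# batch, so nothing carries over and shuffling the batches is harmless.

def run_batch(batch):
    state = np.zeros(batch.shape[0])   # fresh zero state for every batch
    for t in range(batch.shape[1]):
        state = state + batch[:, t]    # stand-in for the LSTM update
    return state

batch_a = np.ones((2, 3))              # per-sample result: [3., 3.]
batch_b = np.ones((2, 3)) * 2          # per-sample result: [6., 6.]

# Either processing order gives the same per-batch results.
print(run_batch(batch_a), run_batch(batch_b))
```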

Question 4

Since "samples" in a batch are independent from each other, there is one main reason to have many samples in a batch:

  • You have many independent sequences instead of just one sequence

But you may want to divide each sequence into many batches along its length/timesteps. You would do this if:

  • Your sequences are way too long to fit in memory, so you load and process them in parts
  • Your model is predicting the future indefinitely, and it needs to predict step t(n+1) and pass it back as an input before it can produce step t(n+2).

So, you can indeed use samples of one timestep each in stateful=True layers.
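The closed-loop case above can be sketched like this (a linear toy function stands in for the trained stateful model; the real loop would call the model's predict step instead):

```python
import numpy as np

# Closed-loop prediction sketch: the prediction for step t(n+1) is fed back
# as the input for predicting t(n+2), one timestep per call.

def toy_model(x):
    return 0.5 * x + 1.0          # stand-in for one-timestep model prediction

x = np.array([4.0])               # last known value of the sequence
predictions = []
for _ in range(3):                # predict 3 steps into the future
    x = toy_model(x)              # each output becomes the next input
    predictions.append(float(x[0]))

print(predictions)                # [3.0, 2.5, 2.25]
```

With a real stateful=True layer, the internal state would carry the sequence history between these one-timestep calls, which is exactly why one-timestep samples work there.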


Daniel Möller