Backpropagation through time in stateful RNNs

Question

If I use a stateful RNN in Keras for processing a sequence of length N divided into N parts (each time step is processed individually), how is backpropagation handled? Does it only affect the last time step, or does it backpropagate through the entire sequence?
If it does not propagate through the entire sequence, is there a way to do this?

score 7 · Accepted Answer · answered Nov 16 '16 at 13:13

The backpropagation horizon is limited to the second dimension of the input, i.e. if your data has shape (num_sequences, num_time_steps_per_seq, data_dim), then backprop is done over a time horizon of num_time_steps_per_seq. Take a look at:

https://github.com/fchollet/keras/issues/3669
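
To make the truncation concrete, here is a minimal sketch in plain Python rather than Keras. It assumes a toy scalar recurrence h_t = w * h_{t-1} + x_t with the last state as the quantity being differentiated, and all function names are made up for illustration. The state carried over from earlier batches is treated as a constant, which is the behaviour described above:

```python
def forward(w, xs, h0=0.0):
    """Run h_t = w * h_{t-1} + x_t; return all states, starting with h0."""
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs


def grad_last_state(w, xs, h0=0.0):
    """d h_T / d w over the steps in xs, with h0 treated as a constant."""
    hs = forward(w, xs, h0)
    dh_dw = 0.0  # no gradient flows into the incoming state
    for h_prev in hs[:-1]:
        # Chain rule for h_t = w * h_prev + x_t.
        dh_dw = h_prev + w * dh_dw
    return dh_dw


def truncated_grad(w, xs, horizon):
    """Gradient when only the last `horizon` steps are in the current batch."""
    carried = forward(w, xs[:-horizon])[-1]  # state kept from earlier batches
    return grad_last_state(w, xs[-horizon:], h0=carried)


w, xs = 0.5, [1.0, 1.0, 1.0, 1.0]
full = grad_last_state(w, xs)         # backprop through all 4 steps
one_step = truncated_grad(w, xs, 1)   # num_time_steps_per_seq = 1
two_steps = truncated_grad(w, xs, 2)  # num_time_steps_per_seq = 2
print(full, one_step, two_steps)
```

For this toy setup the three gradients come out as 2.75, 1.75 and 2.5: the shorter the horizon, the more of the dependence on w through earlier steps is ignored, even though the forward state itself is carried over unchanged.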

score 4 · Answer 2 · answered Sep 16 '16 at 12:49

There are a couple of things you need to know about RNNs in Keras. By default, return_sequences=False for all recurrent layers. This means that, by default, only the activations of the RNN after processing the entire input sequence are returned as output. If you want to have the activations at every time step and optimize every time step separately, you need to pass return_sequences=True as a parameter (https://keras.io/layers/recurrent/#recurrent).

The next thing that is important to know is that all a stateful RNN does is remember the last activation. So if you have a large input sequence and break it up into smaller sequences (which I believe you are doing), the activation is retained in the network after processing the first sequence and therefore affects the activations when processing the second sequence. This has nothing to do with how the network is optimized; the network simply minimizes the difference between the output and the targets you give.
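
The forward-pass side of this can be sketched in plain Python (not Keras; a toy scalar tanh recurrence with hypothetical helper names). Carrying the last state between chunks reproduces the activations of the undivided sequence, while resetting it for every chunk does not:

```python
import math


def run_chunk(w, xs, h0):
    """Return the state after every step of h_t = tanh(w * h_{t-1} + x_t)."""
    outputs, h = [], h0
    for x in xs:
        h = math.tanh(w * h + x)
        outputs.append(h)
    return outputs


def run_in_chunks(w, xs, chunk_len, stateful):
    """Process xs chunk by chunk, optionally carrying the last state over."""
    outputs, h = [], 0.0
    for start in range(0, len(xs), chunk_len):
        if not stateful:
            h = 0.0  # a stateless layer starts every chunk from zeros
        chunk_out = run_chunk(w, xs[start:start + chunk_len], h)
        outputs.extend(chunk_out)
        h = chunk_out[-1]  # the only thing a stateful layer remembers
    return outputs


w, xs = 0.9, [1.0, 0.5, -0.3, 0.8]
whole = run_chunk(w, xs, 0.0)
stateful = run_in_chunks(w, xs, chunk_len=2, stateful=True)
stateless = run_in_chunks(w, xs, chunk_len=2, stateful=False)

print(stateful == whole)   # same activations as the undivided sequence
print(stateless == whole)  # differs from the third step onwards
```

Nothing in this sketch touches gradients; it only illustrates what is remembered between chunks.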

So you say that dividing your sequences into N parts and using stateful RNNs will give you the same backpropagation behavior as not dividing the sequences and using a standard RNN? I thought that for backpropagation in RNNs, they have to be unrolled, so I wasn't sure how this plays together with statefulness and sequence division — Alex, Sep 27 '16 at 08:05

JeeyCi · Answer 3 · 2022-05-25T13:50:16.343

To Q1 (how is backpropagation handled?): an RNN is not only fully connected vertically, as in a basic NN, but is also considered deep in time, having horizontal backprop connections within the hidden layer as well.

Suppose batch_input_shape=(num_seq, 1, data_dim). Then "Backprop will be truncated to 1 timestep, as the second dimension is 1. No gradient updates will be performed further back in time than the second dimension's value." - see here

Thus, with more than 1 time step there, the gradient will propagate further back in time, over the number of time steps given in the second dimension of the input shape.

Set return_sequences=True for all recurrent layers except the last one (whose output is either used directly or fed into a Dense layer). True is needed so that each layer passes a full sequence on to the next one (shifted by +1 in a sliding window), so that backprop can build on the weights already estimated.
return_state=True returns the states as well: 2 state tensors for an LSTM [output, state_h, state_c = layers.LSTM(64, return_state=True, name="encoder")(encoder_embedded)] or 1 state tensor for a GRU. These "can be used in the encoder-decoder sequence-to-sequence model, where the encoder final state is used as the initial state of the decoder."

But remember (in any case): stateful training does not allow shuffling and is more time-consuming than stateless training.

P.S.

As you can see here, the state is ordered (c, h) in TF or (h, c) in Keras. Both h and c are elements of the output, so both become important in batched or multi-threaded training.

I've corrected my answer about *encoder-decoder* in the seq2seq model and *return_state=True* according to the [TF manual](https://keras.io/guides/working_with_rnns/#outputs-and-states) — JeeyCi, May 25 '22 at 11:15
Cross-batch statefulness (*stateful=True*) is useful when dividing the whole sequence into batches, which are of course initialized manually and separately. So, to keep the connection between batches and to have the opportunity to **backprop correctly** over the whole sequence divided into batches, set this flag to True as well — JeeyCi, May 25 '22 at 13:10
