
I do not really understand the apparently different (or are they actually the same?) training procedures for training an LSTM encoder-decoder.

On the one hand, the TensorFlow tutorial uses a for loop for training: https://www.tensorflow.org/tutorials/text/nmt_with_attention#training

but on the other hand, this Keras blog post https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

(the first model) just uses a single compile/fit call:

# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
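
(For context, the model being fit here is roughly the following, paraphrased from the blog post; the token counts and latent_dim below are only placeholder values:)

from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Model

# Placeholder sizes; the blog derives these from its dataset.
num_encoder_tokens, num_decoder_tokens, latent_dim = 70, 90, 256

# Encoder: only its final states are kept.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: fed the ground-truth target sequence (teacher forcing), initialized
# with the encoder states, and trained to predict that sequence shifted by one step.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)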

In both cases, the procedure is described as training via teacher forcing.

But I cannot understand why both ways are the same.

Why can I train an encoder-decoder without a for loop, like a normal model, even though the next decoding step needs the previous decoding step during training?

ctiid

1 Answer


In an LSTM, the output at a time step depends only on the state and the previous time steps. In the second link (the Keras blog), during training the decoder's final state is not used, only the per-step output vectors; during inference the state is carried over from one iteration to the next.

The following answer explains the concept of time steps in an LSTM: "What exactly is timestep in an LSTM Model?"

This is a useful picture for the sake of discussion: [image: the standard unrolled-RNN diagram, with inputs x0..xN feeding a chain of repeated 'A' cells that produce per-step outputs h0..hN and pass a state arrow to the right].

To reconcile with the LSTM Keras API:

  • When one specifies return_sequences=True, Keras returns the per-time-step vectors h0..hN above;
  • When one specifies return_state=True, the last state is also returned (the right arrow out of the right-most A block); see the sketch just after this list.
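
A minimal sketch of what those two flags return (the batch size, sequence length, and feature/unit sizes below are made up purely for illustration):

import tensorflow as tf

# Made-up shapes: batch of 2, 5 time steps, 8 input features, 16 units.
x = tf.random.normal((2, 5, 8))

lstm = tf.keras.layers.LSTM(16, return_sequences=True, return_state=True)
seq, state_h, state_c = lstm(x)

print(seq.shape)      # (2, 5, 16) -> the per-step outputs h0..hN
print(state_h.shape)  # (2, 16)    -> the last hidden state (equals seq[:, -1, :])
print(state_c.shape)  # (2, 16)    -> the last cell state (the right arrow out of the last A block)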

In this image, the output of step N depends only on [x0, ..., xN].

When you have a model, as defined in your link, whose loss depends only on the h values in the picture above, then the losses/gradients work out to the same math whether you compute them in one shot or in a loop.
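
Here is a minimal sketch of that equivalence (this is not the code from either of your links; the shapes and layer size are arbitrary). Because every step's input is a ground-truth value rather than the model's own previous prediction, unrolling the LSTM over the whole teacher-forced sequence in one shot produces exactly the same per-step outputs as feeding the same inputs one step at a time through the cell while carrying the state forward:

import tensorflow as tf

tf.random.set_seed(0)
x = tf.random.normal((1, 4, 3))               # (batch, time, features), arbitrary shapes

cell = tf.keras.layers.LSTMCell(5)
layer = tf.keras.layers.RNN(cell, return_sequences=True)

# One shot: the layer unrolls the cell over the whole sequence internally.
h_one_shot = layer(x)

# Explicit loop: feed one time step at a time and carry the state forward.
state = [tf.zeros((1, 5)), tf.zeros((1, 5))]  # initial [h, c]
outputs = []
for t in range(4):
    out, state = cell(x[:, t, :], state)
    outputs.append(out)
h_loop = tf.stack(outputs, axis=1)

print(float(tf.reduce_max(tf.abs(h_one_shot - h_loop))))  # ~0: the same computation

Since the losses are computed from the same per-step outputs, the gradients are the same as well.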

This would not hold if the final LSTM state were used (the side arrow from the right-most A block in the picture).

From the Keras LSTM API documentation:

return_state: Boolean. Whether to return the last state in addition to the output. Default: False.

The relevant comment in the code:

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the 
# return states in the training model, but we will use them in inference.

You can try to look at a sequence of length 2. If you calculate the gradients of the predictions for time steps 0 and 1 in one shot, then as far as the LSTM is concerned, the gradient for h0 (the output of time step 0) depends only on the corresponding input, while the gradient for h1 (the output of time step 1) depends on x0 and x1 and the transformations through the LSTM. If you calculate the gradient time step by time step instead, you end up with exactly the same calculation.
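
A small sketch of that argument (an LSTM with a plain squared-error loss; all sizes are made up, and this is not the model from the question):

import tensorflow as tf

tf.random.set_seed(1)
lstm = tf.keras.layers.LSTM(4, return_sequences=True)
x = tf.random.normal((1, 2, 3))   # a sequence of length 2
y = tf.random.normal((1, 2, 4))   # per-step targets for h0 and h1
_ = lstm(x)                       # build the layer once so its variables exist

# One shot: a single loss over both time steps.
with tf.GradientTape() as tape:
    h = lstm(x)                   # h[:, 0] = h0, h[:, 1] = h1
    loss = tf.reduce_sum((h - y) ** 2)
grads_one_shot = tape.gradient(loss, lstm.trainable_variables)

# Step by step: the per-step losses added up explicitly.
with tf.GradientTape() as tape:
    h = lstm(x)
    loss0 = tf.reduce_sum((h[:, 0] - y[:, 0]) ** 2)  # depends only on x0
    loss1 = tf.reduce_sum((h[:, 1] - y[:, 1]) ** 2)  # depends on x0, x1 and the LSTM transforms
    loss = loss0 + loss1
grads_stepwise = tape.gradient(loss, lstm.trainable_variables)

for a, b in zip(grads_one_shot, grads_stepwise):
    print(float(tf.reduce_max(tf.abs(a - b))))       # ~0 for every trainable variable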

If you look at transformer models, you will see that they use a mask over the sequence to ensure that step N only depends on steps up to N.
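
For reference, a minimal sketch of such a causal mask (built with tf.linalg.band_part; it is not taken from any particular transformer implementation). Row n has ones only in columns 0..n, so position n can attend only to positions up to and including n:

import tensorflow as tf

seq_len = 4
# Lower-triangular (causal) attention mask.
causal_mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
print(causal_mask.numpy())
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]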

Pedro Marques
  • Sorry, but it is difficult to understand, as you use non-standard words, and especially for RNNs the words are not exact: hidden vector, cell states. I don't know what you mean by per-step vector... – ctiid Dec 17 '20 at 09:57
  • Why (and how do you know that) is the "final" state (whatever you mean by that) not used by the model? And why does it only depend on the h values? I assume your per-step h's are the hidden states? – ctiid Dec 17 '20 at 13:57
  • @ctiid as per the answer: the h0..hN outputs are what return_sequences refers to in the Keras API; the right-most right arrow out of the 'A' block is what the Keras API calls the last or final state. – Pedro Marques Dec 17 '20 at 14:28
  • Is the right-most right arrow the arrow before the last A cell? And is the last state the last hidden state (i.e., what return_sequences gives as the output of the last cell)? – ctiid Dec 17 '20 at 14:38
  • Each A cell in the diagram takes an input (xN), produces an output (hN) in the diagram, and a state (right arrow). The `last state`, as per the Keras API, is the right arrow from the right-most (end-of-sequence) A cell. In the diagram this arrow is not shown. – Pedro Marques Dec 17 '20 at 14:42
  • The wording is somewhat different. I just wonder what the "internal state" is in the blog's sentence: "it processes the input sequence and returns its own internal state". – ctiid Dec 17 '20 at 15:29
  • The internal state is the right-pointing arrows. But actually the model uses the last output of the LSTM plus the internal state: `encoder_states = [state_h, state_c]`. This corresponds in the picture above to hN plus the right output of the last 'A' block. – Pedro Marques Dec 17 '20 at 15:47
  • Sorry, but I am not sure this answers the issue. In the blog's comment "when not using teacher forcing", I think it is the same training method as on the TF website. For me it is still not clear why training with a single fit statement or within a for loop is the "same". –  Dec 20 '20 at 18:46
  • @PedroMarques Thanks. So it should be possible to rewrite the first model from the blog (with the fit statement) as the same model but trained via a for loop? When I use a for loop, do I then need to use single LSTMCells instead of the layer, like here: https://www.tensorflow.org/tutorials/structured_data/time_series#advanced_autoregressive_model? –  Dec 25 '20 at 19:43