I'm creating a video captioning seq2seq model.
My encoder inputs are video features, and my decoder inputs are the captions, beginning with a <start> token and padded with <end> tokens.
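For reference, this is roughly how I build the padded caption ids (a minimal sketch; word_to_id and max_length come from my preprocessing, the names are just illustrative):

def caption_to_ids(caption, word_to_id, max_length):
    # "<start> w1 ... wN <end>", then pad with <end> ids up to max_length
    tokens = ["<start>"] + caption.split() + ["<end>"]
    ids = [word_to_id[t] for t in tokens]
    ids += [word_to_id["<end>"]] * (max_length - len(ids))
    return ids[:max_length]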
Problem: during teacher-forcing training, after a few iterations, the model only outputs <end> tokens, and it stays that way for the rest of the epochs.
My problem is very similar to these Stack Overflow posts:
- Seq2Seq model learns to only output EOS token (<\s>) after a few iterations
- Tensorflow seq2seq chatbot always give the same outputs
However, I'm sure that I'm using the right shapes for computing tf.contrib.seq2seq.sequence_loss.
My inputs also seem correct:
- my ground-truth target captions begin with a <start> token and are padded with <end> tokens
- the predicted captions don't start with <start> tokens
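Concretely, these are the shapes I pass to the loss (a minimal sketch; logits is the tensor returned by my decoder below and target holds the ground-truth caption ids):

# logits:  [batch_size, max_length, vocab_size]
# target:  [batch_size, max_length]
# weights: [batch_size, max_length], all ones for now
weights = tf.ones_like(target, dtype=tf.float32)
loss = tf.contrib.seq2seq.sequence_loss(logits, target, weights)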
I tried to:
- use another loss function (the mean of tf.nn.sparse_softmax_cross_entropy_with_logits; see the sketch after this list)
- keep a single <end> token at the end of each caption but pad with special <pad> tokens instead, so my captions look like:
  <start> This is my caption <end> <pad> <pad> ... <pad>
  This resulted in NaN logits...
- change my embedding method
- use more data: I trained the model with 512 videos and a batch size of 64
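For the alternative loss mentioned above, this is roughly what I computed (a sketch of my attempt, using the same logits/target tensors as before):

# Mean of the per-token cross entropy over the whole [batch, time] grid
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=target, logits=logits)   # shape [batch_size, max_length]
loss = tf.reduce_mean(cross_entropy)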
Here is my simple model:
def decoder(target, hidden_state, encoder_outputs):
    with tf.name_scope("decoder"):
        # Embed the ground-truth caption ids (teacher forcing inputs)
        embeddings = tf.keras.layers.Embedding(vocab_size, embedding_dims, name="embeddings")
        dec_embeddings = embeddings
        decoder_inputs = embeddings(target)

        decoder_gru_cell = tf.nn.rnn_cell.GRUCell(dec_units, name="gru_cell")
        output_layer = tf.layers.Dense(
            vocab_size,
            kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1))

        # Training decoder (teacher forcing)
        with tf.variable_scope("decoder"):
            training_helper = tf.contrib.seq2seq.TrainingHelper(
                decoder_inputs, batch_size * [max_length])
            # decoder_initial_state is defined outside this snippet
            training_decoder = tf.contrib.seq2seq.BasicDecoder(
                decoder_gru_cell, training_helper, decoder_initial_state, output_layer)
            training_decoder_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
                training_decoder, maximum_iterations=max_length)

        # These are the logits, shape [batch_size, max_length, vocab_size]
        return training_decoder_outputs.rnn_output
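To produce the predicted captions shown below, I take the argmax of these logits at each time step and map the ids back to words (a sketch; id_to_word is my reverse vocabulary, the name is just illustrative):

logits = decoder(target, hidden_state, encoder_outputs)
predicted_ids = tf.argmax(logits, axis=-1)   # [batch_size, max_length]

# after sess.run(predicted_ids, ...) gives a numpy array predicted_ids_np:
caption_words = [id_to_word[i] for i in predicted_ids_np[0]]
print(" ".join(caption_words))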
For the following real caption:
<start> a girl and boy flirt then eat food <end> <end> <end> <end>
<end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end>
<end> <end> <end>
Here are the predictions:
Epoch 1:
Predicted caption:
show show show show show show show show show show show show show show
show show show show show show show show show show show show show
Epoch 2:
Predicted caption:
the the the the the the the the the the the the the the the the the the
the the the the the the the the the
...
Epoch 7:
Predicted caption:
<end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end>
<end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end>
<end> <end> <end>
And it stays like epoch 7 for every remaining epoch...
Note that my model seems to optimize the loss correctly!