
I'm creating a video captioning seq2seq model.

My encoder inputs are video features, and my decoder inputs are captions, beginning with a <start> token and padded with <end> tokens.

Problem: during teacher-forcing training, after a few iterations the model outputs only <end> tokens, and keeps doing so for all remaining epochs.

My problem is very similar to some existing Stack Overflow posts.

However, I'm sure that I'm using the right shapes for computing tf.contrib.seq2seq.sequence_loss.

My inputs also seem correct:

  • my ground-truth target captions begin with a <start> token and are padded with <end> tokens
  • the predicted captions don't start with a <start> token

I tried to:

  • use another loss function (mean of tf.nn.sparse_softmax_cross_entropy_with_logits)
  • keep the <end> token at the end of captions but pad with special <pad> tokens, so my captions look like: <start> This is my caption <end> <pad> <pad> ... <pad>. This resulted in NaN logits...
  • change my embeddings method
  • take more data. I trained the model with 512 videos and a batch size of 64
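For context, tf.contrib.seq2seq.sequence_loss is essentially per-timestep cross-entropy averaged under a weights mask that zeroes out padding. A minimal NumPy sketch of that computation (the function and array names here are illustrative, not the TF op itself):

```python
import numpy as np

def masked_sequence_loss(logits, targets, weights):
    """Cross-entropy per timestep, averaged over unmasked positions.

    logits:  (batch, time, vocab) unnormalized scores
    targets: (batch, time) integer token ids
    weights: (batch, time) 1.0 for real tokens, 0.0 for padding
    """
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick the log-probability assigned to each target token
    b, t = np.meshgrid(np.arange(targets.shape[0]),
                       np.arange(targets.shape[1]), indexing="ij")
    nll = -log_probs[b, t, targets]
    # padded positions contribute nothing to the loss
    return (nll * weights).sum() / weights.sum()

# toy example: batch of 1, 3 timesteps, vocab of 4, last step is padding
logits = np.array([[[2.0, 0.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0, 0.0],
                    [0.0, 0.0, 2.0, 0.0]]])
targets = np.array([[0, 1, 3]])
weights = np.array([[1.0, 1.0, 0.0]])  # mask out the padded step
loss = masked_sequence_loss(logits, targets, weights)
```

The key point is that the weights argument must be zero at padding positions; if the mask (or the sequence lengths behind it) covers the padded tail, the model is trained to emit the padding token.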

Here is my simple model:

def decoder(target, hidden_state, encoder_outputs):
  with tf.name_scope("decoder"):
    embeddings = tf.keras.layers.Embedding(vocab_size, embedding_dims, name="embeddings")
    decoder_inputs = embeddings(target)

    decoder_gru_cell = tf.nn.rnn_cell.GRUCell(dec_units, name="gru_cell")

    output_layer = tf.layers.Dense(vocab_size, kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1))

    # Training decoder
    with tf.variable_scope("decoder"):
      training_helper = tf.contrib.seq2seq.TrainingHelper(decoder_inputs, batch_size*[max_length])
      training_decoder = tf.contrib.seq2seq.BasicDecoder(decoder_gru_cell, training_helper, decoder_initial_state, output_layer)
      training_decoder_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder, maximum_iterations=max_length)

  # These are the logits
  return training_decoder_outputs.rnn_output

Here is an example for one caption.

Real caption:
<start> a girl and boy flirt then eat food <end> <end> <end> <end> 
<end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> 
<end> <end> <end> 

Here are the predictions:

Epoch 1:

Predicted caption:
show show show show show show show show show show show show show show     
show show show show show show show show show show show show show 

Epoch 2:

Predicted caption:
the the the the the the the the the the the the the the the the the the 
the the the the the the the the the 

...

Epoch 7:

Predicted caption:
<end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> 
<end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> <end> 
<end> <end> <end> 

And it stays like epoch 7 for all subsequent epochs...

Note that my model still seems to optimize the loss correctly!

wakobu
  • I'm facing the same problem; did you solve it? – luvwinnie Jun 08 '20 at 00:20
  • Hi, I solved the problem but unfortunately, I totally forgot what the mistake was (I wrote this post almost a year ago). I remember it was a pretty silly mistake. Sorry, I should have answered my own post at the time. I published the code on GitHub; you can compare it with this chunk of code. Please keep me in touch. https://github.com/nbusser/VideoCaptioning/blob/master/seq2seq_model.py – wakobu Jun 08 '20 at 00:38
  • Thank you for sharing the code. I know you may have forgotten the mistake, but do you vaguely remember whether it was related to the input/output tokens, the loss function, or something else? I have no idea how to fix this right now ^ ^; – luvwinnie Jun 08 '20 at 00:49
  • There is only one difference between the two pieces of code: in the full code, I gave a vector of real sequence lengths to the TrainingHelper (line 163) instead of a vector of constant values (batch_size*[max_length]). – wakobu Jun 08 '20 at 00:59
  • Ok, I found some old paper notes I wrote last year. They refer to a problem resulting in NaN values and <end> predictions, solved in 2 days. I'm pretty sure it refers to this specific problem. The notes say I solved it with "format caption, pad". I double-checked the code: try to add <start> and <end> tokens to your vocabulary. My inputs were in this format: "<start> A B C <end>", padded with <pad> tokens up to the max-sized caption (tokenization_handle.py line 49). So, the longest caption should contain no <pad> token. – wakobu Jun 08 '20 at 01:12
  • Now it's clear in my head: you should combine my two last comments and it should work. So, format your sentences with <pad> tokens but do not include them in the length. For example, the "<start> A B C <end> <pad> <pad> <pad>" sequence has a length of 5 (not 8). Then, feed your TrainingHelper with the correct length for each input. The problem here was that I fed the trainer with an insanely big quantity of "<end> <end> ..." token sequences, so it was only predicting <end> tokens! Is it clear? – wakobu Jun 08 '20 at 01:33
  • Thank you for answering the question! My input is almost the same as yours, which is "<start> A B C <end>". However, I'm trying to use tf.keras to implement the seq2seq; just wondering, did you ever try to implement it with tf.keras? Firstly, I included the <start> and <end> tokens in my vocab, and it solved the NaN problem. I didn't use the TrainingHelper, which was in TensorFlow v1 and has been migrated to TensorFlow Addons; what exactly is the use of the training helper? This is my code for the training step. https://colab.research.google.com/drive/1EZPFUiZlnwfC-g9PWdAeEJqLd-GIDsUT?usp=sharing – luvwinnie Jun 08 '20 at 01:57
  • It seems like there is a new seq2seq example in TensorFlow which uses the TrainingSampler (the v1 TrainingHelper)! I will check this out and try to implement it with the new API! Maybe we can share some code, one version for TensorFlow v1 and one for v2, once I solve this problem! But at least I finally know where the problem is, thank you so much! https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt – luvwinnie Jun 08 '20 at 02:15
  • You're welcome. Keep me in touch. About your previous question: if I remember well, TrainingHelper is used to feed the decoder with ground-truth captions. It is easier to use because it can simply handle variable-length sequences. – wakobu Jun 08 '20 at 02:47
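The fix described in the last comments (pad with <pad> but pass each caption's real, pre-padding length to the TrainingHelper instead of batch_size*[max_length]) can be sketched like this; the token ids and helper function here are made up for illustration:

```python
import numpy as np

PAD_ID = 0  # assumed id of the <pad> token in the vocabulary

def caption_lengths(padded_ids, pad_id=PAD_ID):
    """Length of each caption up to (and including) <end>, excluding padding.

    padded_ids: (batch, max_length) int array, e.g. <start> A B C <end> <pad> ...
    returns:    (batch,) int array of true sequence lengths
    """
    return (np.asarray(padded_ids) != pad_id).sum(axis=1)

# "<start> A B C <end> <pad> <pad> <pad>" -> length 5, not 8
batch = np.array([
    [1, 4, 5, 6, 2, 0, 0, 0],   # <start>=1, <end>=2, <pad>=0
    [1, 7, 2, 0, 0, 0, 0, 0],
])
lengths = caption_lengths(batch)
```

These per-example lengths are what would be passed as the second argument of tf.contrib.seq2seq.TrainingHelper(decoder_inputs, lengths), so the decoder is never trained on the padded tail of each caption.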

0 Answers