
I am trying to build a seq2seq model using LSTMs in Keras, currently on a 10k-pair subset of an English-to-French dataset (the original dataset has 147k pairs). After training completes, the model predicts the same output regardless of the input sequence. I am also using a separate embedding for the encoder and the decoder. What I observe is that the predicted words are simply the most frequent words in the dataset, emitted in decreasing order of their frequency. For example, for the inputs 'I know you', 'Can we go ?', and 'snap out of it', the output is 'je suis en train' in all three cases.

Can anyone help me understand why the model is behaving like this? Am I missing something basic?

I tried the following with batch size 32, 50 epochs, max input/output length 8, and embedding size 100.

    from keras.layers import Input, LSTM, Dense
    from keras.models import Model
    from keras.callbacks import ModelCheckpoint
    from sklearn.model_selection import train_test_split

    # Encoder: consumes pre-computed GloVe word vectors and returns its final
    # hidden and cell states, which will seed the decoder.
    encoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='encoder_inputs')
    encoder_lstm1 = LSTM(units=HIDDEN_UNITS, return_state=True, name='encoder_lstm1',
                         stateful=False, dropout=0.2)
    encoder_outputs, encoder_state_h, encoder_state_c = encoder_lstm1(encoder_inputs)

    encoder_states = [encoder_state_h, encoder_state_c]

    # Decoder: also fed embedding vectors, initialised with the encoder's final
    # states; emits a softmax over the target vocabulary at every timestep.
    decoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='decoder_inputs')
    decoder_lstm = LSTM(units=HIDDEN_UNITS, return_sequences=True, return_state=True,
                        stateful=False, name='decoder_lstm', dropout=0.2)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = Dense(self.num_decoder_tokens, activation='softmax', name='decoder_dense')
    decoder_outputs = decoder_dense(decoder_outputs)

    self.model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    print(self.model.summary())

    self.model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
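
    # Note: categorical_crossentropy expects one-hot targets, so generate_batch
    # below must one-hot encode each target timestep over num_decoder_tokens
    # (sparse_categorical_crossentropy would accept integer indices instead).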

    Xtrain, Xtest, Ytrain, Ytest = train_test_split(input_texts_word2em, self.target_texts, test_size=0.2, random_state=42)


    train_gen = generate_batch(Xtrain, Ytrain, self)
    test_gen = generate_batch(Xtest, Ytest, self)

    train_num_batches = len(Xtrain) // BATCH_SIZE
    test_num_batches = len(Xtest) // BATCH_SIZE
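
    # Hypothetical checkpoint callback matching the commented-out reference in
    # the fit_generator call below (filepath and monitor are assumptions):
    checkpoint = ModelCheckpoint(filepath='seq2seq-weights.h5',
                                 monitor='val_loss', save_best_only=True)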

    self.model.fit_generator(generator=train_gen, steps_per_epoch=train_num_batches,
                epochs=NUM_EPOCHS,
                verbose=1, validation_data=test_gen, validation_steps=test_num_batches ) #, callbacks=[checkpoint])        

    # Inference-time models: the encoder maps an input sequence to its states;
    # the decoder is re-wired with explicit state inputs so it can be stepped
    # one token at a time during prediction.
    self.encoder_model = Model(encoder_inputs, encoder_states)


    decoder_state_inputs = [Input(shape=(HIDDEN_UNITS,)), Input(shape=(HIDDEN_UNITS,))]
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_state_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    self.decoder_model = Model([decoder_inputs] + decoder_state_inputs, [decoder_outputs] + decoder_states)
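
For context, my prediction step is the standard greedy decoding loop over these two models, roughly like the sketch below (word2em, target_idx2word, the '<start>'/'<end>' tokens, and MAX_OUT_LEN are placeholder names, not my exact code):

    import numpy as np

    def decode_sequence(self, input_seq):
        # Encode the source sentence into the decoder's initial states.
        states = self.encoder_model.predict(input_seq)
        # Seed the decoder with the start-of-sequence token's embedding.
        target_emb = np.zeros((1, 1, GLOVE_EMBEDDING_SIZE))
        target_emb[0, 0, :] = word2em['<start>']
        decoded_words = []
        for _ in range(MAX_OUT_LEN):
            output_tokens, h, c = self.decoder_model.predict([target_emb] + states)
            sampled_idx = np.argmax(output_tokens[0, -1, :])   # greedy pick
            sampled_word = target_idx2word[sampled_idx]
            if sampled_word == '<end>':
                break
            decoded_words.append(sampled_word)
            # Feed the sampled word's embedding and the updated states back in;
            # forgetting either of these steps makes every input decode the same.
            target_emb = np.zeros((1, 1, GLOVE_EMBEDDING_SIZE))
            target_emb[0, 0, :] = word2em[sampled_word]
            states = [h, c]
        return ' '.join(decoded_words)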

Update: I have run on the full 147k dataset with 5 epochs and the results now vary for every input. Thanks for the help.

However, I am now running the same model on another dataset whose input and output sequences average about 170 and 100 words respectively after cleaning (removing stopwords and so on). This dataset has around 30k records, and even with 50 epochs the results are again identical for every test sentence. What are my next options to try? I was expecting at least different outputs for different inputs (even if wrong); getting the same output every time makes me doubt whether the model is learning at all. Any answers?

Sunil
  • What is the value of HIDDEN_UNITS? 10k pairs is a very small dataset, so HIDDEN_UNITS should also be set very small. If it is large, the network will end up overfitting. – James Dong Apr 04 '18 at 00:42
  • Hidden units is set to 100 (equal to the embedding size). How small should it be for a 10k dataset? Is there any relationship between embedding size and hidden units? I have read somewhere that LSTM hidden units should equal the size of the input (embedding size). – Sunil Apr 04 '18 at 04:49
  • It is not necessary that the LSTM size equal the embedding size. For small datasets, you can try setting the vocabulary size to 5000 and both the embedding size and LSTM size to 64. Though you may not get a decent result, it really works. – James Dong Apr 04 '18 at 10:03
  • 1
    @Sunil Did you ever figure it out? – shahensha Mar 10 '20 at 19:13

1 Answer


An LSTM-based encoder-decoder (seq2seq) that is correctly set up may produce the same output for any input when the net has not been trained for enough epochs. I can reliably reproduce the "same stupid output no matter what the input" result simply by reducing my number of epochs from 200 to 30.

One thing that may be confusing is that default accuracy measures may not seem to improve much, as you go from 30 to 150 epochs. However, in cases where you are using Seq2Seq for chatbot or translation tasks, something like the BLEU score is more relevant. Even more relevant is your own evaluation of the 'realism' of the responses. These evaluations are done at inference time, not during training - but they can influence your evaluation of whether or not the model has trained sufficiently.
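
As a concrete illustration, a sentence-level BLEU check at inference time can be as short as the sketch below (using NLTK; the reference and hypothesis tokens here are made up):

    # Sentence-level BLEU via NLTK; smoothing avoids zero scores on short outputs.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = [['je', 'suis', 'fatigue']]       # tokenized gold translation(s)
    hypothesis = ['je', 'suis', 'en', 'train']    # tokenized model output
    smooth = SmoothingFunction().method1
    print(sentence_bleu(reference, hypothesis, smoothing_function=smooth))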

Another thing to consider is that, in my case, the training data was movie dialogue from the Cornell Movie-Dialogs corpus. Chatbot dialogue is supposed to be helpful; movie dialogue has no such mandate - it's just supposed to be interesting. So we do have a mismatch here between training data and use.

In any event, here are examples of the exact same net trained after 30 vs. 150 epochs, responding to the same inputs.

30 epochs

[screenshot: sample responses after 30 epochs]

150 epochs

[screenshot: sample responses after 150 epochs]

In this case, the default keras.Model.fit() accuracy went from .2118 (after 30 epochs) to .2243 (after 150), but the inputs are now clearly being differentiated. If this were a classifier and we were only looking at training accuracy (without looking at sample inferences), we might reasonably assume that all those additional training epochs were pointless.

But think of it this way: evaluating a model's ability to classify a picture of a bird against labeled data is quite different from evaluating its ability to encapsulate the idea of a sentence and respond appropriately with a sequence of characters that forms a coherent thought. That is why the default training metrics aren't as useful here.

Another thing we might suspect when we see an oscillating accuracy is that the learning rate is too high or not sufficiently adaptive. I was using rmsprop - maybe Adam would address this problem? But after 30 epochs with Adam, accuracy was .2085 and starting to oscillate.
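
The optimizer swap itself is a one-line change, sketched below with Keras's default learning rate spelled out (the metrics argument is only there so fit() reports an accuracy to watch):

    from keras.optimizers import Adam

    # Same model, Adam instead of rmsprop; 0.001 is just the Keras default.
    model.compile(optimizer=Adam(lr=0.001), loss='categorical_crossentropy',
                  metrics=['accuracy'])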

At this point, looking at my results, it's clear that training on movie dialogue is just going to produce movie-dialogue-ish text, which isn't inherently helpful and isn't easy to assess in terms of 'accuracy'. You would expect movie dialogue to have a 'spark of life': originality, the unexpected, a difference in tone between characters, and so on. So if I want a better chatbot, I need more applicable training data.

James_SO