4

My task was to translate English sentences to German. I first did this with a plain encoder-decoder network and got fairly good results. Then I tried to solve the same task with the exact same model, but with Bahdanau attention added. And the model without attention outperformed the one with attention.
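
For reference, my attention layer follows the standard additive (Bahdanau) formulation; here is a simplified sketch of it (names and shapes here are illustrative, not my exact code, which is linked in the comments below):

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score(s, h) = v^T tanh(W1 h + W2 s)."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the encoder outputs
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder state
        self.V = tf.keras.layers.Dense(1)       # collapses each score to a scalar

    def call(self, query, values):
        # query: decoder hidden state, shape (batch, dec_units)
        # values: encoder outputs, shape (batch, src_len, enc_units)
        query = tf.expand_dims(query, 1)  # (batch, 1, dec_units)
        scores = self.V(tf.nn.tanh(self.W1(values) + self.W2(query)))  # (batch, src_len, 1)
        weights = tf.nn.softmax(scores, axis=1)            # weights over source positions
        context = tf.reduce_sum(weights * values, axis=1)  # (batch, enc_units)
        return context, weights
```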

The model's loss without attention went from approximately 8.0 to 1.4 in 5 epochs and reached 1.0 in 10 epochs, and it was still decreasing, though at a slower rate.

The model's loss with attention went from approximately 8.0 to 2.6 in 5 epochs and was not learning much after that.

Neither model was overfitting, as the validation loss was also decreasing for both.

Each English sentence had 47 words in it (after padding), and each German sentence had 54 words in it (after padding). I had 7000 English and 7000 German sentences in the training set and 3000 in the validation set.

I tried almost everything: different learning rates, different optimizers, different batch sizes, different activation functions in the model, applying batch and layer normalization, and different numbers of LSTM units for the encoder and decoder. Nothing made much difference, except the normalization and increasing the data, with which the loss goes down to approximately 1.5 but then again stops learning!

Why did this happen? Why did the model with Bahdanau attention fail while the one without any kind of attention was performing well?

Edit 1 - I tried applying LayerNormalization before the attention, after the attention, and both before and after the attention. The results were approximately the same in each case. But this time the loss went from approximately 8.0 to 2.1 in 5 epochs, and again was not learning much after that. Most of the learning happened in the first epoch, since it reached a loss of approximately 2.6 by the end of epoch 1, reached 2.1 in the next epoch, and then again barely improved. The placements I tried are sketched below.
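
A sketch of what I mean by "before" and "after" the attention, using the `BahdanauAttention` layer sketched above (the tensors here are random stand-ins, just to show where the normalization sits):

```python
import tensorflow as tf

# Stand-in tensors; in the real model these come from the encoder and decoder.
batch, src_len, units = 32, 47, 256
enc_outputs = tf.random.normal((batch, src_len, units))  # encoder output sequence
dec_state = tf.random.normal((batch, units))             # decoder hidden state

attention = BahdanauAttention(units)
ln = tf.keras.layers.LayerNormalization()

# "before the attention": normalize the decoder state that queries attention
context, _ = attention(ln(dec_state), enc_outputs)

# "after the attention": normalize the context vector attention returns
context, _ = attention(dec_state, enc_outputs)
context = ln(context)
```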

Still, the model without any attention outperforms the one with both attention and LayerNormalization. What could be the reason for this? Are the results that I got even possible? How can a plain encoder-decoder network, without any kind of normalization and without any dropout layers, perform better than the model with both attention and LayerNormalization?

Edit 2 - I tried increasing the data (to 7 times more than before), and this time both models' performance improved a lot. But still, the model without attention performed better than the model with attention. Why is this happening?

Edit 3 - I tried to debug the model by first passing just one sample from the whole training dataset. The loss started at approximately 9.0, kept reducing, and converged to 0. Then I tried passing 2 samples: the loss again started at approximately 9.0, but this time it just wandered between 1.5 and 2.0 for the first 400 epochs and then reduced slowly. This is a plot of how the loss decreases when I trained with just 2 samples:

[plot: training loss vs. epochs, 2 samples]

This is a plot of how the loss decreases when I trained with just 1 sample:

[plot: training loss vs. epochs, 1 sample]
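
The sanity check itself looks roughly like this (the toy model and random data below are stand-ins for my real seq2seq model and dataset; the point is only that the loss on 1-2 memorized samples should approach 0):

```python
import numpy as np
import tensorflow as tf

# Overfit-on-a-tiny-subset check: a model that can learn at all should be
# able to memorize 1-2 samples and drive the training loss to ~0.
vocab, src_len, tgt_len = 100, 47, 54
tiny_src = np.random.randint(1, vocab, size=(2, src_len))  # 2 padded source sentences
tiny_tgt = np.random.randint(1, vocab, size=(2, tgt_len))  # 2 padded target sentences

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab, 64),
    tf.keras.layers.LSTM(64),                         # encoder
    tf.keras.layers.RepeatVector(tgt_len),            # bridge to target length
    tf.keras.layers.LSTM(64, return_sequences=True),  # decoder
    tf.keras.layers.Dense(vocab, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
hist = model.fit(tiny_src, tiny_tgt, batch_size=2, epochs=500, verbose=0)
print(hist.history["loss"][-1])  # should approach 0 if the model can memorize
```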

  • 1
    It seems that you have a normalization issue. Do you apply any kind of normalization in your model? If not, try applying LayerNormalization after the attention layer (or before; test both) and then compare both ways. – Minions Oct 25 '20 at 09:56
  • @Ghanem I tried what you said and I have added the results of LayerNormalization in the edit. –  Oct 25 '20 at 12:59
  • `Are the results that I got even possible?` Why not! Adding attention or any auxiliary layer doesn't guarantee better performance. Do you use word embeddings? Which ones? – Minions Oct 25 '20 at 13:45
  • @Ghanem Yes, I use word embeddings. But I do not use any pre-trained word embeddings; I train my own using the `tf.keras.layers.Embedding` layer. –  Oct 25 '20 at 14:46
  • Ok, so you train them. Try using pretrained embeddings; it's worth it. – Minions Oct 25 '20 at 14:53
  • @Ghanem I have never used pretrained embeddings. Which pretrained embeddings should I use for this task? And even if I use them, they will improve the performance of both models (the one with attention and the one without), and the model without attention would still be better than the other. What I want to know is why the model without attention is performing better than the one with attention. –  Oct 25 '20 at 15:08
  • Does it work? I mean, if both models are awful, maybe it's not about attention. Maybe you need more data, or more variations of similar sentences (with some sort of data augmentation), or pre-training of your embeddings on a larger dataset to compensate for the small dataset size. – Mehdi Oct 26 '20 at 05:13
  • @Mehdi You can say that the model with attention is awful, but the one without attention is not, because its loss reached 1.0 in 10 epochs and was still decreasing. –  Oct 26 '20 at 07:06
  • Ok. I meant it is possible that data is not enough to learn translations regardless of the model choice. So, if that was the case, then the loss would be irrelevant. – Mehdi Oct 26 '20 at 07:12
  • @Mehdi I increased the data, and both models' performance increased a lot. But still, the model without attention performs better than the one with attention. I have added this information to the post as an edit. I don't understand why this is happening. –  Oct 26 '20 at 11:01
  • @NITINAGARWAL can you please share the code of both your models (with or without attention)? Maybe, there is a subtle bug in the implementation which makes the model with attention worse. – David Dale Oct 29 '20 at 09:46
  • @DavidDale I had the same doubt! So I asked whether it was a correct implementation of Bahdanau attention on Code Review: [Here](https://codereview.stackexchange.com/questions/251056/convert-an-english-sentence-to-german-using-bahdanau-attention) –  Oct 29 '20 at 11:14
  • Could you please also show the code of model without attention so that we could directly compare them? I suspect your decoder loses information about its previous state, but it's better to check. – David Dale Oct 31 '20 at 07:13
  • @DavidDale Yes, it was an implementation issue... Fixing that makes the attention model perform better than the normal encoder-decoder model! –  Oct 31 '20 at 12:45

1 Answer

3

Thank you everyone for the help... It was an implementation issue. Fixing that makes the attention model perform better than the normal encoder-decoder model!
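
For anyone hitting the same symptom: David Dale's comment on the question points at the likely class of bug, a decoder that loses information about its previous state, so attention ends up querying the same vector at every step. Below is a hedged sketch of the corrected pattern, not the original code; all names and tensors are illustrative stand-ins, and it reuses the `BahdanauAttention` layer sketched in the question.

```python
import tensorflow as tf

# Stand-in tensors; in a real model these come from the encoder and embeddings.
batch, src_len, tgt_len, units = 4, 47, 54, 64
enc_outputs = tf.random.normal((batch, src_len, units))       # encoder output sequence
state = [tf.zeros((batch, units)), tf.zeros((batch, units))]  # decoder [h, c], seeded by encoder
dec_inputs = tf.random.normal((batch, tgt_len, units))        # embedded target tokens

attention = BahdanauAttention(units)  # the layer sketched in the question
decoder_cell = tf.keras.layers.LSTMCell(units)

for t in range(tgt_len):
    # BUG pattern: querying attention with the *encoder's* final state at every
    # step, so the context never reflects what the decoder has emitted so far.
    # FIX: query with the decoder's current state and carry that state forward.
    context, _ = attention(state[0], enc_outputs)
    step_input = tf.concat([dec_inputs[:, t], context], axis=-1)
    output, state = decoder_cell(step_input, states=state)  # state persists across steps
```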

  • 1
    What was the implementation issue? Can you please update this post with the details, as I'm facing a similar issue with an attention model not performing as well as a basic LSTM-based encoder-decoder model? – Sukhmani Kaur Thethi Nov 16 '21 at 12:31