I am creating an LSTM for sentiment analysis on (a subset of) the IMDB dataset, using Keras. My training, validation, and test accuracy all improve dramatically if I add a Flatten layer before the final Dense layer:
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

def lstm_model_flatten():
    embedding_dim = 128
    model = Sequential()
    # vocab_size and maxlen (500) are defined earlier in my preprocessing
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
    # return_sequences=True keeps the hidden state for every timestep: (None, maxlen, 128)
    model.add(layers.LSTM(128, return_sequences=True, dropout=0.2))
    # Flatten layer
    model.add(layers.Flatten())
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    return model
This overfits quickly, but the validation accuracy gets up to around 76%:
Model: "sequential_43"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_42 (Embedding) (None, 500, 128) 4768256
_________________________________________________________________
lstm_63 (LSTM) (None, 500, 128) 131584
_________________________________________________________________
flatten_10 (Flatten) (None, 64000) 0
_________________________________________________________________
dense_40 (Dense) (None, 1) 64001
=================================================================
Total params: 4,963,841
Trainable params: 4,963,841
Non-trainable params: 0
_________________________________________________________________
Epoch 1/7
14/14 [==============================] - 26s 2s/step - loss: 0.6911 - accuracy: 0.5290 - val_loss: 0.6802 - val_accuracy: 0.5650
Epoch 2/7
14/14 [==============================] - 23s 2s/step - loss: 0.6451 - accuracy: 0.6783 - val_loss: 0.6074 - val_accuracy: 0.6950
Epoch 3/7
14/14 [==============================] - 23s 2s/step - loss: 0.4594 - accuracy: 0.7910 - val_loss: 0.5237 - val_accuracy: 0.7300
Epoch 4/7
14/14 [==============================] - 23s 2s/step - loss: 0.2566 - accuracy: 0.9149 - val_loss: 0.4753 - val_accuracy: 0.7650
Epoch 5/7
14/14 [==============================] - 23s 2s/step - loss: 0.1397 - accuracy: 0.9566 - val_loss: 0.6011 - val_accuracy: 0.8050
Epoch 6/7
14/14 [==============================] - 23s 2s/step - loss: 0.0348 - accuracy: 0.9898 - val_loss: 0.7648 - val_accuracy: 0.8100
Epoch 7/7
14/14 [==============================] - 23s 2s/step - loss: 0.0136 - accuracy: 0.9955 - val_loss: 0.8829 - val_accuracy: 0.8150
Using the same architecture without the flatten layer (and with return_sequences=False on the LSTM layer) only produces a validation accuracy of around 50%. That variant is sketched below.
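For reference, the no-flatten variant I compared against looks roughly like this (a minimal sketch; the function name lstm_model_no_flatten is just for illustration, and the only changes are return_sequences=False and removing the Flatten layer):

def lstm_model_no_flatten():
    embedding_dim = 128
    model = Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
    # return_sequences=False: only the final hidden state is passed on, shape (None, 128)
    model.add(layers.LSTM(128, return_sequences=False, dropout=0.2))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    return model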
The comments on this post recommend using return_sequences=False before the dense layer rather than a flatten layer.
But why is that the case? Is it ok to use a flatten layer if it improves my model? What exactly is the flatten layer doing here, and why does it improve the accuracy?