
I am creating an LSTM for sentiment analysis with (a subset of) the IMDB dataset, using Keras. My training, validation, and test accuracy all improve dramatically if I add a Flatten layer before the final Dense layer:

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

# vocab_size and maxlen are set during preprocessing (maxlen = 500 here)
def lstm_model_flatten():
    embedding_dim = 128
    model = Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
    model.add(layers.LSTM(128, return_sequences=True, dropout=0.2))
    # Flatten the (maxlen, 128) sequence of hidden states into one long vector
    model.add(layers.Flatten())
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    return model

This overfits quickly, but the validation accuracy gets up to around 76%:

Model: "sequential_43"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_42 (Embedding)     (None, 500, 128)          4768256   
_________________________________________________________________
lstm_63 (LSTM)               (None, 500, 128)          131584    
_________________________________________________________________
flatten_10 (Flatten)         (None, 64000)             0         
_________________________________________________________________
dense_40 (Dense)             (None, 1)                 64001     
=================================================================
Total params: 4,963,841
Trainable params: 4,963,841
Non-trainable params: 0
_________________________________________________________________
Epoch 1/7
14/14 [==============================] - 26s 2s/step - loss: 0.6911 - accuracy: 0.5290 - val_loss: 0.6802 - val_accuracy: 0.5650
Epoch 2/7
14/14 [==============================] - 23s 2s/step - loss: 0.6451 - accuracy: 0.6783 - val_loss: 0.6074 - val_accuracy: 0.6950
Epoch 3/7
14/14 [==============================] - 23s 2s/step - loss: 0.4594 - accuracy: 0.7910 - val_loss: 0.5237 - val_accuracy: 0.7300
Epoch 4/7
14/14 [==============================] - 23s 2s/step - loss: 0.2566 - accuracy: 0.9149 - val_loss: 0.4753 - val_accuracy: 0.7650
Epoch 5/7
14/14 [==============================] - 23s 2s/step - loss: 0.1397 - accuracy: 0.9566 - val_loss: 0.6011 - val_accuracy: 0.8050
Epoch 6/7
14/14 [==============================] - 23s 2s/step - loss: 0.0348 - accuracy: 0.9898 - val_loss: 0.7648 - val_accuracy: 0.8100
Epoch 7/7
14/14 [==============================] - 23s 2s/step - loss: 0.0136 - accuracy: 0.9955 - val_loss: 0.8829 - val_accuracy: 0.8150

Using the same architecture without the flatten layer (and using return_sequences = False on the LSTM layer) only produces a validation accuracy of around 50%.
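For reference, a minimal sketch of that no-Flatten variant, assuming the same hyperparameters as lstm_model_flatten (the function name is mine; vocab_size and maxlen as set during preprocessing):

def lstm_model_last_state():
    embedding_dim = 128
    model = Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
    # return_sequences=False: the LSTM emits only its final hidden state, shape (None, 128)
    model.add(layers.LSTM(128, return_sequences=False, dropout=0.2))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model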

The comments on this post recommend using return_sequences = False before the Dense layer, rather than a Flatten layer.

But why is that the case? Is it ok to use a flatten layer if it improves my model? What exactly is the flatten layer doing here, and why does it improve the accuracy?

  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). – desertnaut Apr 05 '21 at 11:56
  • Sorry about that. I've removed the machine-learning tag. – treen Apr 05 '21 at 11:58
  • I am afraid you missed the point; please notice that the issue of a question being on-topic here or not has only to do with the *content* of the question, and it can never be resolved by tag manipulation alone. If the issue was the tag itself, I would have just removed it myself without any further action. – desertnaut Apr 05 '21 at 12:00
  • The change in accuracy is likely coming more from the change of using `return_sequences = True` than from using `Flatten`. `return_sequences = True` will make each step of the LSTM output a value rather than just the final step. This means the last parts of the network have many more output values to work with. The `Flatten` layer just changes the dimensional shape of the outputs. – golmschenk Apr 05 '21 at 12:01

1 Answer


An LSTM layer consists of LSTM cells that are processed sequentially. As seen in the figure below, the first cell takes an input/embedding and calculates a hidden state, and the next cell uses its own input together with the hidden state from the previous time step to compute its own hidden state; the arrows between the cells are what pass those hidden states along. If you set return_sequences=False, the LSTM layer outputs only the very last hidden state (h_4 in the figure). All the information from all inputs and cells is then squeezed into a single fixed-size vector, which cannot hold very much. This is why your accuracy is poor when you use only the last hidden state.
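A quick sketch (with toy random data, and tf.keras assumed) of how return_sequences changes what the LSTM layer hands to the next layer:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((2, 500, 128))  # (batch, timesteps, features), mirroring the embedding output above
print(layers.LSTM(128, return_sequences=True)(x).shape)   # (2, 500, 128): one hidden state per time step
print(layers.LSTM(128, return_sequences=False)(x).shape)  # (2, 128): only the last hidden state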

When you set return_sequences=True, the LSTM layer outputs every hidden state, so the next layers have access to all of them and therefore to much more information. However, the LSTM layer then returns a matrix per sample, as you can see in your model summary: its output shape is (None, 500, 128). None is the number of samples in your batch, which you can ignore; 500 is your input (sequence) length, and 128 is the hidden state size. Your final Dense(1) layer needs a single vector per sample to produce one prediction; applied directly to that sequence it would instead produce one output per time step. That is why you apply Flatten: it simply unrolls the 2D matrix of hidden states into a 1D vector, so the size of your Flatten output is 500 * 128 = 64000. And of course, with more hidden states the accuracy is better, since they contain more information.

[figure: An example of LSTM networks]
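A small sketch (again with a toy tensor, tf.keras assumed) of why the Flatten step matters before the final Dense(1):

import tensorflow as tf
from tensorflow.keras import layers

h = tf.random.normal((2, 500, 128))                # all hidden states returned by the LSTM
print(layers.Dense(1)(h).shape)                    # (2, 500, 1): Dense acts on the last axis, one output per time step
print(layers.Flatten()(h).shape)                   # (2, 64000): 500 * 128 hidden-state values per sample
print(layers.Dense(1)(layers.Flatten()(h)).shape)  # (2, 1): a single prediction per sample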

  • Thanks very much, this is very helpful. I had been thinking that return_sequences=True would only be used if there was more than one LSTM layer, to pass the outputs of one LSTM layer to the next. I had understood that the last (or single) LSTM layer should have return_sequences = False. But it sounds like it's also ok to pass all of the hidden state outputs to the dense layer instead? – treen Apr 05 '21 at 12:29
  • Glad that I could help! I would be happy if you could upvote and accept the answer in case you found it helpful. You are absolutely right: that is the most common use case, but there is no rule that a single layer has to have return_sequences=False. It just depends on what you want to do with the hidden states and how you want to combine them. And yes, it is okay to use all hidden state outputs, and even recommended, since the following layers then have access to much more information. – Berkay Berabi Apr 05 '21 at 12:33
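As one hedged illustration of "how you want to combine them" (not taken from the thread; GlobalMaxPooling1D is just one common alternative), the hidden states can also be pooled over the time dimension instead of flattened, which avoids the large 64000-weight Dense matrix:

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

# vocab_size, embedding_dim and maxlen as in the question
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.LSTM(128, return_sequences=True, dropout=0.2))
model.add(layers.GlobalMaxPooling1D())  # (None, 500, 128) -> (None, 128): max over the 500 time steps
model.add(layers.Dense(1, activation='sigmoid'))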