
I found a piece of code in Chapter 7, Section 1 of Deep Learning with Python, as follows:

from keras.models import Model
from keras import layers
from keras import Input

text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

# Our text input is a variable-length sequence of integers.
# Note that we can optionally name our inputs!
text_input = Input(shape=(None,), dtype='int32', name='text')

# Which we embed into a sequence of vectors of size 64
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input)

# Which we encoded in a single vector via a LSTM
encoded_text = layers.LSTM(32)(embedded_text)

# Same process (with different layer instances) for the question
question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input)
encoded_question = layers.LSTM(16)(embedded_question)

# We then concatenate the encoded question and encoded text
concatenated = layers.concatenate([encoded_text, encoded_question], axis=-1)

# And we add a softmax classifier on top
answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)

# At model instantiation, we specify the two inputs and the output:
model = Model([text_input, question_input], answer)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])

As you can see, this model's inputs carry no shape information about the raw data, so after the Embedding layer the input to the LSTM (that is, the output of the Embedding) is a variable-length sequence.

[image: output of model.summary()]

So I want to know:

  • in this model, how does Keras determine the number of LSTM units in the LSTM layer?
  • how does it deal with variable-length sequences?

Additional information: to explain what I mean by lstm_unit (I don't know what to call it, so I'm just showing an image):

[image: diagram of an LSTM unit/cell]

Rosand Liu
  • Check this out, to create a fixed sequence length, there's a utility function https://keras.io/preprocessing/sequence/#pad_sequences – jrjames83 Apr 19 '18 at 15:59
  • Since this doesn't qualify for a standalone answer, I'll leave a comment instead. Apart from zero-padding, 1-sized batches and size clustering referenced by @Daniel and @ely in their answers, there is another rarely used, yet extremely powerful, way to deal with variable-length inputs. It's called stateful learning (you might've noticed the `stateful` option in Keras' RNN layers). It's tricky to use right, but definitely worth knowing about. – Eli Korvigo Apr 19 '18 at 18:19

2 Answers


The provided recurrent layers inherit from a base implementation, keras.layers.Recurrent, which includes the option return_sequences, defaulting to False. This means that by default, recurrent layers will consume variable-length inputs and produce only the layer's output at the final sequential step.

As a result, there is no problem using None to specify a variable-length input sequence dimension.

However, if you wanted the layer to return the full sequence of outputs, i.e. the tensor of outputs for each step of the input sequence, then you'd have to deal further with the variable size of that output.
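For concreteness, here is a minimal sketch (the vocabulary size and dimensions are arbitrary, not taken from the question) showing how the two modes differ in output shape:

from keras import Input, layers
from keras import backend as K

tokens = Input(shape=(None,), dtype='int32')          # variable-length sequence of token ids
embedded = layers.Embedding(10000, 64)(tokens)        # -> (batch, None, 64)

last_step = layers.LSTM(32)(embedded)                         # return_sequences=False (the default)
all_steps = layers.LSTM(32, return_sequences=True)(embedded)  # one 32-d vector per time step

print(K.int_shape(embedded))    # (None, None, 64) -- still variable-length
print(K.int_shape(last_step))   # (None, 32)       -- fixed-size summary of the whole sequence
print(K.int_shape(all_steps))   # (None, None, 32) -- the variable length propagates onward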

You could do this by having the next layer also accept a variable-sized input, punting on the problem until later in your network, where eventually you either must calculate a loss function from some variable-length quantity, or must compute some fixed-length representation before continuing on to later layers, depending on your model.
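As a sketch of that first option, here is a hypothetical per-step classifier that keeps the full variable-length sequence all the way to the output (the 5-class output and all dimensions are made up for illustration):

from keras import Input, layers
from keras.models import Model

tokens = Input(shape=(None,), dtype='int32')
x = layers.Embedding(10000, 64)(tokens)
x = layers.LSTM(32, return_sequences=True)(x)                                 # keep the whole sequence
per_step = layers.TimeDistributed(layers.Dense(5, activation='softmax'))(x)  # one prediction per step

model = Model(tokens, per_step)                                  # output shape: (batch, None, 5)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

Here the loss is ultimately computed per time step, so the targets must match each batch's sequence length.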

Or you could do it by requiring fixed-length sequences, possibly by padding the end of the sequences with special sentinel values that merely indicate an empty sequence item, purely to pad out the length.

Separately, the Embedding layer is a very special layer that is built to handle variable-length inputs as well. The output will have a different embedding vector for each token of the input sequence, so the shape will be (batch size, sequence length, embedding dimension). Since the next layer is an LSTM, this is no problem ... it will happily consume variable-length sequences as well.

But as mentioned in the documentation on Embedding:

input_length: Length of input sequences, when it is constant.
      This argument is required if you are going to connect
      `Flatten` then `Dense` layers upstream
      (without it, the shape of the dense outputs cannot be computed).

If you want to go directly from Embedding to a non-variable-length representation, then you must supply the fixed sequence length as part of the layer.
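For example, a minimal sketch (with a made-up maximum length of 100 and a 10-class output) of going from Embedding through Flatten to Dense, which only works because the sequence length is fixed:

from keras import Input, layers
from keras.models import Model

max_len = 100                                                    # hypothetical fixed sequence length
tokens = Input(shape=(max_len,), dtype='int32')
x = layers.Embedding(10000, 64, input_length=max_len)(tokens)    # -> (batch, 100, 64)
x = layers.Flatten()(x)                                          # -> (batch, 6400); needs a known length
output = layers.Dense(10, activation='softmax')(x)

model = Model(tokens, output)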

Finally, note that when you express the dimensionality of the LSTM layer, such as LSTM(32), you are describing the dimensionality of the output space of that layer.

# example sequence of input, e.g. batch size is 1.
[
 [34], 
 [27], 
 ...
] 
--> # feed into embedding layer

[
  [64-d representation of token 34 ...],
  [64-d representation of token 27 ...],
  ...
] 
--> # feed into LSTM layer

[32-d output vector of the final sequence step of LSTM]

In order to avoid the inefficiency of a batch size of 1, one tactic is to sort your input training data by the sequence length of each example and then group examples into batches of a common sequence length, such as with a custom data generator (for example a keras.utils.Sequence) passed to fit_generator.

This has the advantage of allowing large batch sizes, which helps especially if your model needs something like batch normalization or involves GPU-intensive training, and even just for the benefit of a less noisy estimate of the gradient for batch updates. But it still lets you train on an input data set in which different examples have different sequence lengths.

More importantly though, it also has the big advantage that you do not have to manage any padding to ensure common sequence lengths in the input.
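A minimal sketch of that bucketing tactic, assuming the training data is a plain Python list of integer sequences with one label per sequence (all names here are made up for illustration):

import numpy as np

def length_bucketed_batches(sequences, labels, batch_size=32):
    # Sort example indices by sequence length so that neighbours have (nearly) equal lengths.
    order = np.argsort([len(s) for s in sequences])
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        max_len = max(len(sequences[i]) for i in idx)
        # Within a bucket, lengths match (or nearly match), so padding is minimal or absent.
        x = np.array([sequences[i] + [0] * (max_len - len(sequences[i])) for i in idx])
        y = np.array([labels[i] for i in idx])
        yield x, y

# for x_batch, y_batch in length_bucketed_batches(train_seqs, train_labels):
#     model.train_on_batch(x_batch, y_batch)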

ely

How does it deal with units?

Units are totally independent of length, so there is nothing special being done.

Length only increases the number of "recurrent steps", but the recurrent steps always use the same cells over and over.

The number of cells is fixed and defined by the user (the sketch after this list makes this concrete):

  • the first LSTM has 32 cells/units
  • the second LSTM has 16 cells/units
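A quick check (reusing the 64-d embedding and 32-unit LSTM from the question's text branch) that the weight count depends only on the number of units, never on the sequence length:

from keras import Input, layers
from keras.models import Model

def lstm_param_count(seq_len):
    inp = Input(shape=(seq_len, 64))      # seq_len may be a fixed number or None (variable)
    out = layers.LSTM(32)(inp)            # 32 cells/units, whatever the length
    return Model(inp, out).count_params()

print(lstm_param_count(10))     # 12416
print(lstm_param_count(1000))   # 12416 -- same weights, just more recurrent steps
print(lstm_param_count(None))   # 12416 -- variable-length input uses the same weights too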

How to deal with variable length?

  • Approach 1: create separate batches of one sequence each, with each batch having its own length, and feed each batch to the model individually. The methods train_on_batch and predict_on_batch inside a manual loop are the easiest way to do this.
    • Ideally, separate the batches by length, so that each batch collects all sequences of the same length.
  • Approach 2: create fixed-length batches by filling the unused tail of each sequence with 0 and using the parameter mask_zero=True in the embedding layers (see the sketch after this list).
    • Be careful not to use 0 as an actual word or meaningful data in the inputs of the embeddings.
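A minimal sketch of approach 2, with made-up sequences (0 is reserved purely for padding):

from keras import Input, layers
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences

sequences = [[5, 18, 7], [42, 3], [9, 11, 2, 6, 80]]            # variable-length token lists
padded = pad_sequences(sequences, maxlen=10, padding='post')    # shape (3, 10), zeros at the tail

tokens = Input(shape=(10,), dtype='int32')
x = layers.Embedding(10000, 64, mask_zero=True)(tokens)   # padded zeros are masked out
encoded = layers.LSTM(32)(x)                               # the LSTM ignores the masked steps
model = Model(tokens, encoded)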
Daniel Möller
  • A word of caution: creating length-1 batches would be unnecessary and highly inefficient. There is no problem having examples with different sequence lengths in the same batch. It's a standard practice. – ely Apr 19 '18 at 17:12
  • Numpy doesn't support a batch of different lengths. The best that can be done is separating batches per length. – Daniel Möller Apr 19 '18 at 17:13
  • You could provide a generator or other Python sequence types, even a simple list. Keep in mind that the original input is a batch of sequences where the sequences just contain tokens (e.g. a list of lists of integers). Beyond that point, the variable sequence size tensors are handled internally based on whichever backend you choose for Keras, and don't matter from the programmer's point of view. – ely Apr 19 '18 at 17:16
  • Since this is about Keras, I'm not sure this is true. I just tested and there is an associated error (Keras 2.1.0). Keras demands one numpy array; it doesn't accept lists or arrays of arrays. --- By the way, `model.fit_generator()` will internally call `model.train_on_batch()` in a manual loop (https://github.com/keras-team/keras/blob/master/keras/engine/training.py/#L2193) – Daniel Möller Apr 19 '18 at 17:27
  • Ah, yes, I believe you are correct. The Keras `fit` function does accept a list of ndarrays, but only for multiple inputs, not for sequences of the same input. Actually, the batching strategy seems to be to sort the training data by sequence length and then create batches where all items have the same sequence length. This way you don't incur the huge inefficiency of single-element batches, but still ensure a rectangular shape for each batch-specific input tensor. I'll update my answer to add this detail. – ely Apr 19 '18 at 18:11