
I have a question regarding varying sequence lengths for LSTMs in Keras. I'm passing batches of size 200 and sequences of variable length (= x), with 100 features for each object in the sequence (=> [200, x, 100]), into an LSTM:

LSTM(100, return_sequences=True, stateful=True, input_shape=(None, 100), batch_input_shape=(200, None, 100))

I'm fitting the model on the following randomly created matrices:

x_train = np.random.random((1000, 50, 100))
x_train_2 = np.random.random((1000, 10,100))

If I have understood LSTMs (and the Keras implementation) correctly, x should correspond to the number of LSTM cells, and for each LSTM cell a state and three matrices have to be learned (for the input, state and output of the cell). How is it possible to pass varying sequence lengths into the LSTM without padding up to a specified maximum length, as I do here? The code runs, but in my understanding it shouldn't. It is even possible to pass another x_train_3 with a sequence length of 60 afterwards, but there shouldn't be states and matrices for the extra 10 cells.

By the way, I'm using Keras version 1.0.8 and Tensorflow GPU 0.9.

Here is my example code:

from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np
from keras import backend as K

with K.get_session():

    # create model
    model = Sequential()
    model.add(LSTM(100, return_sequences=True, stateful=True, input_shape=(None, 100),
             batch_input_shape=(200, None, 100)))
    model.add(LSTM(100))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])

    # Generate dummy training data
    x_train = np.random.random((1000, 50, 100))
    x_train_2 = np.random.random((1000, 10, 100))
    y_train = np.random.random((1000, 2))
    y_train_2 = np.random.random((1000, 2))

    # Generate dummy validation data
    x_val = np.random.random((200, 50, 100))
    y_val = np.random.random((200, 2))

    # fit and eval models
    model.fit(x_train, y_train, batch_size=200, nb_epoch=1, shuffle=False, validation_data=(x_val, y_val), verbose=1)
    model.fit(x_train_2, y_train_2, batch_size=200, nb_epoch=1, shuffle=False, validation_data=(x_val, y_val), verbose=1)
    score = model.evaluate(x_val, y_val, batch_size=200, verbose=1)

1 Answer


First: it doesn't seem you need stateful=True or batch_input_shape. These are intended for when you want to divide very long sequences into parts and train each part separately, without the model thinking that the sequence has come to an end.

When you use stateful layers, you have to reset/erase the states/memory manually when you decide that a certain batch is the last part of the long sequence(s).

You seem to be working with entire sequences. No stateful is needed.
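For reference, here is a minimal sketch of the situation stateful=True is actually meant for: long sequences split into chunks that are fed in order, with model.reset_states() called once the whole sequence has been consumed. The total length of 500, the chunk size of 50 and the use of the same targets for every chunk are illustrative assumptions, not taken from your data.

from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

# 200 very long sequences (500 steps), each split into chunks of 50 steps
long_seq = np.random.random((200, 500, 100))
targets = np.random.random((200, 2))

model = Sequential()
model.add(LSTM(100, stateful=True, batch_input_shape=(200, None, 100)))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# feed the chunks in order; the layer keeps its states between calls
for start in range(0, 500, 50):
    model.train_on_batch(long_seq[:, start:start + 50, :], targets)

# the long sequences are over: erase the memory before the next ones
model.reset_states()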

Padding is not strictly necessary: you can use padding + masking to ignore the additional steps (see the sketch below). If you don't want to use padding, you can separate your data into smaller batches, each batch with a single sequence length. See this: Keras misinterprets training data shape
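A minimal sketch of the padding + masking route, assuming 0.0 is used as the padding value (any sentinel works, as long as it matches mask_value and never occurs as a real time step):

from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense
import numpy as np

# pad the short sequences (10 steps) with zeros up to the longest length (50)
x_long = np.random.random((1000, 50, 100))
x_short = np.zeros((1000, 50, 100))
x_short[:, :10, :] = np.random.random((1000, 10, 100))   # steps 10..49 stay zero

x_train = np.concatenate([x_long, x_short])
y_train = np.random.random((2000, 2))

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(50, 100)))  # all-zero steps are skipped
model.add(LSTM(100))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.fit(x_train, y_train, batch_size=200, nb_epoch=1)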

The sequence length (time steps) does not change the number of cells/units or the weights. It's possible to train using different lengths; the only dimension that cannot change is the number of features.


Input dimensions:

The input dimensions are (NumberOfSequences, Length, Features).
There is absolutely no relation between the input shape and the number of cells; the Length dimension only sets the number of steps (recursions).
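You can check this directly (a small sketch, just probing shapes): the parameter count is fixed by the units and features, and the same weights accept inputs of any length.

from keras.models import Sequential
from keras.layers import LSTM
import numpy as np

model = Sequential()
model.add(LSTM(100, input_shape=(None, 100)))   # length dimension left open
model.compile(loss='mse', optimizer='rmsprop')

model.summary()   # the parameter count is fixed, whatever the sequence length

print(model.predict(np.random.random((2, 50, 100))).shape)   # (2, 100)
print(model.predict(np.random.random((2, 10, 100))).shape)   # (2, 100), same weights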

Cells:

Cells in LSTM layers behave exactly like "units" in dense layers.

A cell is not a step. The number of cells is just the number of "parallel" operations; the group of cells performs the recurrent operations and steps together.

There is conversation between the cells, as @Yu-Yang well noticed in the comments. But the idea of them being the same entities carried over through the steps is still valid.

Those little blocks you see in images such as the ones below are not cells, they are steps.

Variable lengths:

That said, the length of your sequences doesn't affect the number of parameters (matrices) in the LSTM layer at all. It only affects the number of steps.

The fixed set of matrices inside the layer will be recalculated more times for long sequences and fewer times for short ones. But in all cases, it's the same matrices being updated and passed forward to the next step.

Sequence lengths vary only the number of updates.

The layer definition:

The number of cells can be any number at all; it just defines how many parallel "mini brains" will be working together (meaning a more or less powerful network, with more or fewer output features).

LSTM(units=78) 
#will work perfectly well, and will output 78 "features".
#although it will be less intelligent than one with 100 units, outputting 100 features.    

There is a unique weight matrix and a unique state/memory matrix that keep being passed forward to the next steps. These matrices are simply "updated" at each step, but there isn't one matrix per step.
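In rough numpy terms (a simplified vanilla-RNN step rather than the full LSTM gates, just to show the reuse): the same W, U and b are applied at every step, and only the state h changes.

import numpy as np

features, units, length = 100, 100, 50
W = np.random.random((features, units))   # input weights, created once
U = np.random.random((units, units))      # recurrent weights, created once
b = np.zeros(units)

x = np.random.random((length, features))  # one sequence
h = np.zeros(units)                       # the state, updated at every step

for t in range(length):                   # a longer sequence = more iterations,
    h = np.tanh(np.dot(x[t], W) + np.dot(h, U) + b)   # never more matrices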

Image examples:

[image: the classic unrolled recurrence diagram — one repeated block "A", fed by inputs X0, X1, X2, ... from below and by the previous step from the left]

Each box "A" is a step where the same group of matrices (states,weights,...) is used and updated.

There aren't 4 cells, but one and the same cell performing 4 updates, one update for each input.

Each X1, X2, ... is one slice of your sequence in the length dimension.


[image: the same unrolled diagram for a longer sequence — more repetitions of the same block "A"]

Longer sequences will reuse and update the matrices more times than shorter sequences. But it's still one cell.


[image: the internals of a single step, where all the cells of the layer work together]

The number of cells does affect the size of the matrices, but that size doesn't depend on the sequence length. All cells work together in parallel, with some conversation between them.
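You can see that by inspecting the layer's weight arrays (a sketch; the exact layout differs between Keras versions — recent versions concatenate the four gate matrices — but every shape involves only the number of features and units, never the length):

from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(100, input_shape=(None, 100)))   # 100 features, any length

for w in model.layers[0].get_weights():
    print(w.shape)   # shapes built from features and units only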


Your model

In your model you can create the LSTM layers like this:

model.add(LSTM(anyNumber, return_sequences=True, input_shape=(None, 100)))
model.add(LSTM(anyOtherNumber))

By using None in the input_shape like that, you are already telling your model that it accepts sequences of any length.

All you have to do is train, and your code for training is OK. The only thing that is not allowed is to mix different lengths inside the same batch (the same numpy array). So, as you have done, create a batch for each length and train on each batch, as in the sketch below.
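Putting it together, a sketch of your script with stateful=True and batch_input_shape removed (same Keras 1.x nb_epoch argument and the same random dummy data as in your question):

from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

model = Sequential()
model.add(LSTM(100, return_sequences=True, input_shape=(None, 100)))
model.add(LSTM(100))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# one "bucket" per sequence length; lengths may differ between fits,
# but must be constant inside a single array/batch
x_train = np.random.random((1000, 50, 100))
y_train = np.random.random((1000, 2))
x_train_2 = np.random.random((1000, 10, 100))
y_train_2 = np.random.random((1000, 2))

x_val = np.random.random((200, 50, 100))
y_val = np.random.random((200, 2))

model.fit(x_train, y_train, batch_size=200, nb_epoch=1, validation_data=(x_val, y_val))
model.fit(x_train_2, y_train_2, batch_size=200, nb_epoch=1, validation_data=(x_val, y_val))
print(model.evaluate(x_val, y_val, batch_size=200))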

  • Any reference on the "cells" you've described? I've never seen LSTM being described like that. Also, `LSTM(cells=78)` definitely won't work. There's just no such keyword in the function call. By "78 features", maybe you are talking about `units`? – Yu-Yang Sep 22 '17 at 15:18
  • If "cells" actually means "units" (as seen in some papers), then normally one won't say something like "a LSTM layer with 2 cells, they work in parallel". The sentence "Each cell performs its own recurrent operations and steps, each cell has its own weights and states" also makes no sense at all. – Yu-Yang Sep 22 '17 at 15:35
  • The last figures are also confusing. Correct me if I'm wrong, but I assume the inputs should be `x` coming from below. Then what's the "input" coming from the left? – Yu-Yang Sep 22 '17 at 15:38
  • The inputs come both from below and from the left. `X` are inputs, and `h` are outputs. But `X1` will wait for the result of `X0`, because `X1` depends on both `H0` and the updated states from the first step. (That's why there are horizontal arrows between each A). – Daniel Möller Sep 22 '17 at 16:06
  • Why does the "parallel" working not make sense? It's exactly what happens. But for computer and mathematical purposes, all individual weights are stacked in a huge single matrix. – Daniel Möller Sep 22 '17 at 16:08
  • Because they don't work in parallel. Whenever there's a matrix-vector product `y = W * x`, each unit (or "cell") in `y` will depend on multiple units of `x`. – Yu-Yang Sep 22 '17 at 16:12
  • Just to clarify, by "input coming from the left", I mean the white arrows, not those between each A. – Yu-Yang Sep 22 '17 at 16:20
  • The multiple elements they depend on are the multiple "features". The weight matrix has shape `(cellsOrUnitsInThisLayer, inputFeatures)`. It will be multiplied by a vector shaped like `(features,)`. (One matrix multiplication per sample). If you perform this multiplication manually, you will see that each row in the output comes from an individual row in the weight matrix. You can conclude from that that each row in the result is mathematically independent of the other rows. – Daniel Möller Sep 22 '17 at 16:24
  • If you look in the [source code](https://github.com/fchollet/keras/blob/master/keras/layers/recurrent.py#L1532), though, the weights are transposed and the multiplication is inverted. The operation is `x * W`, where `x` is `(1,features)` and `W` is `(features,units)`. The result will be in columns, each column totally independent from the others. If you replace the 1 by the batch size, you will get independent results for each sample. --- There are actually 4 different weight matrices in the code. – Daniel Möller Sep 22 '17 at 16:26
  • I'm not talking about multiplying the input. You also have to perform multiplication on the hidden state `h`. The matrices are of shape `(units, units)` and the, e.g. 78, cells are dependent on each other. – Yu-Yang Sep 22 '17 at 16:33
  • In a `LSTM(78)` layer, there are no 78 streams walking in parallel. The 78 units at time step `t+1` will depend on all 78 units at time `t`. – Yu-Yang Sep 22 '17 at 16:36
  • Hmmm, you got me there. Even though, it's clear that they're totally independent from the sequence length, right? The length participates in the amount of steps. – Daniel Möller Sep 22 '17 at 16:40
  • Yes, I agree with that. The same weight matrices are reused in all steps and the number of weight matrices won't grow with the number of steps. It's just the description about cells that confuses me. – Yu-Yang Sep 22 '17 at 16:49
  • Thanks for the detailed explanation of LSTMs. The original question was asking about Keras and how to avoid padding with LSTM(). I think it would be helpful to show what the two lines involving `LSTM(...)` should look like. – Dustin Boswell Sep 22 '17 at 19:05
  • You were already right in your method for training. The only correction that was important was removing the `stateful=True` and the `batch_input`. – Daniel Möller Sep 22 '17 at 19:14
  • You could use this answer to separate a batch into many batches with a single length each: https://stackoverflow.com/questions/46144191/keras-misinterprets-training-data-shape/46146146#46146146 – Daniel Möller Sep 22 '17 at 19:44
  • Well, thank you for your detailed answers and comments. By the time I asked this question, I assumed each timestep has its own matrices, which is nonsense. That's why I thought it impossible to vary the sequence lengths while training, because, where would the new matrices come from if a longer sequence suddenly appeared? Also, you're right about the stateful parameter, which indeed wasn't needed here. I just padded the sequences with zeros and used masking to handle the varying lengths of sequences. – V1nc3nt Sep 23 '17 at 09:52
  • It's a shame to have to use padding, because if your average input is much smaller than the maximum, you're wasting a lot of space/time dealing with those padded values. – Dustin Boswell Sep 25 '17 at 20:24