How to set the input for LSTM in Keras

Question

I'm new to Keras, and I find it hard to understand the shape of the input data for the LSTM layer. The Keras documentation says that the input data should be a 3D tensor with shape (nb_samples, timesteps, input_dim). I'm having trouble understanding this format. Does the timesteps variable represent the number of timesteps the network remembers?

In my data, a few time steps affect the output of the network, but I do not know how many in advance, i.e. I can't say that the previous 10 samples affect the output. For example, the input can be words that form sentences. There is an important correlation between the words in each sentence. I don't know the length of a sentence in advance, and this length also varies from one sentence to another. I do know when the sentence ends (i.e. I have a period that indicates the ending). Two different sentences have no effect on each other, so there is no need to remember the previous sentence.

I'm using the LSTM network for learning a policy in reinforcement learning, so I don't have a fixed data set. The agent's policy will change the length of the sentence.

How should I shape my data? How should it be fed into the Keras LSTM layer?

Depending on how many resources you are willing to spend, you should pick a maximum sentence length, then truncate or zero-pad all examples to that length. Keras requires it to be fixed. — Julio Daniel Reyes, Oct 07 '17 at 14:53
So what should the input shape be in this case? input_shape=(maximum_sentence_length,)? — Andrey Gurevich, Oct 07 '17 at 14:57
You should split your text into sentences; the number of sentences you have is then your `nb_samples`. The `timesteps` is the maximum number of words/characters per sentence. Then `input_dim` is the size of the representation of those words/characters (e.g. if you use word embeddings, the embedding size). — Julio Daniel Reyes, Oct 07 '17 at 15:04
Thanks! One last thing: if the maximum sentence length is 5 and the sentence is "I am Andrey", should I represent it as (0, 0, I, am, Andrey)? — Andrey Gurevich, Oct 07 '17 at 15:08
(In my opinion) it should be (I, am, Andrey, 0, 0), but I've seen it the other way too. — Julio Daniel Reyes, Oct 07 '17 at 15:18

score 2 · Answer 1 · answered Oct 07 '17 at 16:02

The timesteps dimension is the total length of each sequence.

If you're working with words, it's the number of words in each sentence.
If you're working with chars, it's the number of chars in each sequence.

In the variable sentence length case, you should set that dimension to None:

from keras.layers import Input, LSTM

# For functional API models (nb_samples is not part of this definition):
inputTensor = Input(shape=(None, input_dim))

# For sequential models, as the first layer (nb_samples is not part of this definition):
LSTM(units, input_shape=(None, input_dim))

There are two possible ways of working with variable lengths in Keras:

Fixed length with padding
Variable length separated in batches with same length

In the fixed-length case, you create a meaningless dummy word/character and pad your sentences with it up to a maximum length, so that all sentences have the same length. Then you add a Masking() layer that makes the following layers ignore that dummy word/char.
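
The padding step alone can be sketched in plain Python. The helper name `pad_to_length`, the word ids, and the choice of 0 as the dummy id are assumptions made for illustration; Keras also ships a `pad_sequences` utility for the same purpose.

```python
def pad_to_length(sequences, max_len, pad_value=0):
    """Pad at the end, or truncate, so every sequence has max_len items.

    Hypothetical helper for illustration; pad_value is the dummy id that
    a masking step would later be told to ignore.
    """
    padded = []
    for seq in sequences:
        clipped = list(seq[:max_len])                       # truncate if too long
        clipped += [pad_value] * (max_len - len(clipped))   # fill if too short
        padded.append(clipped)
    return padded


# Made-up word ids; real ids start at 1 so that 0 stays free for padding.
sentences = [[4, 7, 9], [3, 5], [8, 2, 6, 1, 4, 7]]
padded = pad_to_length(sentences, max_len=5)
print(padded)
```

Every row now has the same length, so the batch can be stacked into an array of shape (nb_samples, timesteps), or (nb_samples, timesteps, input_dim) once each id is encoded as a vector.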

The Embedding layer already has a mask_zero parameter, so if you're working with embeddings, you can make id 0 the dummy char/word.
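
As a rough sketch of the embedding route, the model below reserves id 0 for padding and sets `mask_zero=True` on the Embedding layer. The vocabulary size, layer sizes, word ids, and the `tensorflow.keras` import path are all assumptions for illustration, not values from the question.

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size = 50    # assumed: ids 1..49 are real words, 0 is the padding id
embedding_dim = 8  # assumed embedding size (the LSTM's input_dim)

inputs = Input(shape=(None,), dtype="int32")   # variable number of timesteps
x = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs)
x = LSTM(16)(x)                                # the mask lets the LSTM skip padded steps
outputs = Dense(1, activation="sigmoid")(x)
model = Model(inputs, outputs)

# Two padded sentences of length 5; the trailing zeros are padding.
batch = np.array([[4, 7, 9, 0, 0],
                  [3, 5, 0, 0, 0]])
predictions = model.predict(batch, verbose=0)
print(predictions.shape)
```

Because the timesteps dimension is declared as None, the same model also accepts batches padded to a different length.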

In the variable-length case, you separate your input data into smaller batches, each containing only sequences of the same length, like here: Keras misinterprets training data shape
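
One way to form such batches is to group the sequences by their length, sketched here in plain Python. The helper name `group_by_length` and the word ids are hypothetical.

```python
from collections import defaultdict


def group_by_length(sequences):
    """Return {length: [sequences of that length]} (hypothetical helper)."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[len(seq)].append(seq)
    return dict(groups)


# Made-up word ids; each group can become one batch with no padding at all.
sentences = [[4, 7, 9], [3, 5], [1, 2, 8], [6, 6]]
batches = group_by_length(sentences)
for length, batch in sorted(batches.items()):
    print(length, batch)
```

Each group can then be converted to an array and passed to the model one batch at a time, for example with `train_on_batch`, provided the model's timesteps dimension was declared as None.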
