
I have a more theoretical question I wasn't able to find an answer to. Let's say I have as input a list of numbers:

input = [0,5,0,4,3,3,2,1]

And let's say the first hidden layer consists of 3 LSTM nodes. How is the list now presented to the LSTM (with timesteps=8)?

My first idea is:

Input timestep 1:
node 1 = 0, node 2 = 0, node 3 = 0
Input timestep 2:
node 1 = 5, node 2 = 5, node 3 = 5
...

So each node sees the same input at every timestep.

My second idea is:

Input timestep 1:
node 1 = 0, node 2 = 5, node 3 = 0
Input timestep 2:
node 1 = 5, node 2 = 0, node 3 = 4
...
Input timestep 8:
node 1 = 1, node 2 = -, node 3 = -

In each timestep each node gets a different input; the input is like a sliding window moving from left to right over the list. In this case every element of the list (every number) is presented to the LSTM unequally often.

My last idea is:

Input timestep 1:
node 1 = 0, node 2 = 5, node 3 = 0
next timestep:
node 1 = 4, node 2 = 3, node 3 = 3
last timestep:
node 1 = 2, node 2 = 1, node 3 = -

So again each node gets a different input, but this time the window doesn't slide over the list; it jumps. In this case each number is presented to the LSTM only once.

I would guess that the first idea is how it works, but I don't know. Or is it completely different?


1 Answer


An RNN is usually used to recognize patterns in sequential data, i.e. the sequence must be fed to the cells in order for them to capture it. Your first idea does not feed in the sequence, so the network can't recognize any meaningful information, such as "the next symbol is likely to be smaller than the current one, except when there's a 0".

Here's how the input looks in most cases:

[figure: an unrolled RNN, with the inputs x[0], x[1], ... fed in one per timestep]

... where (x[0], x[1], ...) is the input.
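
To make this concrete, here is a minimal sketch of that layout, assuming Keras with the TensorFlow backend (the (batch, timesteps, features) shape convention is standard; everything else is just illustration):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

seq = np.array([0, 5, 0, 4, 3, 3, 2, 1], dtype="float32")

# Keras expects (batch, timesteps, features): here 1 sample, 8 timesteps, 1 feature.
x = seq.reshape(1, 8, 1)

# An LSTM layer with 3 units reads the sequence one timestep at a time,
# so the elements x[0], x[1], ..., x[7] arrive in order.
model = Sequential([LSTM(3, input_shape=(8, 1))])
print(model.predict(x).shape)  # (1, 3): one output vector of size 3 per sample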


Your second and third ideas differ only in how you split the long data into subsequences, and actually both options are possible, depending on the nature of the data (a small sketch of both splits follows this list):

  • When [0,5,0,4,3,3,2,1] is one big sentence, you'd like to capture the rules from all parts of it. For that you'd feed all subsequences of length 3 into the network, because any triplet can be meaningful. (Side note: there's also a variant of stateful RNN to help deal with this, but it's really a technical detail.)

  • When [0,5,0], [4,3,3] and [2,1] are different sentences, it doesn't make sense to learn the dependency like [0,4,3], where one starts the sequence with the last word of the previous sentence. Each sentence is likely to be independent, but within a sentence you'd want to slide over all windows of size 3. (Side note: stateful RNN can be useful in this case too, e.g. when there is a story and the meaning of the previous sentence may affect the understanding of the current one).
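
Here is a small illustration of the two splits in plain Python/NumPy (the variable names are just for the example):

import numpy as np

data = [0, 5, 0, 4, 3, 3, 2, 1]
window = 3

# One big sentence: every sliding window of length 3 becomes a training subsequence.
sliding = [data[i:i + window] for i in range(len(data) - window + 1)]
# [[0,5,0], [5,0,4], [0,4,3], [4,3,3], [3,3,2], [3,2,1]]

# Independent sentences: non-overlapping chunks (the last one would need padding).
chunks = [data[i:i + window] for i in range(0, len(data), window)]
# [[0,5,0], [4,3,3], [2,1]]

# Either way, each subsequence becomes one sample of shape (timesteps, features):
X = np.array(sliding, dtype="float32").reshape(-1, window, 1)
print(X.shape)  # (6, 3, 1)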

So your last two guesses are very close to being correct.

The picture is from this post, which I highly recommend reading.

  • But the use-all-tuples approach from "When [0,5,0,4,3,3,2,1] is one big sentence" is rather strange. This would mean that my training takes exponentially more time for a longer sequence. But in real life, when I use Keras, my timesteps increase linearly and so does the time spent on training. Since each node is able to "memorize" the old input, even my first idea should be able to work on sequential data. – user312549 Apr 18 '18 at 09:47
  • This is not about what Keras does, it's about what data is fed in. In real life, the LSTM length is chosen to be high enough to process whole sentences, or the sentences get trimmed. – Maxim Apr 18 '18 at 09:52