4

I'm having some difficulty understanding the input-output flow of layers in stacked LSTM networks. Let's say I have created a stacked LSTM network like the one below:

# imports
from keras.models import Sequential
from keras.layers import LSTM

# parameters
time_steps = 10
features = 2
input_shape = [time_steps, features]
batch_size = 32

# model
model = Sequential()
model.add(LSTM(64, input_shape=input_shape, return_sequences=True))
model.add(LSTM(32, input_shape=input_shape))

where our stacked LSTM network consists of 2 LSTM layers with 64 and 32 hidden units respectively. In this scenario, we expect that at each time-step the 1st LSTM layer -LSTM(64)- will pass as input to the 2nd LSTM layer -LSTM(32)- a tensor of size [batch_size, time-step, hidden_unit_length], which would represent the hidden state of the 1st LSTM layer at the current time-step. What confuses me is:

  1. Does the 2nd LSTM layer -LSTM(32)- receive as X(t) (as input) the hidden state of the 1st layer -LSTM(64)-, which has the size [batch_size, time-step, hidden_unit_length], and pass it through its own hidden network, in this case consisting of 32 nodes?
  2. If the first is true, why is the input_shape of the 1st -LSTM(64)- and the 2nd -LSTM(32)- the same, when the 2nd only processes the hidden state of the 1st layer? Shouldn't input_shape in our case be set to [32, 10, 64]?

I found the LSTM visualization below very helpful (found here), but it doesn't expand on stacked LSTM networks: [image: LSTM workings]

Any help would be highly appreciated. Thanks!

Lio Chon
2 Answers

7

The input_shape is only required for the first layer. The subsequent layers take the output of the previous layer as their input (and so their input_shape argument value is ignored).

The model below

model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(5, 2)))
model.add(LSTM(32))

represents the architecture below:

[architecture diagram]

You can verify this from model.summary():

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_26 (LSTM)               (None, 5, 64)             17152     
_________________________________________________________________
lstm_27 (LSTM)               (None, 32)                12416     
=================================================================
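
As a rough sanity check on those parameter counts, you can redo the arithmetic by hand: a Keras LSTM with default settings has 4 * units * (input_dim + units + 1) trainable parameters (4 gates, each with an input kernel, a recurrent kernel and a bias). A minimal sketch:

# Keras LSTM parameter count: 4 gates, each with an input kernel,
# a recurrent kernel over the hidden state, and a bias
def lstm_params(input_dim, units):
    return 4 * units * (input_dim + units + 1)

print(lstm_params(input_dim=2, units=64))   # 17152 -> layer 1 sees the 2 raw features
print(lstm_params(input_dim=64, units=32))  # 12416 -> layer 2 sees the 64 hidden units

The 12416 only works out if the second layer's input dimension is 64, i.e. the hidden state coming out of the first layer, not the original 2 features.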

Replacing the line

model.add(LSTM(32))

with

model.add(LSTM(32, input_shape=(1000000, 200000)))

will still give you the same architecture (verify using model.summary()), because the input_shape is ignored: the layer takes as input the tensor output of the previous layer.
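
If you want to check this directly rather than read it off the summary, a small sketch along these lines (assuming the same keras imports as above; exact attribute access can differ slightly between Keras versions) prints what each layer is actually wired to:

from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(5, 2)))
# the input_shape below is ignored; the layer is wired to the (None, 5, 64) output above
model.add(LSTM(32, input_shape=(1000000, 200000)))

for layer in model.layers:
    # shapes are resolved from the actual connections, not from the argument
    print(layer.name, layer.input.shape, layer.output.shape)

The second layer reports an input shape of (None, 5, 64), i.e. the first layer's output, no matter what input_shape you hand it.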

And if you need a sequence-to-sequence architecture like the one below,

[sequence-to-sequence architecture diagram]

you should be using the code:

model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(5, 2)))
model.add(LSTM(32, return_sequences=True))

which should return the following model:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_32 (LSTM)               (None, 5, 64)             17152     
_________________________________________________________________
lstm_33 (LSTM)               (None, 5, 32)             12416     
=================================================================
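
To make the per-time-step output concrete, you can push a dummy batch through the sequence-to-sequence model above and check the shape that comes back (a quick sketch assuming NumPy and the model defined just above):

import numpy as np

# dummy batch: 8 samples, 5 time-steps, 2 features each
x = np.random.rand(8, 5, 2).astype("float32")

y = model.predict(x)
print(y.shape)  # (8, 5, 32): one 32-dimensional output vector per time-step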
mujjiga
  • this already helps, but how do you get from 64 to 32? Are you taking only the last 32 of the 64 outputs? And how does this work for a deep bidirectional LSTM? The process is only clear to me if you have the same number of LSTM units in each layer ... – S.Maria Oct 22 '20 at 15:36
  • @S.Maria An LSTM has an input size and an output/hidden size. Internally it uses a gating mechanism to convert inputs into hidden states/outputs. Here my input size is 64 and my output size is 32. In the case of a bi-LSTM you get two outputs, one for each direction, so if your output size is 32 you will get 32 for the left->right LSTM and 32 for the right->left LSTM. – mujjiga Oct 22 '20 at 17:44
  • I have a question: how is the interconnection between the first layer of 64 units and the second layer of 32 units made? – Nadhir Mar 31 '21 at 10:34
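
To make the gating explanation in the comments concrete, you can inspect the weight shapes of the second layer; this is a sketch assuming the two-layer model built in the answer above:

# Keras stores three weight arrays per LSTM layer:
#   kernel:           (input_dim, 4 * units) -> (64, 128), projects the 64-d input into the 4 gates
#   recurrent_kernel: (units, 4 * units)     -> (32, 128), projects the previous 32-d hidden state
#   bias:             (4 * units,)           -> (128,)
kernel, recurrent_kernel, bias = model.layers[1].get_weights()
print(kernel.shape, recurrent_kernel.shape, bias.shape)

So nothing is truncated: each gate applies a full 64-to-32 projection, and the 32-dimensional hidden state it produces is the layer's output (and the next layer's input).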
0

The Keras documentation mentions that the input is [batch_size, time-step, input_dim] rather than [batch_size, time-step, hidden_unit_length], so I think the 64 and 32 correspond to an X input with 64 features and an LSTM-32 input with 32 features at each time-step.

lynn