I would like to implement the model depicted in the following picture using Keras, but I have no idea how to do it.
If the input of the model is given as (batch, max_length_sentence, max_length_of_word), how would I need to implement it?
If I understand your question correctly, each training sample consists of multiple sentences, and each sentence consists of multiple words (it seems each training sample is the set of sentences of a text document). The first LSTM layer processes one sentence at a time, and once all the sentences have been processed, the sentence representations produced by the first LSTM layer are fed to the second LSTM layer.
To implement this architecture, you need to wrap the first LSTM layer inside a TimeDistributed
layer so that it processes each sentence individually. Then you can simply add another LSTM layer on top to process the outputs of the first LSTM layer. Here is a sketch:
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed

lstm1_units = 128          # units of the word-level (first) LSTM
lstm2_units = 64           # units of the sentence-level (second) LSTM
max_num_sentences = 10
max_num_words = 100
emb_dim = 256              # dimensionality of the word vectors

model = Sequential()
# The first LSTM is applied to each sentence independently and returns one
# vector of size lstm1_units per sentence.
model.add(TimeDistributed(LSTM(lstm1_units),
                          input_shape=(max_num_sentences, max_num_words, emb_dim)))
# With return_sequences=True the second LSTM emits one output per sentence.
model.add(LSTM(lstm2_units, return_sequences=True))
model.summary()
Model summary:
Layer (type) Output Shape Param #
=================================================================
time_distributed_4 (TimeDist (None, 10, 128) 197120
_________________________________________________________________
lstm_6 (LSTM) (None, 10, 64) 49408
=================================================================
Total params: 246,528
Trainable params: 246,528
Non-trainable params: 0
_________________________________________________________________
As you can see, since we have used return_sequences=True
for the second LSTM layer, it returns an output for each sentence (this is in accordance with the figure in your question). Further, note that we have assumed here that the words are already represented as word vectors (i.e. word embeddings). If that's not the case and your input consists of word indices, as the shape (batch, max_length_sentence, max_length_of_word) suggests, you can simply add an Embedding
layer (wrapped in a TimeDistributed
layer) as the first layer to map the words to embeddings; the rest stays the same.
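A minimal sketch of that variant, assuming the input now holds integer word indices of shape (batch, max_num_sentences, max_num_words) and a hypothetical vocab_size:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed

vocab_size = 20000         # hypothetical vocabulary size; set it to your own data
emb_dim = 256
lstm1_units, lstm2_units = 128, 64
max_num_sentences, max_num_words = 10, 100

model = Sequential()
# The Embedding layer turns word indices into word vectors; wrapped in
# TimeDistributed it is applied to each sentence, producing an output of
# shape (batch, max_num_sentences, max_num_words, emb_dim).
model.add(TimeDistributed(Embedding(vocab_size, emb_dim),
                          input_shape=(max_num_sentences, max_num_words)))
# The rest is the same as before.
model.add(TimeDistributed(LSTM(lstm1_units)))
model.add(LSTM(lstm2_units, return_sequences=True))
model.summary()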