I'm new to Keras. I am trying to implement this model https://www.aclweb.org/anthology/D15-1167 for document classification, and I want to use LSTM for getting sentence representation. I have trained vector representation separately with the skip-gram model on my dataset. now after converting each document to separate sentence and then converting each sentence to separate word and then converting each word to the corresponding integer in the dictionary, I have something for example like this for each document: [[54,32,13],[21,43,2]...[28,1,9]] which I should feed each sentence to an LSTM to get a sentence vector and after that I should feed each sentence vector to a diffrent LSTM on the higher layer in order to get a document representation and then apply classification to it. my problem is in the first layer. how should I feed each sentence simultaneously to each LSTM (therefore at each time step each LSTM should be applied to a word vector from each sentence)?
edit: I just used TimeDistributed and it seems like to work although I am not sure if it does what I want. I used time distributed wrapper over embeding layer and then over the first Lstm layer. this is the model that I have implemented (very simple one):
model.add(tf.keras.layers.TimeDistributed(embeding_layer))
model.add(tf.keras.layers.TimeDistributed
(layers.LSTM(50,activation=’relu’)))
model.add(layers.LSTM(50,activation=’relu’))
model.add(layers.Dense(1,activation=’sigmoid’))
Is my interpretation of the network correct? my interpretation : my input to the embedding layer is (document, sentences, words). I padded the document to have 30 sentences and I also padded the sentences to have at 200 words. I have 20000 documents so my input shape is (20000,30,200). after feeding it to the network it first go through emeding layer which is 300 length for each word vector. so after applying embeding layer to first docuemnt with shape (1.30,200), then I get (1,30,200,300) which would be the input for the timedistributed LSTM. then time distribut, will make 30 copy of LSTM layer with shared wights where each LSTM will output a sentece vector, and then the next LSTM will be applied to this 30 sentence vectors. am I right ?