
I'm new to Keras. I am trying to implement this model https://www.aclweb.org/anthology/D15-1167 for document classification, and I want to use an LSTM to get sentence representations. I have trained word vector representations separately with the skip-gram model on my dataset. Now, after splitting each document into sentences, splitting each sentence into words, and mapping each word to its integer index in the dictionary, each document looks, for example, like this: [[54,32,13],[21,43,2]...[28,1,9]]. I should feed each sentence to an LSTM to get a sentence vector, then feed the sentence vectors to a different LSTM in a higher layer to get a document representation, and then apply classification to it. My problem is in the first layer: how should I feed each sentence simultaneously to its own LSTM (so that at each time step each LSTM is applied to one word vector from its sentence)?
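For context, here is a rough sketch of how I turn one document into a fixed-size integer array (the pad lengths 30 and 200 are just the values I picked for my data, and pad_sequences is only one way to do the padding):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

doc = [[54, 32, 13], [21, 43, 2], [28, 1, 9]]   # one document: sentences of word indices

max_words = 200      # pad/truncate every sentence to 200 words
max_sentences = 30   # pad every document to 30 sentences

sent_matrix = pad_sequences(doc, maxlen=max_words, padding='post')   # shape (3, 200)
doc_matrix = np.zeros((max_sentences, max_words), dtype='int32')
doc_matrix[:sent_matrix.shape[0]] = sent_matrix                      # shape (30, 200)
# stacking all documents then gives an array of shape (num_docs, 30, 200)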

edit: I just used TimeDistributed and it seems to work, although I am not sure it does what I want. I used the TimeDistributed wrapper over the embedding layer and then over the first LSTM layer. This is the (very simple) model that I have implemented:

model = tf.keras.Sequential()
model.add(tf.keras.layers.TimeDistributed(embeding_layer))
model.add(tf.keras.layers.TimeDistributed(layers.LSTM(50, activation='relu')))
model.add(layers.LSTM(50, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

Is my interpretation of the network correct? My interpretation: my input to the embedding layer is (documents, sentences, words). I padded each document to 30 sentences and each sentence to 200 words. I have 20000 documents, so my input shape is (20000, 30, 200). After feeding it to the network, the input first goes through the embedding layer, which produces a 300-dimensional vector for each word. So after applying the embedding layer to the first document of shape (1, 30, 200), I get (1, 30, 200, 300), which is the input to the time-distributed LSTM. TimeDistributed then applies 30 copies of the LSTM layer with shared weights, each outputting a sentence vector, and the next LSTM is applied to these 30 sentence vectors. Am I right?
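To sanity-check those shapes, here is a small sketch (the vocabulary size of 20000 is just a placeholder) whose model.summary() should show (None, 30, 200, 300) after the embedding, (None, 30, 50) after the time-distributed LSTM, and (None, 50) after the document-level LSTM:

import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 20000     # placeholder vocabulary size
embedding_dim = 300    # word vector length from my skip-gram model

inputs = tf.keras.Input(shape=(30, 200))                                    # (sentences, words)
x = tf.keras.layers.TimeDistributed(layers.Embedding(vocab_size, embedding_dim))(inputs)
x = tf.keras.layers.TimeDistributed(layers.LSTM(50, activation='relu'))(x)  # shared sentence LSTM
x = layers.LSTM(50, activation='relu')(x)                                   # document-level LSTM
outputs = layers.Dense(1, activation='sigmoid')(x)

tf.keras.Model(inputs, outputs).summary()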

  • I created this model earlier this year; please take a look at [it](https://pastebin.com/g3AiYXVt) – ElSheikh Nov 01 '19 at 22:32
  • I am afraid that was not the answer I am looking for, and your model is different from the one in the paper. I want to feed each sentence of a document to an LSTM to get a sentence representation. If I have a document with 5 sentences, then I need 5 LSTMs in the first layer, so that in each LSTM, at each time step, one word vector is processed. – jalil asadi Nov 01 '19 at 22:55
  • @jalilasadi Just to clarify, are you saying that each sentence position should map to a specific LSTM? In other words, the first sentence in the document will always be fed to the first LSTM and the second sentence to the second LSTM etc. From looking over the paper, it wasn't super clear to me that the design actually had that. Another interpretation might be a single LSTM network (with N outputs) that each sentence is applied to which in turn creates a sequence of sentence representations that are fed to the higher level LSTM network. Does this view make sense? I hope this helps. – ad2004 Nov 04 '19 at 21:49
  • @ad2004 I just want to have an LSTM for each sentence in a document! Note that my samples are documents, so if I feed the first sample to the model, my input is like (1, sentences, words). I have a pretrained word embedding matrix which I use as the embedding layer. This embedding layer produces a vector of size 300, so after it my input is like (1, sentences, words, 300), which cannot be fed to a normal LSTM because a normal LSTM's input shape is (samples, steps, features). – jalil asadi Nov 04 '19 at 23:15

1 Answer


The example below might be what you are looking for, or at least point you in the right direction. It's a bit experimental on my part, but I believe it has the right structure. It was created in Google Colab with TensorFlow 2.0. The first section is there to make the processing reproducible, and the rest illustrates the general idea of using a TimeDistributed layer along with masking and padding. BTW - I believe this is a similar idea to what @El Sheikh (first comment above) was suggesting. Note: I used a SimpleRNN here, but I believe the idea applies to LSTMs as well. I hope this helps get you moving in the right direction.

%tensorflow_version 2.x
import numpy as np
import tensorflow as tf
import random as rn

# The below is necessary for starting Numpy generated random numbers
# in a well-defined initial state.

np.random.seed(42)

# The below is necessary for starting core Python generated random numbers
# in a well-defined state.

rn.seed(12345)

# Force TensorFlow to use single thread.
# Multiple threads are a potential source of non-reproducible results.
# For further details, see: https://stackoverflow.com/questions/42022950/

session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                        inter_op_parallelism_threads=1)

# The below tf.set_random_seed() will make random number generation
# in the TensorFlow backend have a well-defined initial state.
# For further details, see:
# https://www.tensorflow.org/api_docs/python/tf/set_random_seed

tf.compat.v1.set_random_seed(1234)

sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)

# The code above here is provided to make the below reproducible each time you
# run.

#
# Main code follows:

from tensorflow import keras
from tensorflow.keras import layers

# Input structure
#                Sentence1                   .....         SentenceM
#    Word11  Word21   Word31  ..... Wordn11          Word11   ....  WordnM1
#    Word12  Word22   Word32        Wordn12          Word12         WordnM2
#    Word13  Word23   Word33        Wordn13          Word13         WordnM3

# example parameters
word_vec_dimension = 3   # dimension of the embedding
sentence_representation = 4 # dimensionality of sentence vector

#
# This represents a single test document.
# Each row is a sentence and the words are represented by 3-dimensional
# integer vectors.
#
raw_inputs = [ [ [1, 5, 7], [2, 6, 7] ], 
               [ [9, 6, 3], [1, 8, 2], [4, 5, 9], [8, 2, 1] ],
               [ [1, 6, 2], [4, 2, 9] ],
               [ [2, 6, 2], [8, 2, 9] ],
               [ [3, 6, 2], [2, 2, 9], [1, 6, 2] ],
]

print(raw_inputs)
# Create the model
#
# Allow for variable number of words per sentence and variable number of 
# sentences:
# Input shape(num_samples, [SentenceCount], [WordCount], word_vector_dim)
# 
# Note:  Using None for Sentence Count, and None for Word count to allow
# for variable sequences length in both these dimensions.
#
inputs = keras.Input(shape=(None, None, word_vec_dimension), name='inputlayer')
x = tf.keras.layers.Masking(mask_value=0.0)(inputs)  # Force RNNs to ignore timesteps with zero vectors.
x = tf.keras.layers.TimeDistributed(layers.SimpleRNN(sentence_representation, 
                                                     use_bias=False, 
                                                     activation=None), 
                                                     name='TD1')(x)

outputs = x
# more layers here if needed:

model = tf.keras.Model(inputs=inputs, outputs=outputs, name='Sentiment')
model.compile(optimizer='rmsprop', loss='mse', metrics=['mse'])
model.summary()

# Set up the fitting call (numpy is already imported above).

x_train = raw_inputs  # use the dummy document above for testing
# pad_sequences puts zeros where there is no data; the Masking layer above
# makes the RNNs ignore those timesteps.
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(x_train,
                                                              padding='post')

print(x_train)
# Insert a dummy dimension 1 to represent the sample dimension.
padded_inputs = np.expand_dims(padded_inputs,axis=0)/1.0  # Make float type
print(padded_inputs)
print(padded_inputs.shape)

y_train = np.array([[ 1.0, 2.0, 3.0, 4.0 ]])
print(y_train.shape)

# Train model:
model.fit(padded_inputs,y_train,epochs=1)

print('get_weights:')
print(model.get_layer(name='TD1').get_weights())

print('get_predictions:')
print(model.predict(padded_inputs))
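As a rough follow-up sketch (untested on my end; the document vector size is arbitrary), the same structure with LSTMs, the higher-level document LSTM, and a sigmoid classifier on top might look like this:

doc_representation = 8   # arbitrary size for the document vector

inputs2 = keras.Input(shape=(None, None, word_vec_dimension))
x2 = tf.keras.layers.Masking(mask_value=0.0)(inputs2)
x2 = tf.keras.layers.TimeDistributed(
         layers.LSTM(sentence_representation), name='TD_LSTM')(x2)  # one vector per sentence
x2 = layers.LSTM(doc_representation)(x2)                            # document representation
outputs2 = layers.Dense(1, activation='sigmoid')(x2)

model2 = tf.keras.Model(inputs=inputs2, outputs=outputs2, name='SentimentLSTM')
model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model2.summary()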
– ad2004