
I am running an NLP experiment. I use Word2vec to obtain distributed vector representations of the input text, and I then feed these representations into different Machine Learning (ML) and Deep Learning (DL) algorithms to compare their performance. It is a binary classification task and I have the target labels.

Regarding the ML models, I have used the approach in this post, which basically calculates the average word vector per observation and feeds it as input to (in the post's case) a RandomForestClassifier.
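A minimal sketch of that averaging step, assuming the pretrained vectors live in a plain word-to-vector dictionary like the embeddings_index used further below (the helper name average_vector and the variable train_docs are mine):

import numpy as np

def average_vector(tokens, embeddings_index, dim=200):
    # Keep only the tokens that actually have a pretrained vector
    vectors = [embeddings_index[t] for t in tokens if t in embeddings_index]
    if not vectors:
        return np.zeros(dim)         # no known word: fall back to a zero vector
    return np.mean(vectors, axis=0)  # element-wise average over the sequence

# trainDataVectors = np.array([average_vector(doc, embeddings_index) for doc in train_docs])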

Regarding DL approaches such as CNNs or LSTMs, I have encountered implementations such as this one, in which the author constructs an embedding matrix that acts as a dictionary, returning the corresponding vector representation for each word in the input token sequence. The code is summarized as:

import numpy as np

num_words = 100000
embedding_matrix = np.zeros((num_words, 200))  # one 200-d row per word index

# tokenizer and embeddings_index come from the linked post:
# tokenizer.word_index maps words to integer indices,
# embeddings_index maps words to their pretrained vectors
for word, i in tokenizer.word_index.items():
    if i >= num_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

I am trying to implement an LSTM model that uses the average vector per observation (input token sequence). However, an error is thrown that I can't figure out how to resolve. In addition, I have seen this and especially this post, where the author has exactly the same issue. I tried the proposed answers, but I still can't make it work.

What I have tried so far:

  • Generating trainDataVectors and testDataVectors just as in the ML approaches and feeding them directly to the LSTM model, without the Embedding layer. This does not work: a shape mismatch error is thrown (see the sketch after this list).
  • I can't apply the answer from the last post because my syntax is different.
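For reference, my understanding of the shape mismatch is that a Keras LSTM layer expects 3-D input of shape (batch, timesteps, features), while the averaged vectors are 2-D. A toy illustration with placeholder data (shapes only, the numbers are made up):

import numpy as np

# One 200-d average vector per observation, as built for the ML models
trainDataVectors = np.random.rand(1000, 200)

print(trainDataVectors.shape)                      # (1000, 200) -> 2-D, rejected by the LSTM
# A single-timestep 3-D view would instead be:
print(trainDataVectors.reshape(-1, 1, 200).shape)  # (1000, 1, 200)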

My implementation for the LSTM goes like this:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()

model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    weights=[embedding_matrix],  # the pretrained vectors built above
                    input_length=max_tokens,
                    trainable=False,             # the layer is not trained
                    name='embedding_layer'))
model.add(LSTM(units=embedding_size, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(train_seq_pad, y_train,
                    validation_data=(test_seq_pad, y_test),
                    epochs=20, batch_size=30)

The embedding_matrix is constructed as shown above. train_seq_pad contains the input token sequences; its shape is [number_of_observations, max_token_length]. One instance of it looks like this:

array([   1,    2, 1481,   20,  795, 1073,    3,    9,   11, 1073,   91,
     10,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
      0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
      0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
      0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
      0,    0,    0,    0])

I pad each sequence with zeros at the end so that it reaches max_token_length when needed. The numbers are indices into the embedding matrix, so the model knows which word vectors to retrieve for each input sequence.
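For completeness, the padding step looks roughly like this, assuming Keras's pad_sequences and that train_texts holds the raw documents (the variable name is mine):

from keras.preprocessing.sequence import pad_sequences

# Map texts to integer index sequences, then zero-pad them at the end ('post')
train_seq = tokenizer.texts_to_sequences(train_texts)
train_seq_pad = pad_sequences(train_seq, maxlen=max_tokens, padding='post')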

I just want to modify the latter implementation so that it takes as input the average vector of each input token sequence, just as is done in the ML approaches.
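Conceptually, what I picture is averaging the embedded token vectors inside the model before the LSTM, something like the sketch below (just my idea of it, I don't know whether this is the right way; note that GlobalAveragePooling1D would also average over the zero padding):

from keras.models import Sequential
from keras.layers import Embedding, GlobalAveragePooling1D, Reshape, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    weights=[embedding_matrix],
                    input_length=max_tokens,
                    trainable=False))
# Average the embedded vectors over the time axis -> (batch, embedding_size)
model.add(GlobalAveragePooling1D())
# Re-introduce a single timestep so the LSTM still receives 3-D input
model.add(Reshape((1, embedding_size)))
model.add(LSTM(units=embedding_size, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))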

Any suggestions or hints?
