
I am new to deep learning. I am trying to build a very basic LSTM network on word-embedding features. I have written the following code for the model, but I am unable to run it.

from keras.layers import Dense, Dropout, Input, LSTM
from keras.models import Model


max_sequence_size = 14
classes_num = 2

LSTM_word_1 = LSTM(100, activation='relu', recurrent_dropout=0.25, dropout=0.25)
lstm_word_input_1 = Input(shape=(max_sequence_size, 300))
lstm_word_out_1 = LSTM_word_1(lstm_word_input_1)


merged_feature_vectors = Dense(50, activation='sigmoid')(Dropout(0.2)(lstm_word_out_1))

predictions = Dense(classes_num, activation='softmax')(merged_feature_vectors)

my_model = Model(inputs=[lstm_word_input_1], outputs=predictions)
print(my_model.summary())

The error I am getting is `ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (3019, 300)`. On searching, I found that people have used Flatten(), which compresses the 2-D features (3019, 300) for the dense layer, but I am still unable to fix the issue.
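For reference, a dummy batch of the shape the model seems to expect does run (random data, purely to illustrate the dimensions):

import numpy as np

# Input(shape=(14, 300)) means the model wants batches of shape (num_samples, 14, 300)
dummy_batch = np.random.rand(8, 14, 300)
print(my_model.predict(dummy_batch).shape)  # (8, 2)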

While explaining, kindly let me know how the dimensions work out.

Upon request:

My `X_training` had dimension issues, so I am providing the code below to clear up the confusion:

import numpy as np

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0.
    #
    # index2word is a list that contains the names of the words in
    # the model's vocabulary. Convert it to a set, for speed
    index2word_set = set(model.wv.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model.wv[word])
    #
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec, nwords)
    return featureVec
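For example (the two words here are hypothetical, assuming the gensim model is already loaded), this collapses a whole comment into a single vector:

vec = makeFeatureVec(["great", "question"], model, 300)
print(vec.shape)  # (300,) -- one vector per comment, the word/time axis is gone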

I think the following code is giving a 2-D numpy array, as I am initializing it that way:

def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2-D numpy array
    #
    # Initialize a counter
    counter = 0
    #
    # Preallocate a 2-D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")

    for review in reviews:

        if counter % 1000 == 0:
            print("Question %d of %d" % (counter, len(reviews)))

        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)

        counter = counter + 1
    return reviewFeatureVecs


def getCleanReviews(reviews):
    clean_reviews = []
    for review in reviews["question"]:
        clean_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, remove_stopwords=True ))
    return clean_reviews

My objective is just to use a pre-trained gensim model for an LSTM on some comments that I have.

trainDataVecs = getAvgFeatureVecs(getCleanReviews(train), model, num_features)
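Checking the result confirms the 2-D shape from the error message:

print(trainDataVecs.shape)  # (3019, 300): one averaged vector per comment, no time dimension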
  • How many samples do you have? It seems you are feeding the model with only one sample of shape `(3019, 300)` whereas the training data passed to `fit` method must have a shape of `(num_samples, num_steps, 300)` in this case. – today Sep 26 '18 at 23:30
  • I have 3019 comments. I am using word2vec and getting the features as a 1-D array of dimension 300; that is why it is showing (3019, 300). I am not sure what the time steps are or how to get that number. Do I need to reshape the matrix? – amy Sep 26 '18 at 23:49
  • How many words are there in each comment? 14? So the training data must have a shape of `(3019, 14, 300)`. – today Sep 27 '18 at 00:05
  • It varies, but yes, on average I have 14. So you are saying that I need to reshape my `X` while fitting? – amy Sep 27 '18 at 00:16
  • I tried to reshape but it says `trainDataVecs=trainDataVecs.reshape(3019,14,300) ValueError: cannot reshape array of size 905700 into shape (3019,14,300)`. Should I add embedding layer instead? – amy Sep 27 '18 at 00:22
  • You mentioned you have already used word2vec?! So there is no need to use an Embedding layer. I can't understand why the shape of your training data is `(3019,300)`? You have 3019 samples of shape 300?!! Each sample is a sentence so it has multiple words and each word is represented with a vector of length 300, therefore the shape of **each sample** must be `(num_words_in_a_sentence, 300)`?! – today Sep 27 '18 at 00:33
  • Could you add the data preparation code as well? – today Sep 27 '18 at 00:35
  • Ok, I have added the code. If you want `KaggleWord2VecUtility` file as well, let me know. But I think the issue is with `getAvgFeatureVecs` as I am initializing it to be 2-D array. Hence, I am getting that. – amy Sep 27 '18 at 01:22
  • I am using gensim model as `model`. – amy Sep 27 '18 at 01:28
  • If you are taking the average of all the word vectors in a sentence, then the result is not a sequence, so you cannot feed it to an LSTM layer. You can either use a Dense layer instead, or skip the averaging and feed the word vectors as they are to the LSTM layer (see the sketch after these comments). – today Sep 27 '18 at 08:28
  • Yeah, I think I need to convert the text to sequences for training, because right now I am representing each whole comment in 300-dimensional space, so it is `(3019, 300)`. Is there any other way to implement it in my code? – amy Sep 27 '18 at 17:08
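Following that suggestion, a minimal sketch of building per-word sequences instead of averages might look like this (comments_to_sequences is a hypothetical helper; it assumes a gensim model whose vectors live in model.wv):

import numpy as np

def comments_to_sequences(comments, model, max_len=14, num_features=300):
    # One row per comment and one word vector per time step,
    # zero-padded/truncated to max_len words: shape (num_comments, max_len, num_features)
    data = np.zeros((len(comments), max_len, num_features), dtype="float32")
    for i, words in enumerate(comments):
        known = [w for w in words if w in model.wv]
        for t, word in enumerate(known[:max_len]):
            data[i, t] = model.wv[word]
    return data

trainDataSeqs = comments_to_sequences(getCleanReviews(train), model)
print(trainDataSeqs.shape)  # (3019, 14, 300) -- a valid 3-D input for the LSTM above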

1 Answer


You should try using an Embedding layer before the LSTM layer. Also, since you have pre-trained 300-dimensional vectors, you can initialize the weights of the Embedding layer with that matrix.

inp_layer = Input((maxlen,))
x = Embedding(max_features, embed_size, weights=[trainDataVecs])(inp_layer)
x = LSTM(50, dropout=0.1)(x)

Here, maxlen is the maximum length of your comments, max_features is the vocabulary size (number of unique words) of your dataset, and embed_size is the dimensionality of your vectors, which is 300 in your case.

Note that the shape of trainDataVecs should be (max_features, embed_size), so if you have pre-trained word vectors loaded into trainDataVecs, this should work.
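For completeness, a minimal sketch of that preparation, assuming texts holds the raw comment strings and model is the loaded gensim model (variable names here are illustrative, not from the question):

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

maxlen, embed_size = 14, 300

# Map each comment to a sequence of integer word indices, padded to maxlen
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=maxlen)

# Build the (max_features, embed_size) matrix the Embedding layer expects,
# copying the pre-trained vector for every word the gensim model knows
max_features = len(tokenizer.word_index) + 1  # +1 for the padding index 0
embedding_matrix = np.zeros((max_features, embed_size), dtype="float32")
for word, idx in tokenizer.word_index.items():
    if word in model.wv:
        embedding_matrix[idx] = model.wv[word]

The Embedding layer then maps each integer index in X to its 300-d row of embedding_matrix, so the LSTM receives one (maxlen, 300) sequence per comment.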

  • max_features is the vocabulary size. My `trainDataVecs` has dimensions `(3019, 300)`, and 3019 is the number of comments. So embed_size is okay, but I don't think I have the correct max_features for `trainDataVecs`. As you can see from my code, I think I am representing each comment in 300-d space, which is incorrect; I should represent words, not sentences. – amy Sep 27 '18 at 17:14
  • Exactly, yes. Instead of averaging out the vectors to obtain a vector for the whole comment, just pass the 2-D numpy array of word vectors to the `Embedding` layer as `trainDataVecs`. – Ankit Paliwal Sep 27 '18 at 17:53