
I am having some problems understanding how to retrieve the predictions from a Keras model.

I want to build a simple system that predicts the next word, but I don't know how to output the complete list of probabilities for each word.

This is my code right now:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Bidirectional, Dense
from keras.optimizers import RMSprop

model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=55, weights=[pretrained_weights]))
model.add(Bidirectional(LSTM(units=embedding_size)))
model.add(Dense(23690, activation='softmax'))  # 23690 is the total number of classes

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=0.0005),
              metrics=['accuracy'])

# fit network
model.fit(np.array(X_train), np.array(y_train), epochs=10)
score = model.evaluate(x=np.array(X_test), y=np.array(y_test), batch_size=32)
prediction = model.predict(np.array(X_test), batch_size=32)

First question: my training set is a list of sentences (vectorized and transformed to indices). I saw some examples online where people divide the data into X_train and y_train like this:

X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

Should I instead transform X_train and y_train so that I have sliding sequences, where for example I have

X = [[10, 9, 4, 5]]
X_train = [[10, 9], [9, 4], [4, 5]]
y_train = [[9], [4], [5]]

Second question: Right now the model returns only one element for each input. How can I return the predictions for each word? I want to be able to have an array of output words for each word, not a single output. I read that I could use a TimeDistributed layer, but I have problems with the input, because the Embedding layer takes a 2D input, while the TimeDistributed takes a 3D input.

Thank you for the help!

user9355680

1 Answer


For what you're asking, I don't think a Bidirectional network would be good. (The reverse direction would be trying to predict something that does not appear at the end, but before the beginning, and I believe you're going to want to take the output and make it an input and keep predicting further, right?)

So, first, remove the Bidirectional from your model, keep only the LSTM.

Keras recurrent layers may output only the last step, or, if you set return_sequences=True, output all steps.

So, the trick is adjusting both the data and the model like this:

  • In the LSTM layers, add return_sequences=True. (Your output will be entire sentences)
  • Make y be the entire sentences one step ahead of X: X, y = sequences[:, :-1], sequences[:, 1:]

Just be aware that this will make your output 3D. If you're interested only in the last word, you can manually take it from the output: lastWord = outputs[:,-1]
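
A minimal sketch of what that could look like, reusing the names from the question (vocab_size, embedding_size, pretrained_weights, sequences); the switch to sparse_categorical_crossentropy is my own choice so the integer targets from sequences[:, 1:] can be used directly instead of one-hot encoding 23690 classes at every step:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.optimizers import RMSprop

# Inputs are the sentences without their last word; targets are the same
# sentences shifted one step ahead.
X, y = sequences[:, :-1], sequences[:, 1:]

model = Sequential()
model.add(Embedding(vocab_size, embedding_size, weights=[pretrained_weights]))
model.add(LSTM(units=embedding_size, return_sequences=True))  # one output per time step
model.add(Dense(vocab_size, activation='softmax'))            # word probabilities at every step

# Sparse targets avoid building a huge one-hot tensor; depending on the Keras
# version, y may need an extra axis: y = y[..., None]
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=RMSprop(lr=0.0005),
              metrics=['accuracy'])

model.fit(X, y, epochs=10)

probs = model.predict(X)   # shape: (samples, time_steps, vocab_size)
last_word = probs[:, -1]   # distribution over the word that follows the last input step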


About sliding windows: don't use them. They totally defeat the purpose of LSTMs, which is learning long sequences. (OK, this statement may be exaggerated; you might want to use sliding windows for faster training if your sequences are too long, but for sentences you probably need all the words, otherwise the context is lost.)

About TimeDistributed layers: only use them when you want to add an extra time dimension. Since LSTMs already use a time dimension, you're OK without a TimeDistributed. If you wanted, for instance, to process an entire text, going sentence by sentence and, inside each sentence, word by word, you could try something with two time dimensions, as in the sketch below.
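
A rough illustration of that two-time-dimension idea (my own sketch, not something from the question; sentences_per_text and words_per_sentence are hypothetical shape names):

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, TimeDistributed

# Hypothetical input: each sample is a whole text, organized as
# (sentences_per_text, words_per_sentence) word indices.
text_input = Input(shape=(sentences_per_text, words_per_sentence), dtype='int32')

x = TimeDistributed(Embedding(vocab_size, embedding_size))(text_input)  # (texts, sentences, words, emb)
x = TimeDistributed(LSTM(units=embedding_size))(x)                      # one vector per sentence
x = LSTM(units=embedding_size)(x)                                       # one vector per text

model = Model(text_input, x)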

About predicting indefinitely into the future: for that, you'd have to use stateful=True LSTM layers, and create manual loops that get the last output step and feed it as an input for taking one more step.
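
Something along these lines (a sketch under the assumption that stateful_model was built as a stateful LSTM model with batch_input_shape=(1, 1), taking one word index per step; seed_index and n_steps are placeholder names):

import numpy as np

stateful_model.reset_states()            # start a fresh sequence

current = np.array([[seed_index]])       # shape (1, 1): one sample, one time step
generated = [seed_index]

for _ in range(n_steps):                 # how many future words to predict
    probs = stateful_model.predict(current)          # distribution over the vocabulary
    next_index = int(np.argmax(probs.reshape(-1)))   # greedy pick; sampling also works
    generated.append(next_index)
    current = np.array([[next_index]])   # feed the prediction back as the next input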

Daniel Möller
  • Thank you for your great answers. If I have sequences (about 2 million), each with a different number of timesteps, some just 5 and some 50/60/100/200 up to 500: for predicting the next item in a sequence, would it be okay to trim sequences to `60 max length` with `post/pre padding` and just take the last item `seq_1[:-1] as target` (the target will be a different timestep for each, like the 6th, 7th, 8th for a few, and the 58th, 59th, 60th for others)? Will taking the last item be sufficient to learn the next items, or should I make every sequence an n-gram/sliding window? Thank you. – A.B Nov 06 '20 at 19:01
  • @A.B, I think you should go with sliding windows in this case. Warning: sliding windows are incompatible with `stateful` for continuously predicting the next element. – Daniel Möller Nov 09 '20 at 05:39
  • Thank you for your reply @Daniel. I don't think I have an idea of how stateful works or how to decide about it; can you refer me to an in-depth guide to LSTMs that goes beyond the basics? – A.B Nov 09 '20 at 08:12
  • @A.B, see my answer: https://stackoverflow.com/questions/38714959/understanding-keras-lstms/50235563#50235563 – Daniel Möller Nov 09 '20 at 13:16
  • Thank you for sharing the link, wonderful answer. I am just confused because you said in this answer that `They totally defeat the purpose of LSTMs which is learning long sequences.`, so what I don't understand is: is my case different? – A.B Nov 09 '20 at 20:18
  • @A.B, LSTMs have the ability to learn long sequences, but "how long" is something not well known. If you think the first element of your 500-long sequence is important for the prediction of the last one, keep the sequence entire. If you think the first elements are not important, and that the last can be predicted from a few previous ones, then you can make windows. --- In your case, since you have very long sequences together with very short ones, windowing could spare processing and avoid unnecessary (and maybe harmful) calculations on huge paddings. – Daniel Möller Nov 09 '20 at 20:22
  • Thank you for the detailed response, very helpful indeed. I converted each sequence to sliding windows of size 10 (now each sequence spans multiple sequences, 10+, depending on the size of the original seq). Is it necessary that one original sequence that spans, for example, 10 sequences should be in the same batch and order (no shuffle etc.) to be able to learn? Also, I am seeing overfitting on a very simple network as well. If I break sequences longer than 60 into separate 60-length sequences instead of sliding, will it have the same effect, or will it only learn to predict the 60th element? – A.B Nov 10 '20 at 10:32
  • I posted my initial question on Stack Overflow too; I would be very thankful if you could have a look at it as well: https://stackoverflow.com/questions/64750834/keras-lstm-predicting-next-item-taking-whole-sequences-or-sliding-window-will – A.B Nov 10 '20 at 10:32