
I have trained an NLP model in Keras with an RNN to classify tweets, using Stanford GloVe word embeddings as the feature representation. I would like to apply this trained model to newly extracted tweets. However, this error appears when I try to apply the model to the new data.

ValueError: Error when checking input: expected input_1 to have shape (22,) but got array with shape (51,)

I then realised that the trained model expects a 22-dimensional input vector (the maximum tweet length in the training set), whereas the new dataset I would like to apply the model to produces a 51-dimensional input vector (the maximum tweet length in the new dataset).
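For reference, a minimal sketch of how the new tweets would have to be padded to the training length rather than their own maximum length (assuming a Keras Tokenizer named tokenizer fitted on the training tweets and a list of new tweet strings new_tweets; the names are illustrative):

from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 22  # maximum tweet length seen during training

# tokenise the new tweets with the tokenizer fitted on the training data,
# then pad/truncate every sequence to the training length so the input shape matches
new_seqs = tokenizer.texts_to_sequences(new_tweets)
new_X = pad_sequences(new_seqs, maxlen=max_len)

predictions = model.predict(new_X)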

In an attempt to tackle this, I increased the size of the input vector when training the model to 51 so both would match. A new error popped up:

InvalidArgumentError:  indices[29,45] = 5870 is not in [0, 2489)

So I tried applying the model back to the training dataset to see whether this was possible in the first place with the original parameters and model. It was not, and a very similar error appeared:

InvalidArgumentError:  indices[23,11] = 2489 is not in [0, 2489)

In this case, how can I export an end-to-end NLP RNN classification model and apply it to new, unseen data? (FYI: I was able to do this successfully for Logistic Regression with TF-IDF features. There just seem to be a lot of issues with the word embeddings.)

===========

UPDATE:
I was able to solve this issue by pickling not only the model, but also variables such as max_len, the texttotensor_instance and the tokenizer. When applying the model to new data, I have to reuse the same objects generated from the training data (instead of redefining them with the new data).
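A rough sketch of that setup, assuming the preprocessing state consists of a fitted tokenizer and a max_len value; the file names and variable names are illustrative:

import pickle
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# at training time: save the model and the preprocessing state together
model.save("tweet_rnn.h5")
with open("preprocessing.pkl", "wb") as f:
    pickle.dump({"tokenizer": tokenizer, "max_len": max_len}, f)

# at inference time: load everything back and reuse it on the new tweets
model = load_model("tweet_rnn.h5")
with open("preprocessing.pkl", "rb") as f:
    prep = pickle.load(f)

seqs = prep["tokenizer"].texts_to_sequences(new_tweets)
X_new = pad_sequences(seqs, maxlen=prep["max_len"])
preds = model.predict(X_new)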

Goh Jia Yi

1 Answer


Your error arises because the word indices produced for your data exceed the vocabulary size set in the Embedding layer (its input_dim).

It seems that the input_dim parameter of your Embedding layer is set to 2489, while your tokenizer maps some words in your dataset to higher indices (up to 5870).

Also, don't forget to add one to the maximum number of words when you set this in the Embedding layer (input_dim=max_number_of_words + 1). If you're interested in why, check this question: Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2?
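For example, a minimal sketch of deriving input_dim from the tokenizer's vocabulary (the variable names are illustrative, and the output dimension of 100 is just an example GloVe size):

from tensorflow.keras.layers import Embedding

# vocabulary size derived from the tokenizer fitted on the training tweets;
# +1 because index 0 is reserved for padding and word_index starts at 1
vocab_size = len(tokenizer.word_index) + 1

embedding_layer = Embedding(
    input_dim=vocab_size,   # must be larger than any index the tokenizer can produce
    output_dim=100,         # e.g. 100-dimensional GloVe vectors
    input_length=max_len,
)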

Minions