
As part of my thesis, I am trying to build a recurrent neural network language model.

From theory, I know that the input layer should be a one-hot vector layer with a number of neurons equal to the number of words in our vocabulary, followed by an embedding layer, which, in Keras, apparently translates to a single Embedding layer in a Sequential model. I also know that the output layer should be the size of our vocabulary, so that each output value maps 1-1 to a vocabulary word.
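For concreteness, here is a minimal sketch of the kind of model I mean (a sketch only; the sizes are placeholders, and I am leaving the vocabulary-size question open):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    vocab_size = 5000    # placeholder: number of words in the vocabulary
    embedding_dim = 100  # placeholder: embedding dimension
    seq_length = 20      # placeholder: length of the input word-index sequences

    model = Sequential([
        # Integer word indices go in; the Embedding layer stands in for the
        # explicit one-hot input layer described in the theory.
        Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                  input_length=seq_length),
        LSTM(128),
        # One output per vocabulary word, softmax over the vocabulary.
        Dense(vocab_size, activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam')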

However, in both the Keras documentation for the Embedding layer (https://keras.io/layers/embeddings/) and in this article (https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/#comment-533252), the vocabulary size is arbitrarily augmented by one for both the input and the output layers! Jason explains that this is due to the implementation of the Embedding layer in Keras, but that doesn't explain why we would also use +1 neuron in the output layer. I am at the point of wanting to order the possible next words by their probabilities, and I have one probability too many that I do not know which word to map to.

Does anyone know what the correct way of achieving the desired result is? Did Jason just forget to subtract one from the output layer, and does the Embedding layer just need a +1 for implementation reasons (I mean, it's stated in the official API)?

Any help on the subject would be appreciated (why is Keras API documentation so laconic?).

Edit:

This post, "Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2?", made me think that Jason does in fact have it wrong and that the size of the vocabulary should not be incremented by one when our word indices are: 0, 1, ..., n-1.

However, when using Keras's Tokenizer, our word indices are: 1, 2, ..., n. In this case, which is the correct approach (a small sketch of the Tokenizer's indexing follows these options)?

  1. Set mask_zero=True to treat 0 differently (since a 0 integer index is never fed into the Embedding layer), and keep the vocabulary size equal to the number of vocabulary words (n)?

  2. Set mask_zero=True but augment the vocabulary size by one?

  3. Not set mask_zero=True and keep the vocabulary size the same as the number of vocabulary words?
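Here is the sketch mentioned above, showing the index range Keras's Tokenizer actually produces (the corpus is a placeholder):

    from tensorflow.keras.preprocessing.text import Tokenizer

    corpus = ["the cat sat on the mat", "the dog sat on the log"]  # placeholder corpus

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(corpus)

    # Word indices start at 1 (most frequent word first); index 0 is never
    # assigned to a word, it is reserved for padding/masking.
    print(tokenizer.word_index)   # e.g. {'the': 1, 'sat': 2, 'on': 3, ...}

    n = len(tokenizer.word_index)
    # With indices 1..n, an Embedding layer needs input_dim of at least n + 1,
    # because index n must be smaller than input_dim.
    print(n)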

Michael

1 Answer


The reason we add +1 is the possibility of encountering an unseen word (one outside our vocabulary) during testing or in production. It is common to use a generic token for those unknown words, and that is why we add an OOV (out-of-vocabulary) word that stands in for all out-of-vocabulary words. Check this issue on GitHub, which explains it in detail:

https://github.com/keras-team/keras/issues/3110#issuecomment-345153450
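A rough sketch of that setup (the corpus, the token string and the sizes are placeholders, not taken from Jason's tutorial):

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.layers import Embedding

    corpus = ["the cat sat on the mat"]  # placeholder corpus

    # Reserve a generic token for out-of-vocabulary words; the Tokenizer
    # gives it index 1 and the real words get indices 2, 3, ...
    tokenizer = Tokenizer(oov_token="<unk>")
    tokenizer.fit_on_texts(corpus)

    # word_index holds the OOV token plus the real words, indexed from 1,
    # and index 0 stays reserved for padding, hence the +1 on input_dim.
    vocab_size = len(tokenizer.word_index) + 1
    embedding = Embedding(input_dim=vocab_size, output_dim=100)

    # Unseen words at test time map to the OOV index instead of being dropped.
    print(tokenizer.texts_to_sequences(["the cat chased the dog"]))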

  • Actually, using Keras's tokenizer, if I want to support OOV words, it assigns them an index (>= 1, specifically index 1). So I don't think that in Jason's model the extra output and input neuron correspond to OOV words, unless we (unofficially) treat index 0 as the index of the OOV words, which would mean that unknown words during training and inference have to be given index zero manually (which I don't think Jason does). – Michael May 04 '20 at 17:55
  • That is doable though. I'll just have to implement the tokenizer's texts_to_sequences() myself, which isn't really that complicated. I think I'm going to try this so that indices start at 0 and it's not confusing. Thank you! Also a very helpful link. – Michael May 04 '20 at 18:40
  • Yes, tokenizer text to sequence and sequence to text is really handy. –  May 04 '20 at 19:16
  • OK, this makes sense if we have OOV words in our training data. But a thought that occurred to me is this: if I don't train using OOV tokens, then why would we want our model to predict an OOV word as the next word (and would it ever)? – Michael May 04 '20 at 19:42
  • Another troubling thought is that Keras's documentation states that the Embedding layer takes POSITIVE integers as input. Does this mean it does not accept zero? Now I am confused again. Edit: Never mind, it's probably a badly phrased "non-negative integers", as I can infer from mask_zero's description... – Michael May 04 '20 at 19:49
  • Yes positive including 0. –  May 04 '20 at 19:59
  • Hi, if this answer solved your problem could you please mark it as accepted by clicking the check mark on its side? Thanks. –  May 04 '20 at 20:21
  • Yes, I think my results now make way more sense. I can also manually replace infrequent words with the OOV token by removing such words from the tokenizer's word_index and index_word using its word_counts dictionary (rough sketch below). – Michael May 05 '20 at 08:36
  • That's great. Good luck. –  May 05 '20 at 14:45
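A rough sketch of the pruning described in Michael's last comment (the threshold and corpus are placeholders; the pruning logic is only an illustration):

    from tensorflow.keras.preprocessing.text import Tokenizer

    corpus = ["the cat sat on the mat", "the dog sat on the log"]  # placeholder corpus
    min_count = 2  # placeholder frequency threshold

    tokenizer = Tokenizer(oov_token="<unk>")
    tokenizer.fit_on_texts(corpus)

    # Remove infrequent words from the index so that texts_to_sequences()
    # maps them to the OOV index instead of their own index.
    rare = [w for w, c in tokenizer.word_counts.items() if c < min_count]
    for w in rare:
        idx = tokenizer.word_index.pop(w, None)
        if idx is not None:
            tokenizer.index_word.pop(idx, None)

    # Note: this leaves gaps in the index range, so size the Embedding layer
    # by the maximum remaining index rather than by len(word_index).
    print(tokenizer.texts_to_sequences(corpus))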