
In this code snippet from the TensorFlow tutorial "Basic text classification":

# max_features and embedding_dim are defined earlier in the tutorial.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
  layers.Embedding(max_features + 1, embedding_dim),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(1)])

As far as I understand, max_features is the size of the vocabulary (with index 0 for padding and index 1 for OOV).

Also, as an experiment I set layers.Embedding(max_features, embedding_dim), and the tutorial still runs through successfully (screenshot below).

So why do we need input_dim=max_features + 1 here?
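The off-by-one at stake can be sketched in plain Python, independent of TensorFlow. Assuming the tutorial's vectorizer caps the vocabulary at max_features tokens (including index 0 = padding and index 1 = OOV), the indices it emits run from 0 to max_features - 1, and an Embedding layer only needs one row per possible index:

```python
# Sketch of the off-by-one question (not from the tutorial).
# An Embedding(input_dim, ...) layer can only look up indices
# 0 .. input_dim - 1, so what matters is the largest index the
# vectorizer actually emits.

max_features = 10000
emitted_indices = range(max_features)   # 0 .. 9999, assuming the
                                        # vocabulary is capped at
                                        # max_features tokens
largest_index = max(emitted_indices)    # 9999

needed_input_dim = largest_index + 1    # 10000, i.e. max_features
print(needed_input_dim)
```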

[screenshot: model training output]

Splash
  • Does this answer your question? [Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2?](https://stackoverflow.com/questions/43227938/keras-embedding-layer-masking-why-does-input-dim-need-to-be-vocabulary-2) – Innat Mar 20 '21 at 07:06
  • Thanks for your suggestion. I've read that question. First, it is a little dated, since Keras has updated its documentation. My understanding is that we set input_dim=|vocabulary| + 1 if mask_zero=True. But that is not the case in my question: the tutorial example doesn't enable masking, because it simply connects the Embedding layer to a Dense layer. – Splash Mar 20 '21 at 07:14

3 Answers


Vocabulary Size = Maximum Integer Index + 1

Example:
a[0] = 'item 1'
a[1] = 'item 2'
a[2] = 'item 3'
...
Maximum Integer Index = 2
Vocabulary Size = 3
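The relation above can be sketched in runnable Python, with a plain list of rows standing in for the Embedding weights: a table with input_dim rows accepts indices 0 .. input_dim - 1, so input_dim must equal the maximum integer index plus one.

```python
def make_table(input_dim, embedding_dim):
    # Row i holds the vector [i, i, ...] so lookups are easy to check.
    return [[float(i)] * embedding_dim for i in range(input_dim)]

vocab = ['item 1', 'item 2', 'item 3']   # indices 0, 1, 2
max_index = len(vocab) - 1               # 2
vocab_size = max_index + 1               # 3

table = make_table(vocab_size, embedding_dim=4)
print(table[max_index])                  # [2.0, 2.0, 2.0, 2.0]
# table[max_index + 1] would raise IndexError: there is no row 3.
```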

MasterOne Piece

I have the same question. I think they made a mistake here: they may originally have intended this for an RNN with padding, where 0 is not part of the vocabulary, so max_features + 1 would be the input dimension.
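If that reading is right (index 0 reserved for padding only, real tokens indexed 1 .. max_features), the arithmetic would look like this. This is a hypothetical sketch of the indexing scheme this answer describes, not necessarily what the tutorial's vectorizer actually does:

```python
# Hypothetical scheme: index 0 is padding only; real vocabulary
# items are assigned indices 1 .. max_features.

max_features = 10000
token_indices = range(1, max_features + 1)   # 1 .. 10000

largest_index = max(token_indices)           # 10000
input_dim = largest_index + 1                # 10001 == max_features + 1
print(input_dim)
```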

Jifu Zhao

The example is very misleading - arguably wrong, though the example code doesn't actually fail in that execution context.

The embedding layer's input dimension, per the Embedding layer documentation, is the maximum integer index + 1, not the vocabulary size + 1, which is what the author of that example used in the code you cite.


In my toy example (screenshots not reproduced here), you can see how the 0-based integer index works out:
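A hedged reconstruction of such a toy example, with hypothetical values (the original screenshots are lost):

```python
import numpy as np
import tensorflow as tf

# Three vocabulary items -> valid 0-based indices are 0, 1, 2,
# so input_dim must be (maximum integer index) + 1 = 3.
emb = tf.keras.layers.Embedding(input_dim=3, output_dim=4)

out = emb(np.array([[0, 1, 2]]))   # highest valid index is 2
print(out.shape)                   # (1, 3, 4)

# Looking up index 3 would be out of range for input_dim=3: on CPU
# it raises an error, while on some GPUs it can silently return
# undefined values, which is why the mistake is easy to miss.
```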

Frankly, it looks like the writer just got lucky because he was using the Sequential model type and didn't need to serialize the model. In this special case, the example code worked.

grim_trigger