
In the Keras docs for Embedding https://keras.io/layers/embeddings/, the explanation given for mask_zero is

mask_zero: Whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal |vocabulary| + 2).

Why does input_dim need to be 2 + number of words in vocabulary? Assuming 0 is masked and can't be used, shouldn't it just be 1 + number of words? What is the other extra entry for?

Nigel Ng
  • The docs have been updated: if `mask_zero=True`, then `input_dim` should equal the vocabulary size + 1, the extra entry being the special zero mask. – Ali Abul Hawa May 28 '18 at 10:46

2 Answers


I believe the docs are a bit misleading there. In the normal case you map your n input data indices [0, 1, 2, ..., n-1] to vectors, so input_dim should equal the number of elements you have:

input_dim = len(vocabulary_indices)

An equivalent (but slightly confusing) way to say this, and the way the docs do, is to say

1 + maximum integer index occurring in the input data.

input_dim = max(vocabulary_indices) + 1
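For contiguous 0-based indices the two formulations agree, which can be checked with a toy vocabulary (the words and indices below are purely illustrative):

```python
# Toy 3-word vocabulary, indexed contiguously from 0.
vocabulary = {"the": 0, "cat": 1, "sat": 2}
vocabulary_indices = list(vocabulary.values())

# Both ways of computing input_dim give the same number
# when the indices are contiguous and start at 0.
input_dim = len(vocabulary_indices)             # 3
assert input_dim == max(vocabulary_indices) + 1
```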

If you enable masking, the value 0 is treated specially, so you shift your n indices up by one, to [1, 2, ..., n], and reserve index 0 for the mask. You therefore need

input_dim = len(vocabulary_indices) + 1

or alternatively

input_dim = max(vocabulary_indices) + 2
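The shift can be made concrete with the same kind of toy example (indices are illustrative): the original 0-based indices move up by one so that 0 is free for masking, and both formulas above then count the reserved slot:

```python
# Original 0-based vocabulary indices, before masking is considered.
vocabulary_indices = [0, 1, 2]

# With mask_zero=True, shift every index up by one; 0 is reserved for the mask.
shifted = [i + 1 for i in vocabulary_indices]      # [1, 2, 3]

input_dim = len(vocabulary_indices) + 1            # 4
assert input_dim == max(shifted) + 1               # 1 + largest index actually used
assert input_dim == max(vocabulary_indices) + 2    # the "+2" formulation
```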

The docs become especially confusing here as they say

(input_dim should equal |vocabulary| + 2)

where I would interpret |x| as the cardinality of a set (equivalent to len(x)), but the authors seem to mean

2 + maximum integer index occurring in the input data.

Nils Werner
  • I see. That makes sense. Thanks for the thorough answer! I wonder if we could submit a pull request for the Keras documentation to make it less misleading. – Nigel Ng Apr 05 '17 at 10:58
  • @NilsWerner So in this example, is `max(vocabulary_indices)` the number of distinct words, or the total number of words in your data set? i.e., if `vocabulary_indexes=[1,2,...i,i,...,n-1,n]` is `input_dim=n+1` because `n` is `max(vocabulary_indexes)` or is `input_dim=n+2` because `i` occurs twice in `vocabulary_indexes` so `input_dim=len(vocabulary_indices)+1=(n+1)+1=n+2`? – mickey Oct 29 '19 at 13:42

Because input_dim is already the vocabulary size + 1 (it is defined as 1 + the maximum integer index occurring in the input data), you add another +1 for the reserved 0 index, which gives the +2.

input_dim: int > 0. Size of the vocabulary, ie. 1 + maximum integer index occurring in the input data.
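This reading of the definition can be sketched as plain arithmetic (the vocab_size value is illustrative; note that the accepted answer interprets the definition slightly differently):

```python
vocab_size = 5                       # illustrative vocabulary size
input_dim = 1 + vocab_size           # the "+1" already present in the definition
input_dim_masked = input_dim + 1     # another +1 for the reserved 0 index
assert input_dim_masked == vocab_size + 2
```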

Paul