
The docs for an Embedding Layer in Keras say:

Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]

I believe this could also be achieved by encoding the inputs as one-hot vectors of length vocabulary_size, and feeding them into a Dense Layer.

Is an Embedding Layer merely a convenience for this two-step process, or is something fancier going on under the hood?

Maxim
Imran

3 Answers


An embedding layer is faster, because it is essentially the equivalent of a dense layer that makes simplifying assumptions.

Imagine a word-to-embedding layer with these weights:

w = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 0.0, 0.1, 0.2]]

A Dense layer will treat these like actual weights with which to perform matrix multiplication. An embedding layer will simply treat these weights as a list of vectors, each vector representing one word; the 0th word in the vocabulary is w[0], 1st is w[1], etc.


For an example, use the weights above and this sentence:

[0, 2, 1, 2]

A naive Dense-based net needs to convert that sentence to a one-hot encoding

[[1, 0, 0],
 [0, 0, 1],
 [0, 1, 0],
 [0, 0, 1]]

then do a matrix multiplication

[[1 * 0.1 + 0 * 0.5 + 0 * 0.9, 1 * 0.2 + 0 * 0.6 + 0 * 0.0, 1 * 0.3 + 0 * 0.7 + 0 * 0.1, 1 * 0.4 + 0 * 0.8 + 0 * 0.2],
 [0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2],
 [0 * 0.1 + 1 * 0.5 + 0 * 0.9, 0 * 0.2 + 1 * 0.6 + 0 * 0.0, 0 * 0.3 + 1 * 0.7 + 0 * 0.1, 0 * 0.4 + 1 * 0.8 + 0 * 0.2],
 [0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2]]

=

[[0.1, 0.2, 0.3, 0.4],
 [0.9, 0.0, 0.1, 0.2],
 [0.5, 0.6, 0.7, 0.8],
 [0.9, 0.0, 0.1, 0.2]]

However, an Embedding layer simply looks at [0, 2, 1, 2] and takes the weights of the layer at indices zero, two, one, and two to immediately get

[w[0],
 w[2],
 w[1],
 w[2]]

=

[[0.1, 0.2, 0.3, 0.4],
 [0.9, 0.0, 0.1, 0.2],
 [0.5, 0.6, 0.7, 0.8],
 [0.9, 0.0, 0.1, 0.2]]

So it's the same result, just obtained in a hopefully faster way.
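
Here is a minimal NumPy sketch of the two paths above, using the example weights and sentence (the variable names are just for illustration):

import numpy as np

# the example weights: 3-word vocabulary, 4-dimensional embeddings
w = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 0.0, 0.1, 0.2]])
sentence = np.array([0, 2, 1, 2])

# Dense-style: build the one-hot matrix, then matrix-multiply
one_hot = np.eye(3)[sentence]      # shape (4, 3)
dense_result = one_hot @ w         # shape (4, 4)

# Embedding-style: index straight into the weight matrix
embedding_result = w[sentence]     # shape (4, 4)

print(np.allclose(dense_result, embedding_result))  # True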


The Embedding layer does have limitations:

  • The input needs to be integers in [0, vocab_length).
  • No bias.
  • No activation.

However, none of those limitations should matter if you just want to convert an integer-encoded word into an embedding.
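
For completeness, here is a small sketch in tf.keras (assuming TensorFlow 2.x; the sizes and weights mirror the example above) showing that an Embedding layer and a bias-free, activation-free Dense layer fed one-hot input produce the same output:

import numpy as np
import tensorflow as tf

w = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 0.0, 0.1, 0.2]], dtype="float32")
sentence = np.array([[0, 2, 1, 2]])  # a batch containing one integer-encoded sentence

# Embedding layer: the weight matrix is used as a lookup table
embedding = tf.keras.layers.Embedding(input_dim=3, output_dim=4)
embedding.build((None,))
embedding.set_weights([w])

# Dense layer with no bias and no activation: the weight matrix is a matmul kernel
dense = tf.keras.layers.Dense(4, use_bias=False)
dense.build((None, 3))
dense.set_weights([w])

out_embedding = embedding(sentence)               # direct lookup
out_dense = dense(tf.one_hot(sentence, depth=3))  # one-hot, then matrix multiplication

print(np.allclose(out_embedding.numpy(), out_dense.numpy()))  # True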

The Guy with The Hat

Mathematically, the difference is this:

  • An embedding layer performs a select operation. In Keras, this layer is equivalent to:

    K.gather(self.embeddings, inputs)      # just one matrix
    
  • A dense layer performs a dot-product operation, plus an optional bias and activation:

    outputs = matmul(inputs, self.kernel)  # a kernel matrix
    outputs = bias_add(outputs, self.bias) # a bias vector
    return self.activation(outputs)        # an activation function
    

You can emulate an embedding layer with a fully-connected layer via one-hot encoding, but the whole point of dense embeddings is to avoid the one-hot representation. In NLP, the vocabulary size can be on the order of 100k (sometimes even a million). On top of that, sequences of words usually need to be processed in batches. Processing a batch of sequences of word indices is much more efficient than processing a batch of sequences of one-hot vectors. In addition, the gather operation itself is faster than a matrix dot-product, in both the forward and backward pass.
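
As a rough illustration of this efficiency argument, here is a sketch with plain TensorFlow ops (the sizes are made up): materializing the one-hot tensor for a realistic vocabulary is already expensive before the dot-product even starts, while the gather works directly on the integer indices.

import numpy as np
import tensorflow as tf

vocab_size, embed_dim, batch, seq_len = 50_000, 64, 8, 20

embeddings = tf.random.normal([vocab_size, embed_dim])
word_ids = tf.random.uniform([batch, seq_len], maxval=vocab_size, dtype=tf.int32)

# Embedding-style select: one gather on the integer indices
gathered = tf.gather(embeddings, word_ids)              # shape (8, 20, 64)

# Dense-style emulation: a (8, 20, 50000) one-hot tensor, then a dot-product
one_hot = tf.one_hot(word_ids, depth=vocab_size)        # ~32 MB of mostly zeros
dotted = tf.einsum('bsv,vd->bsd', one_hot, embeddings)  # same values, far more work

print(np.allclose(gathered.numpy(), dotted.numpy()))    # True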

Løiten
Maxim
  • The text paragraph was helpful, but I do not understand the part at the beginning that is supposed to explain how the Embedding layer works. Care to elaborate? What exactly are select, gather, and self.embeddings, and how does this emulate the dense layer with one-hot encoding? – Make42 Sep 07 '18 at 13:52
  • Hi Maxim, I do agree with the representation efficiency you mention. However, the number of free parameters to optimise for a look-up embedding vs a dense-layer embedding is (almost) the same, i.e. vocab_size x embedding_dim, so there isn't much difference in that respect. When representation efficiency is less of an issue (e.g. for character-level encoding), are there any other reasons to prefer a lookup embedding over a dense layer? (I see you mention computation time) – Visionscaper Jan 31 '19 at 13:55
  • Can someone elaborate on how the embedding layer avoids doing the one-hot encoding? – Pranjal Sahu Mar 27 '19 at 06:10
  • @Make42 see The Guy with The Hat's answer – Super-intelligent Shade Nov 14 '19 at 22:37
  • @sahu see The Guy with The Hat's answer – Super-intelligent Shade Nov 14 '19 at 22:38

Here I want to improve on the top-voted answer by providing more details:

When we use an embedding layer, it is generally to reduce sparse one-hot input vectors to denser representations.

  1. An embedding layer is much like a table lookup. When the table is small, it is fast.

  2. When the table is large, a table lookup is much slower. In practice, we would use a dense layer as a dimension reducer on the one-hot input instead of an embedding layer in this case.

kiryu nil