173
tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None)

I cannot understand what this function does. Is it like a lookup table? That is, does it return the parameters corresponding to each id (in ids)?

For instance, in the skip-gram model, if we use tf.nn.embedding_lookup(embeddings, train_inputs), then for each train_input does it find the corresponding embedding?

kmario23
Poorya Pzm

9 Answers

232

Yes, this function is hard to understand, until you get the point.

In its simplest form, it is similar to tf.gather. It returns the elements of params according to the indexes specified by ids.

For example (assuming you are inside tf.InteractiveSession())

params = tf.constant([10,20,30,40])
ids = tf.constant([0,1,2,3])
print(tf.nn.embedding_lookup(params, ids).eval())

would return [10 20 30 40], because the first element (index 0) of params is 10, the second element of params (index 1) is 20, etc.

Similarly,

params = tf.constant([10,20,30,40])
ids = tf.constant([1,1,3])
print(tf.nn.embedding_lookup(params, ids).eval())

would return [20 20 40].

But embedding_lookup is more than that. The params argument can be a list of tensors, rather than a single tensor.

params1 = tf.constant([1,2])
params2 = tf.constant([10,20])
ids = tf.constant([2,0,2,1,2,3])
result = tf.nn.embedding_lookup([params1, params2], ids)

In such a case, the indices specified in ids correspond to elements of the tensors according to a partition strategy, where the default partition strategy is 'mod'.

In the 'mod' strategy, index 0 corresponds to the first element of the first tensor in the list. Index 1 corresponds to the first element of the second tensor. Index 2 corresponds to the first element of the third tensor, and so on. Simply put, index i corresponds to the first element of the (i+1)th tensor, for all indices 0..(n-1), assuming params is a list of n tensors.

Now, index n cannot correspond to tensor n+1, because the list params contains only n tensors. So index n corresponds to the second element of the first tensor. Similarly, index n+1 corresponds to the second element of the second tensor, etc.

So, in the code

params1 = tf.constant([1,2])
params2 = tf.constant([10,20])
ids = tf.constant([2,0,2,1,2,3])
result = tf.nn.embedding_lookup([params1, params2], ids)

index 0 corresponds to the first element of the first tensor: 1

index 1 corresponds to the first element of the second tensor: 10

index 2 corresponds to the second element of the first tensor: 2

index 3 corresponds to the second element of the second tensor: 20

Thus, the result would be:

[ 2  1  2 10  2 20]
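
A quick way to convince yourself of that mapping is to compute it directly in plain Python (a minimal sketch; the variable names here are mine, not part of TensorFlow):

params = [[1, 2], [10, 20]]          # params1 and params2 as plain Python lists
ids = [2, 0, 2, 1, 2, 3]
n = len(params)                      # number of tensors in the list

# default 'mod' strategy: id % n picks the tensor, id // n picks the element within it
result = [params[i % n][i // n] for i in ids]
print(result)                        # [2, 1, 2, 10, 2, 20]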
nbro
Asher Stern
    a note: you can use `partition_strategy='div'` and would get `[10, 1, 10, 2, 10, 20]`, i.e. `id=1` is the second element of the first param. Basically, with `partition_strategy='mod'` (default), `id % len(params)` gives the index of the param in params and `id // len(params)` gives the index of the element within that param; with `partition_strategy='div'` it is the other way around. – Mario Alemi Jun 29 '17 at 16:47
    @asher-stern could you explain why the "mod" strategy is the default? It seems that the "div" strategy is more similar to standard tensor slicing (selecting rows by given indices). Are there performance issues with "div"? – svetlov.vsevolod Aug 10 '17 at 09:14
150

The embedding_lookup function retrieves rows of the params tensor. The behavior is similar to using indexing with arrays in NumPy. E.g.

import numpy as np

matrix = np.random.random([1024, 64])  # 64-dimensional embeddings
ids = np.array([0, 5, 17, 33])
print(matrix[ids])  # prints a matrix of shape [4, 64]

The params argument can also be a list of tensors, in which case the ids will be distributed among the tensors. For example, given a list of 3 tensors of shape [2, 64], the default ('mod') behavior is that the first tensor holds ids [0, 3], the second [1, 4], and the third [2, 5].

partition_strategy controls how the ids are distributed among the list. The partitioning is useful for large-scale problems, when the embedding matrix might be too large to keep in one piece.
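
To make that distribution concrete, here is a minimal sketch (TF 1.x style; shards of shape [2, 2] instead of [2, 64] for brevity, and the shard names are mine):

import tensorflow as tf

# with the default 'mod' strategy, shard k holds the rows whose id satisfies id % 3 == k
shard0 = tf.constant([[0., 0.], [3., 3.]])   # ids 0 and 3
shard1 = tf.constant([[1., 1.], [4., 4.]])   # ids 1 and 4
shard2 = tf.constant([[2., 2.], [5., 5.]])   # ids 2 and 5

ids = tf.constant([0, 1, 2, 3, 4, 5])
lookup = tf.nn.embedding_lookup([shard0, shard1, shard2], ids)

with tf.Session() as sess:
    print(sess.run(lookup))   # row i is the embedding for id i: [[0. 0.] [1. 1.] ... [5. 5.]]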

nbro
Rafał Józefowicz
    Why would they call it this way and not `select_rows`? – Lenar Hoyt Jul 27 '16 at 08:44
    @LenarHoyt because this idea of a lookup comes from word embeddings, and the "rows" are the representations (embeddings) of the words into a vector space -- and are useful in and of themselves. Often more so than the actual network. – Frames Catherine White Oct 22 '16 at 02:00
    How does tensorflow learn the embedding structure? Does this function manage that process too? – vgoklani Jan 17 '17 at 08:56
    @vgoklani, no, `embedding_lookup` simply provides a convenient (and parallel) way to retrieve embeddings corresponding to id in `ids`. The `params` tensor is usually a tf variable that is learned as part of the training process -- a tf variable whose components are used, directly or indirectly, in a loss function (such as `tf.l2_loss`) which is optimized by an optimizer (such as `tf.train.AdamOptimizer`). – Shobhit Jan 20 '17 at 00:48
    @LenarHoyt select_"parallel"_rows – aerin Jul 08 '17 at 19:51
    @ToussaintLouverture What do you mean by parallel? – Lenar Hoyt Jul 10 '17 at 02:19
    @Rafał Józefowicz Why "the default behavior is that they will represent ids: [0, 3], [1, 4], [2, 5]."? Could you explain? – aerin Aug 16 '17 at 19:50
  • Just to clarify, the exact same functionality can be performed using tf.gather, correct? – mortonjt Mar 20 '18 at 16:19
  • So, this function is used only to get embeddings suitable to the input batches? If that is the case, why not just create embeddings that suits the batch size? – Huzo Oct 30 '18 at 05:55
  • Can someone explain the implementation of it ? Like how it is parallel and how are the keys retrieved from memory ? – Pranjal Sahu Mar 27 '19 at 04:55
  • Is it trainable? – mrgloom Jul 25 '19 at 14:23
    `For example, given a list of 3 tensors [2, 64], the default behavior is that they will represent ids: [0, 3], [1, 4], [2, 5]. ` - need explanation. – mrgloom Jul 25 '19 at 14:25
49

Yes, the purpose of the tf.nn.embedding_lookup() function is to perform a lookup in the embedding matrix and return the embeddings (in simple terms, the vector representations) of words.

A simple embedding matrix (of shape vocabulary_size x embedding_dimension) would look like the one below, i.e. each word is represented by a vector of numbers (hence the name word2vec).


Embedding Matrix

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862
like 0.36808 0.20834 -0.22319 0.046283 0.20098 0.27515 -0.77127 -0.76804
between 0.7503 0.71623 -0.27033 0.20059 -0.17008 0.68568 -0.061672 -0.054638
did 0.042523 -0.21172 0.044739 -0.19248 0.26224 0.0043991 -0.88195 0.55184
just 0.17698 0.065221 0.28548 -0.4243 0.7499 -0.14892 -0.66786 0.11788
national -1.1105 0.94945 -0.17078 0.93037 -0.2477 -0.70633 -0.8649 -0.56118
day 0.11626 0.53897 -0.39514 -0.26027 0.57706 -0.79198 -0.88374 0.30119
country -0.13531 0.15485 -0.07309 0.034013 -0.054457 -0.20541 -0.60086 -0.22407
under 0.13721 -0.295 -0.05916 -0.59235 0.02301 0.21884 -0.34254 -0.70213
such 0.61012 0.33512 -0.53499 0.36139 -0.39866 0.70627 -0.18699 -0.77246
second -0.29809 0.28069 0.087102 0.54455 0.70003 0.44778 -0.72565 0.62309 

I split the above embedding matrix into two parts: the words, loaded into vocab (our vocabulary), and the corresponding vectors, loaded into the emb array.

import numpy as np

vocab = ['the','like','between','did','just','national','day','country','under','such','second']

emb = np.array([[0.418, 0.24968, -0.41242, 0.1217, 0.34527, -0.044457, -0.49688, -0.17862],
   [0.36808, 0.20834, -0.22319, 0.046283, 0.20098, 0.27515, -0.77127, -0.76804],
   [0.7503, 0.71623, -0.27033, 0.20059, -0.17008, 0.68568, -0.061672, -0.054638],
   [0.042523, -0.21172, 0.044739, -0.19248, 0.26224, 0.0043991, -0.88195, 0.55184],
   [0.17698, 0.065221, 0.28548, -0.4243, 0.7499, -0.14892, -0.66786, 0.11788],
   [-1.1105, 0.94945, -0.17078, 0.93037, -0.2477, -0.70633, -0.8649, -0.56118],
   [0.11626, 0.53897, -0.39514, -0.26027, 0.57706, -0.79198, -0.88374, 0.30119],
   [-0.13531, 0.15485, -0.07309, 0.034013, -0.054457, -0.20541, -0.60086, -0.22407],
   [ 0.13721, -0.295, -0.05916, -0.59235, 0.02301, 0.21884, -0.34254, -0.70213],
   [ 0.61012, 0.33512, -0.53499, 0.36139, -0.39866, 0.70627, -0.18699, -0.77246 ],
   [ -0.29809, 0.28069, 0.087102, 0.54455, 0.70003, 0.44778, -0.72565, 0.62309 ]])


emb.shape
# (11, 8)

Embedding Lookup in TensorFlow

Now we will see how we can perform an embedding lookup for an arbitrary input sentence.

In [54]: from collections import OrderedDict

# embedding as TF tensor (for now constant; could be tf.Variable() during training)
In [55]: tf_embedding = tf.constant(emb, dtype=tf.float32)

# input for which we need the embedding
In [56]: input_str = "like the country"

# build index based on our `vocabulary`
In [57]: word_to_idx = OrderedDict({w:vocab.index(w) for w in input_str.split() if w in vocab})

# lookup in embedding matrix & return the vectors for the input words
In [58]: tf.nn.embedding_lookup(tf_embedding, list(word_to_idx.values())).eval()
Out[58]: 
array([[ 0.36807999,  0.20834   , -0.22318999,  0.046283  ,  0.20097999,
         0.27515   , -0.77126998, -0.76804   ],
       [ 0.41800001,  0.24968   , -0.41242   ,  0.1217    ,  0.34527001,
        -0.044457  , -0.49687999, -0.17862   ],
       [-0.13530999,  0.15485001, -0.07309   ,  0.034013  , -0.054457  ,
        -0.20541   , -0.60086   , -0.22407   ]], dtype=float32)

Observe how we got the embeddings from our original embedding matrix (with words) using the indices of words in our vocabulary.

Usually, such an embedding lookup is performed by the first layer (called the Embedding layer), which then passes these embeddings to RNN/LSTM/GRU layers for further processing.


Side Note: Usually the vocabulary will also have a special unk token. So, if a token from our input sentence is not present in our vocabulary, then the index corresponding to unk will be looked up in the embedding matrix.
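
For illustration, here is a minimal sketch of that unk fallback (the `<unk>` token, its index, and the encode helper below are my own additions, not part of the code above):

# hypothetical: add '<unk>' at the end of the vocabulary so existing indices stay valid;
# a corresponding row would also have to be appended to the `emb` matrix
unk_idx = len(vocab)                               # 11
word_to_idx = {w: i for i, w in enumerate(vocab)}

def encode(sentence):
    # unknown words fall back to the '<unk>' index
    return [word_to_idx.get(w, unk_idx) for w in sentence.split()]

print(encode("like the moon"))                     # 'moon' is OOV -> [1, 0, 11]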


P.S. Note that embedding_dimension is a hyperparameter that one has to tune for their application, but popular models like Word2Vec and GloVe use 300-dimensional vectors to represent each word.

Bonus reading: word2vec skip-gram model

kmario23
18

Here's an image depicting the process of embedding lookup.

Image: Embedding lookup process

Concisely, it gets the corresponding rows of an embedding layer, specified by a list of IDs, and provides them as a tensor. This is achieved through the following process (a runnable sketch follows the list).

  1. Define a placeholder for the ids: lookup_ids = tf.placeholder(tf.int32, shape=[None])
  2. Define an embedding layer: embeddings = tf.Variable(tf.random_normal([100, 10]))
  3. Define the TensorFlow operation: embed_lookup = tf.nn.embedding_lookup(embeddings, lookup_ids)
  4. Get the results by running: lookup = session.run(embed_lookup, feed_dict={lookup_ids: [95, 4, 14]})
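
Putting those steps together, a minimal runnable sketch (TF 1.x; the shapes and the random initializer are assumptions for illustration):

import tensorflow as tf

lookup_ids = tf.placeholder(tf.int32, shape=[None])            # step 1: ids to look up
embeddings = tf.Variable(tf.random_normal([100, 10]))          # step 2: 100 embeddings of size 10
embed_lookup = tf.nn.embedding_lookup(embeddings, lookup_ids)  # step 3: the lookup op

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    lookup = session.run(embed_lookup, feed_dict={lookup_ids: [95, 4, 14]})  # step 4
    print(lookup.shape)  # (3, 10)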
Fabian N.
thushv89
7

When the params tensor is high-dimensional, the ids only refer to the top (first) dimension. Maybe it's obvious to most people, but I had to run the following code to understand that:

embeddings = tf.constant([[[1,1],[2,2],[3,3],[4,4]],[[11,11],[12,12],[13,13],[14,14]],
                          [[21,21],[22,22],[23,23],[24,24]]])
ids=tf.constant([0,2,1])
embed = tf.nn.embedding_lookup(embeddings, ids, partition_strategy='div')

with tf.Session() as session:
    result = session.run(embed)
    print (result)

I also tried the 'div' strategy here; for a single tensor, it makes no difference.

Here is the output:

[[[ 1  1]
  [ 2  2]
  [ 3  3]
  [ 4  4]]

 [[21 21]
  [22 22]
  [23 23]
  [24 24]]

 [[11 11]
  [12 12]
  [13 13]
  [14 14]]]
Yan Zhao
3

Another way to look at it is to assume that you flatten out the tensors into a one-dimensional array and then perform a lookup.

For example: Tensor0 = [1,2,3], Tensor1 = [4,5,6], Tensor2 = [7,8,9]

The flattened-out tensor will be as follows: [1,4,7,2,5,8,3,6,9]

Now when you do a lookup of [0,3,4,1,7] it will yield [1,2,5,4,6].

I.e., if the lookup value is 7, for example, and we have 3 tensors (or a tensor with 3 rows), then:

7 / 3: the remainder is 1 and the quotient is 2. The remainder (1) selects Tensor1, and the quotient (2) selects the element at index 2 of Tensor1, which is 6.
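
You can verify this directly (a minimal sketch, assuming a TF 1.x session):

import tensorflow as tf

t0 = tf.constant([1, 2, 3])
t1 = tf.constant([4, 5, 6])
t2 = tf.constant([7, 8, 9])
ids = tf.constant([0, 3, 4, 1, 7])

with tf.Session() as sess:
    # default 'mod' strategy: id % 3 picks the tensor, id // 3 picks the element
    print(sess.run(tf.nn.embedding_lookup([t0, t1, t2], ids)))  # [1 2 5 4 6]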

3

Since I was also intrigued by this function, I'll give my two cents.

The way I see it in the 2D case is just as a matrix multiplication (it's easy to generalize to other dimensions).

Consider a vocabulary with N symbols. Then, you can represent a symbol x as a vector of dimensions Nx1, one-hot-encoded.

But you want a representation of this symbol not as a vector of Nx1, but as one with dimensions Mx1, called y.

So, to transform x into y, you can use an embedding matrix E, with dimensions MxN:

y = E x.

This is essentially what tf.nn.embedding_lookup(params, ids, ...) is doing, with the nuance that each id is just a number representing the position of the 1 in the one-hot-encoded vector x.
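
A small NumPy sketch of that equivalence (the names N, M, E, x, y follow the notation above; the random matrix is just for illustration):

import numpy as np

N, M = 5, 3                        # vocabulary size, embedding dimension
E = np.random.random((M, N))       # embedding matrix, one column per symbol

idx = 2                            # the id you would pass to embedding_lookup
x = np.zeros((N, 1))
x[idx] = 1                         # one-hot-encoded column vector for that symbol

y = E @ x                          # the matrix-multiplication view: y = E x
assert np.allclose(y.ravel(), E[:, idx])   # same as simply selecting column `idx` of E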

0

Adding to Asher Stern's answer, params is interpreted as a partitioning of a large embedding tensor. It can be a single tensor representing the complete embedding tensor, or a list of X tensors, all of the same shape except for the first dimension, representing sharded embedding tensors.

The function tf.nn.embedding_lookup is written with the expectation that the embedding (params) may be large. That is why we need partition_strategy.
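
As a rough sketch of how such a sharded embedding might be created in TF 1.x (the shapes and shard count here are arbitrary illustration values; this only builds the graph):

import tensorflow as tf

# a [1000, 64] embedding variable stored as 4 shards, split along the first dimension
embeddings = tf.get_variable(
    "embeddings", shape=[1000, 64],
    partitioner=tf.fixed_size_partitioner(num_shards=4))

ids = tf.constant([3, 17, 999])
looked_up = tf.nn.embedding_lookup(embeddings, ids)   # shape [3, 64]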

aerin
0

The existing explanations are not enough. The main purpose of this function is to efficiently retrieve the vectors for each word in a given sequence of word indices. Suppose we have the following matrix of embeddings:

import numpy as np

embds = np.array([[0.2, 0.32, 0.9],
                  [0.8, 0.62, 0.19],
                  [0.0, -0.22, -1.9],
                  [1.2, 2.32, 6.0],
                  [0.11, 0.10, 5.9]])

Let's say we have the following sequences of word indices:

data=[[0,1],
     [3,4]]

Now to get the corresponding embedding for each word in our data:

tf.nn.embedding_lookup(
    embds, data
)

out:

array([[[0.2 , 0.32, 0.9 ],
        [0.8 , 0.62, 0.19]],

       [[1.2 , 2.32, 6.  ],
        [0.11, 0.1 , 5.9 ]]])

Note: if embds is not an array or tensor, the output will not look like this (I won't go into details). For example, if embds were a Python list, the output would be:

array([[0.2 , 0.32],
       [0.8 , 0.62]], dtype=float32)
Eric Aya