
When using pre-trained word vectors for classification with an LSTM, I wondered how to deal with an embedding lookup table larger than 2 GB in TensorFlow.

To do this, I tried to create the embedding lookup as in the code below:

data = tf.nn.embedding_lookup(vector_array, input_data)

but I got this ValueError:

ValueError: Cannot create a tensor proto whose content is larger than 2GB

The variable vector_array in the code above is a NumPy array; it contains about 14 million unique tokens, with a 100-dimensional word vector for each token.
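
For reference, a rough size calculation (assuming the vectors are stored as float32, 4 bytes per value, which is an assumption on my part) shows why this exceeds the 2 GB tensor-proto limit:

# Rough size of the embedding matrix (float32 assumed, 4 bytes per value)
num_tokens = 14 * 10**6   # ~14 million unique tokens
dim = 100                 # 100-dimensional vectors
size_gb = num_tokens * dim * 4 / float(1024**3)
print("%.1f GB" % size_gb)  # ~5.2 GB, well above the 2 GB limit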

Thank you for your help.

shinys

3 Answers


You need to copy it into a tf.Variable. There's a great answer to this question on Stack Overflow: Using a pre-trained word embedding (word2vec or GloVe) in TensorFlow

This is how I did it:

# Variable created with a dummy constant initializer, so the large matrix
# is never embedded in the graph definition.
embedding_weights = tf.Variable(tf.constant(0.0, shape=[embedding_vocab_size, EMBEDDING_DIM]),
                                trainable=False, name="embedding_weights")
# Placeholder and assign op used to feed the real matrix in at runtime.
embedding_placeholder = tf.placeholder(tf.float32, [embedding_vocab_size, EMBEDDING_DIM])
embedding_init = embedding_weights.assign(embedding_placeholder)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding_matrix})

You can then use the embedding_weights variable to perform the lookup (remember to store the word-index mapping).
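
As a minimal sketch of what that lookup could look like (the word_to_index dict and the input_ids placeholder below are illustrative assumptions, not part of the code above):

# Hypothetical word-index mapping, in the same row order as embedding_matrix.
word_to_index = {"the": 0, "cat": 1, "sat": 2}

input_ids = tf.placeholder(tf.int32, shape=[None], name="input_ids")
# Gathers rows of the already-assigned embedding_weights variable.
embedded = tf.nn.embedding_lookup(embedding_weights, input_ids)

ids = [word_to_index[w] for w in ["the", "cat", "sat"]]
vectors = sess.run(embedded, feed_dict={input_ids: ids})  # shape (3, EMBEDDING_DIM)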

Update: Using the variable is not required, but it allows you to save it for future use so that you don't have to redo the whole thing again (it takes a while on my laptop to load very large embeddings). If that's not important, you can simply use placeholders as Niklas Schnelle suggested.
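
For the "save it for future use" part, a minimal sketch with tf.train.Saver (the checkpoint path here is an illustrative assumption):

# Save only the embedding variable so later runs can restore it instead of
# feeding the full matrix again (checkpoint path is an assumption).
saver = tf.train.Saver({"embedding_weights": embedding_weights})
saver.save(sess, "./embedding_weights.ckpt")

# In a later session, after building the same variable:
# saver.restore(sess, "./embedding_weights.ckpt")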

ltt

For me the accepted answer doesn't seem to work. While there is no error, the results were terrible (compared to a smaller embedding with direct initialization), and I suspect the embeddings were just the constant 0 that the tf.Variable() is initialized with.

Using just a placeholder without an extra variable

self.Wembed = tf.placeholder(
    tf.float32, self.embeddings.shape,
    name='Wembed')

and then feeding the embeddings on every session.run() of the graph does seem to work, however.
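
A minimal sketch of that pattern (input_ids, train_op and batch_ids are illustrative assumptions):

# No tf.Variable and no constant in the graph; the matrix lives only in the feed.
embedded = tf.nn.embedding_lookup(self.Wembed, input_ids)
# ... build the rest of the model on top of `embedded` ...

# The full matrix has to be fed on every run that touches the lookup.
sess.run(train_op, feed_dict={input_ids: batch_ids,
                              self.Wembed: self.embeddings})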

Niklas Schnelle
  • Sorry to hear it didn't work for you. When I tried it, I checked individual weights using an interactive session and they were definitely not all 0s. Also, I merely quoted an answer on Stack Overflow with over 93 upvotes, which was given by a Google employee, so I'm not sure what went wrong - perhaps there's some typo in my code or yours. You can certainly do what you suggested and not use a variable, but I don't think you'll be able to make the tensor persist that way. I save my variable so that I don't have to run this memory-consuming process in the future - I just restore the saved variable. – ltt Feb 10 '18 at 11:43
  • Note that there is currently an [issue](https://github.com/tensorflow/tensorflow/issues/17233) with TensorFlow 1.6 that slows this solution down significantly. – Niklas Schnelle Mar 13 '18 at 10:23

Using feed_dict with large embeddings was too slow for me with TF 1.8, probably due to the issue mentioned by Niklas Schnelle.

I ended up with the following code:

# Placeholder used only as the variable's initializer input, so the matrix
# never becomes a constant in the graph definition.
embeddings_ph = tf.placeholder(tf.float32, wordVectors.shape, name='wordEmbeddings_ph')
embeddings_var = tf.Variable(embeddings_ph, trainable=False, name='wordEmbeddings')
embeddings = tf.nn.embedding_lookup(embeddings_var, input_data)
.....
# The placeholder is fed once, when the variable initializer runs.
sess.run(tf.global_variables_initializer(), feed_dict={embeddings_ph: wordVectors})
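
After the initializer has run, the variable holds the embeddings, so later steps don't need to feed the matrix again; a sketch of a subsequent run (train_op and batch_ids are illustrative assumptions):

# Subsequent runs only feed the regular batch inputs, not embeddings_ph.
_, batch_embeddings = sess.run([train_op, embeddings],
                               feed_dict={input_data: batch_ids})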