
I'm using TensorFlow (0.6) to train a CNN on text data. I'm using an approach similar to the second option in this SO thread, except that my embeddings are trainable. My dataset is fairly small and its vocabulary is around 12,000 words. When I train with randomly initialized word embeddings, everything works nicely. However, when I switch to the pre-trained embeddings from the word2vec site, the vocabulary grows to over 3,000,000 words and each training iteration becomes over 100 times slower. I'm also seeing this warning:

`UserWarning: Converting sparse IndexedSlices to a dense Tensor with 900482700 elements`

(900,482,700 elements works out to 3,001,609 × 300, i.e. roughly the full pre-trained vocabulary times the 300-dimensional vectors, so it appears to be the entire embedding matrix that's being densified.)
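For context, the embedding layer is set up roughly like this (a simplified sketch, not my exact code; the shapes are placeholders and the graph-style `tf.placeholder`/`tf.nn.embedding_lookup` pattern shown here may differ in detail from what TF 0.6 exposes):

```python
import numpy as np
import tensorflow as tf

# Placeholder shapes: ~3M-word word2vec vocabulary, 300-dim vectors.
vocab_size, embedding_dim = 3000000, 300
pretrained = np.random.randn(vocab_size, embedding_dim).astype(np.float32)  # stand-in for the word2vec matrix

# [batch, sequence] of word ids.
word_ids = tf.placeholder(tf.int32, shape=[None, None], name="word_ids")

# Trainable embedding matrix initialized from the pre-trained vectors.
# (In practice the matrix would be fed in via a placeholder + assign op so the
# 3M x 300 constant doesn't get baked into the GraphDef.)
embeddings = tf.Variable(pretrained, name="embeddings", trainable=True)

# Gradients w.r.t. `embeddings` should come back as sparse IndexedSlices,
# touching only the rows looked up in this batch.
embedded = tf.nn.embedding_lookup(embeddings, word_ids)  # [batch, sequence, embedding_dim]
```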

I saw the discussion in this TensorFlow issue, but I'm still not sure whether the slowdown I'm experiencing is expected or a bug. I'm using the Adam optimizer, but the behavior is pretty much the same with Adagrad.
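For what it's worth, one way to check at graph-construction time whether the embedding gradient is still sparse (a sketch; `loss` and `embeddings` stand in for the corresponding nodes in my graph, and the exact location of the `IndexedSlices` class varies between TF versions):

```python
# The gradient of the loss w.r.t. the embedding matrix should be an
# IndexedSlices object (only the looked-up rows), not a dense Tensor.
grad = tf.gradients(loss, [embeddings])[0]

if isinstance(grad, tf.IndexedSlices):
    print("sparse gradient: only the rows used in the batch get updated")
else:
    print("dense gradient: the full embedding matrix is being materialized")
```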

One workaround I could try is to train with a minimal embedding matrix containing only the ~12,000 words in my dataset, serialize the resulting embeddings, and then at runtime merge them with the remaining words from the pre-trained embeddings. I think this should work, but it feels hacky.
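Roughly what I have in mind, in case it helps (a sketch; `load_word2vec` and `my_dataset_vocabulary` are hypothetical stand-ins for however the pre-trained vectors and my vocabulary get loaded):

```python
import numpy as np

pretrained = load_word2vec("GoogleNews-vectors-negative300.bin")  # hypothetical: word -> 300-dim vector
dataset_words = sorted(my_dataset_vocabulary)                     # hypothetical: the ~12,000 dataset words

embedding_dim = 300
small_matrix = np.zeros((len(dataset_words), embedding_dim), dtype=np.float32)
word_to_id = {}

for i, word in enumerate(dataset_words):
    word_to_id[word] = i
    if word in pretrained:
        small_matrix[i] = pretrained[word]
    else:
        # Words missing from word2vec get a small random initialization.
        small_matrix[i] = np.random.uniform(-0.25, 0.25, embedding_dim)

# Train the CNN against `small_matrix` (12,000 x 300 instead of 3M x 300),
# then at serving time write the trained rows back over the corresponding
# rows of the full pre-trained matrix.
```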

Is that currently the best solution or am I missing something?

  • Can you share the code where you define the embedding? This conversion to a dense tensor should only happen if the `params` argument to `tf.gather()` is the result of an op that doesn't have a gradient function with a specialization for `IndexedSlices`. However, if you're using pre-trained embeddings, this shouldn't need to be the case. – mrry Mar 09 '16 at 05:29
  • @mrry, thanks. You were right, my code was also calculating some summary statistics of the embeddings gradient and that caused the conversion to a dense tensor. See my answer below. – hillel Mar 09 '16 at 17:07

1 Answer


So there were two issues here:

  1. As mrry pointed out in his comment on the question, the warning was not the result of a conversion during the parameter updates. Rather, I was computing summary statistics (sparsity and a histogram) on the embedding gradient, and that caused the conversion to a dense tensor.
  2. Interestingly, removing the summaries made the message go away, but the code remained slow. Per the TensorFlow issue referenced in the question, I also had to replace the AdamOptimizer with the AdagradOptimizer; once I did that, the runtime was back on par with the small-vocabulary runs. (A sketch of both changes follows this list.)
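My actual fix was simply deleting the gradient summaries and switching optimizers. As a hedged sketch of what that amounts to, the snippet below shows the optimizer swap, plus one way to keep the summaries without the densification by summarizing `grad.values` instead of the `IndexedSlices` object (the summary ops are the pre-1.0 `tf.histogram_summary`/`tf.scalar_summary` names, and `loss` stands in for my actual objective):

```python
# 1. Use Adagrad instead of Adam for the sparse embedding updates.
optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)  # was tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars)

# 2. Summarize the gradient's values rather than the IndexedSlices object;
#    passing the IndexedSlices straight into a summary op is what triggered
#    the dense-conversion warning in my case.
for grad, var in grads_and_vars:
    if grad is None:
        continue
    grad_values = grad.values if isinstance(grad, tf.IndexedSlices) else grad
    tf.histogram_summary(var.op.name + "/grad/hist", grad_values)
    tf.scalar_summary(var.op.name + "/grad/sparsity", tf.nn.zero_fraction(grad_values))
```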
  • That's interesting that it was summary statistics that caused the densification in (1), especially if they don't participate in the gradient computation. This could point to a bug in `gradients.py`, so if you have the time, please file a [GitHub issue](https://github.com/tensorflow/tensorflow/issues) with a snippet of code that yields the warning. – mrry Mar 09 '16 at 17:18
  • @mrry I was running the train_op and the summaries_op together at each training step. If you still think it's a bug let me know and I'll open an issue. – hillel Mar 09 '16 at 18:19
  • The summaries op shouldn't influence the gradients (assuming the summary statistics aren't somehow used in the loss calculation) so that sounds like a bug to me! – mrry Mar 09 '16 at 18:20