I'm using TensorFlow (0.6) to train a CNN on text data. I'm using a method similar to the second option specified in this SO thread (with the exception that the embeddings are trainable). My dataset is pretty small and the vocabulary is around 12,000 words. When I train using random word embeddings everything works nicely. However, when I switch to the pre-trained embeddings from the word2vec site, the vocabulary grows to over 3,000,000 words and training iterations become over 100 times slower. I'm also seeing this warning:
UserWarning: Converting sparse IndexedSlices to a dense Tensor with 900482700 elements
I saw the discussion on this TensorFlow issue, but I'm still not sure if the slowdown I'm experiencing is expected or if it's a bug. I'm using the Adam optimizer but it's pretty much the same thing with Adagrad.
One workaround I guess I could try is to train using a minimal embedding matrix with only the ~12,000 words in my dataset, serialize the resulting embeddings and at runtime merge them with the remaining words from the pre-trained embeddings. I think this should work but it sounds hacky.
Is that currently the best solution or am I missing something?