
I am trying to train a doc2vec model based on user browsing history (URLs tagged with a user_id), using the Chainer deep learning framework.

There are more than 20 million embeddings (user_ids and URLs) to initialize, which does not fit in GPU memory (12 GB maximum available). Training on the CPU is very slow.
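
For scale, here is a rough back-of-the-envelope estimate of the footprint (the 300-dimensional float32 vectors below are only an illustrative assumption, not my actual setting):

    # Rough footprint of the embedding table alone.
    n_rows = 20 * 10**6        # user_ids + URLs
    dim = 300                  # assumed embedding size (illustrative)
    bytes_per_value = 4        # float32

    weights_gb = n_rows * dim * bytes_per_value / 1024 ** 3
    print("embedding weights: %.1f GB" % weights_gb)       # ~22.4 GB

    # Adam keeps two extra float32 arrays of the same shape (m and v),
    # roughly tripling the requirement during training.
    print("with Adam state:   %.1f GB" % (3 * weights_gb))

Even at 100 dimensions the table alone is roughly 7.5 GB before optimizer state, which is why it is tight on a 12 GB card.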

I am attempting this with the Chainer code given here: https://github.com/monthly-hack/chainer-doc2vec

Please advise on options to try, if any.

Aljo Jose

1 Answer


You may also refer to Chainer's official word2vec example.

Did you already try training on the GPU? Usually, only a batch of data is transferred to GPU memory at a time, so the total amount of data (20M) does not affect the GPU memory limit.
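
This is not the official example itself, just a minimal sketch of that training pattern; the model, vocabulary size, and dataset below are made up for illustration. The point is that passing device=0 to the updater copies only the current minibatch to the GPU on each iteration, while the rest of the dataset stays in host RAM (the model parameters, including the EmbedID table, are what live on the GPU).

    import numpy as np
    import chainer
    import chainer.functions as F
    import chainer.links as L
    from chainer import training


    class ToySkipGram(chainer.Chain):
        """Look up an embedding for a word id and predict a context id."""

        def __init__(self, n_vocab, n_units):
            super(ToySkipGram, self).__init__()
            with self.init_scope():
                self.embed = L.EmbedID(n_vocab, n_units)
                self.out = L.Linear(n_units, n_vocab)

        def __call__(self, x, t):
            h = self.embed(x)
            return F.softmax_cross_entropy(self.out(h), t)


    n_vocab, n_units = 10000, 100      # toy sizes, not the real 20M vocabulary
    model = ToySkipGram(n_vocab, n_units)
    model.to_gpu(0)                    # parameters (incl. the EmbedID table) go to the GPU

    optimizer = chainer.optimizers.Adam()
    optimizer.setup(model)

    # Dummy (word id, context id) pairs standing in for the real corpus.
    x = np.random.randint(0, n_vocab, size=100000).astype(np.int32)
    t = np.random.randint(0, n_vocab, size=100000).astype(np.int32)
    train_iter = chainer.iterators.SerialIterator(
        chainer.datasets.TupleDataset(x, t), batch_size=1024)

    # device=0 makes the updater copy only the current minibatch to GPU memory
    # on each iteration; the full dataset stays in host RAM.
    updater = training.updaters.StandardUpdater(train_iter, optimizer, device=0)
    trainer = training.Trainer(updater, (1, 'epoch'))
    trainer.run()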

corochann
  • 20M is the size of the vocabulary for which embeddings need to be initialized and copied to the GPU before training starts. The size of the data is not a concern, as it can be batched. – Aljo Jose Dec 31 '18 at 18:45