
I trained a doc2vec model with Python gensim on a corpus of 40,000,000 documents. The model is used to infer docvecs for millions of documents every day. To ensure stability, I set alpha to a small value and use a large number of steps, instead of setting a constant random seed:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load('doc2vec_dm.model')
doc_demo = ['a', 'b']  # a pre-tokenized document
# model.random.seed(0)  # a constant seed would make inference deterministic
model.infer_vector(doc_demo, alpha=0.1, min_alpha=0.0001, steps=100)

doc2vec.infer_vector() accepts only one document at a time, and each inference takes almost 0.1 seconds. Is there any API that can infer vectors for a whole series of documents in one call?

wainhuang

1 Answer


Currently, there's no gensim API that does large batches of inference at once (which could help by using multiple threads internally). It's a wishlist item, among other improvements: https://github.com/RaRe-Technologies/gensim/issues/515

You might get some speedup, up to the number of cores in your CPU, by spreading your own inference jobs over multiple threads.
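
For example, here's a minimal sketch of that threaded approach using concurrent.futures; the docs list, worker count, and inference parameters are illustrative, not prescriptive:

from concurrent.futures import ThreadPoolExecutor

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load('doc2vec_dm.model')

def infer(tokens):
    # Each call still infers one document at a time; note that in
    # gensim 4.x the `steps` parameter was renamed to `epochs`.
    return model.infer_vector(tokens, alpha=0.1, min_alpha=0.0001, steps=100)

docs = [['a', 'b'], ['c', 'd'], ['e', 'f']]  # your pre-tokenized documents

# Threads only overlap where gensim's compiled code releases the GIL,
# so the speedup will be partial and capped by your core count.
with ThreadPoolExecutor(max_workers=4) as pool:
    vectors = list(pool.map(infer, docs))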

To eliminate all multithreaded contention due to the Python GIL, you could spread your inference over separate Python processes. If each process loads the model using the trick described in another answer (see below), the OS will let them share the large model backing arrays (paying the RAM cost only once), while each runs one completely independent, unblocked thread of inference.

(Specifically, Doc2Vec.load() can also use the mmap='r' mode to load an existing on-disk model with memory-mapping of its backing files. Inference alone, with no most_similar()-like operations, only reads the shared raw backing arrays, so no fussing with the _norm variants should be necessary if you're launching single-purpose processes that just do inference, save their results, and exit.)
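
As a sketch of that multi-process pattern, assuming the saved model's large arrays sit in separate backing files alongside doc2vec_dm.model (the file name, chunking, and process count here are illustrative):

import multiprocessing as mp

from gensim.models.doc2vec import Doc2Vec

MODEL_PATH = 'doc2vec_dm.model'

def infer_chunk(doc_chunk, out_queue):
    # mmap='r' maps the big backing arrays read-only, so all processes
    # share the same OS file-cache pages instead of each paying the
    # full RAM cost of the model.
    model = Doc2Vec.load(MODEL_PATH, mmap='r')
    vecs = [model.infer_vector(tokens, alpha=0.1, min_alpha=0.0001, steps=100)
            for tokens in doc_chunk]
    out_queue.put(vecs)

if __name__ == '__main__':
    docs = [['a', 'b'], ['c', 'd'], ['e', 'f'], ['g', 'h']]
    n_procs = 2
    chunks = [docs[i::n_procs] for i in range(n_procs)]

    queue = mp.Queue()
    procs = [mp.Process(target=infer_chunk, args=(chunk, queue))
             for chunk in chunks]
    for p in procs:
        p.start()
    # Drain the queue before joining, so large result payloads
    # don't deadlock the child processes.
    vectors = [vec for _ in procs for vec in queue.get()]
    for p in procs:
        p.join()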

gojomo