
I would like to call `model.wv.most_similar_cosmul` on the same copy of the model object, using multiple cores, on batches of input pairs.

The `multiprocessing` module requires multiple copies of the model, which would require too much RAM because my model takes 30+ GB in RAM.

I have tried evaluating my query pairs: the first round took ~12 hours, and there may be more rounds coming. That's why I am looking for a threading solution, though I understand Python has the Global Interpreter Lock (GIL) issue.

Any suggestions?

Patrick the Cat
  • Which operating system? `multiprocessing` uses `fork` on Linux, so the data should be shared and only copied on write access. – BlackJack Nov 29 '17 at 15:04
  • @BlackJack That is strange. How does `Python` know ahead of time if a code segment requires write access or not? I thought if it doesn't know, it has to copy the object for each child at fork time. – Patrick the Cat Nov 29 '17 at 15:14
  • Python doesn't know, the operating system does. This hasn't anything to do with Python objects or Python in general but with operating system level processes and memory pages. – BlackJack Nov 29 '17 at 16:43
  • See [Does fork() immediately copy the entire process heap in Linux?](https://unix.stackexchange.com/questions/155017). – BlackJack Nov 29 '17 at 16:49

2 Answers


Forking off processes using `multiprocessing`, after your text-vector model is in memory and no longer being changed, might let many processes share the same object in memory.
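
As a rough sketch of that approach (the model path and query words below are placeholders, and this assumes a platform where `multiprocessing` forks, such as Linux): load the model once in the parent, then let a `Pool` of forked workers query it read-only. Note that CPython's reference counting can still dirty the memory pages holding Python object headers, but the large numpy array buffers stay shared.

import multiprocessing

from gensim.models.word2vec import Word2Vec

# Load once in the parent, before any workers are forked.
# "word2vec.model" and the query words below are placeholders.
model = Word2Vec.load("word2vec.model")


def query(word):
    # Forked workers inherit `model`; as long as access stays
    # read-only, the OS shares the pages backing its big numpy
    # arrays (copy-on-write) instead of duplicating them.
    return model.wv.most_similar_cosmul(positive=[word], topn=5)


if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(query, ["king", "queen", "paris"])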

In particular, you'd want to be sure that the automatic generation of unit-normed vectors (into `syn0norm` or `doctag_syn0norm`) has already happened. It is triggered automatically the first time a `most_similar()` call needs it, or you can force it with the `init_sims()` method on the relevant object. If you'll only be doing most-similar queries between unit-normed vectors, and never need the original raw vectors, use `init_sims(replace=True)` to clobber the raw mixed-magnitude `syn0` vectors in place and save a lot of addressable memory.
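
For example, with the pre-4.0 gensim API described above (the path is a placeholder; gensim 4.x later deprecated `init_sims()`):

from gensim.models.word2vec import Word2Vec

model = Word2Vec.load("word2vec.model")  # placeholder path

# Pre-trigger the unit-norm computation once, in the parent process,
# so forked children don't each compute (and allocate) it themselves.
# replace=True overwrites the raw syn0 vectors with the normed ones,
# roughly halving the memory footprint.
model.init_sims(replace=True)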

Gensim also has options to use memory-mapped files as the source of the model's giant arrays. When multiple processes memory-map the same read-only file, the OS is smart enough to map that file into physical memory only once, giving every process a pointer to the shared array.
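
A minimal sketch of that memory-mapped route (file names are placeholders; per the caveat in the answer linked below, the unit-norming step can still run once per process unless it was done with `replace=True` before saving):

from gensim.models.word2vec import Word2Vec

# Save once; gensim writes arrays above its size threshold to
# separate .npy files alongside the main pickle.
model = Word2Vec.load("word2vec.model")  # placeholder path
model.save("word2vec_shared.model")

# Every process can then load the same file memory-mapped; the OS
# keeps a single physical copy of the read-only pages and shares it.
shared = Word2Vec.load("word2vec_shared.model", mmap="r")
sims = shared.wv.most_similar_cosmul(positive=["king"], topn=5)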

For more discussion of the tricky parts of using this technique in a similar-but-not-identical use case, see my answer at:

How to speed up Gensim Word2vec model load time?

gojomo

Gensim v4.x.x simplified a lot of what @gojomo described above, as he also explained in his other answer here. Building on those answers, here's an example of how to run `most_similar` across multiple processes in a memory-efficient way, with progress logging via `tqdm`. Swap in your own model/dataset to see how this works at scale.

import multiprocessing
from functools import partial
from typing import Dict, List, Tuple

import tqdm
from gensim.models.word2vec import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
from gensim.test.utils import common_texts


def get_most_similar(
    word: str, keyed_vectors: KeyedVectors, topn: int
) -> List[Tuple[str, float]]:
    try:
        return keyed_vectors.most_similar(word, topn=topn)
    except KeyError:
        return []


def get_most_similar_batch(
    word_batch: List[str], word_vectors_path: str, topn: int
) -> Dict[str, List[Tuple[str, float]]]:
    # Load the KeyedVectors memory-mapped, so the big arrays are
    # shared by the OS rather than duplicated in every process
    keyed_vectors = KeyedVectors.load(word_vectors_path, mmap="r")
    return {word: get_most_similar(word, keyed_vectors, topn) for word in word_batch}


def create_batches_from_iterable(iterable, batch_size=1000):
    # Note: expects a sequence (supports len() and slicing), not a lazy iterable
    return [iterable[i : i + batch_size] for i in range(0, len(iterable), batch_size)]


if __name__ == "__main__":
    model = Word2Vec(
        sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4
    )

    # Save wv, so it can be reloaded with mmap later
    word_vectors_path = "word2vec.wordvectors"
    model.wv.save(word_vectors_path)

    # Dummy set of words to find most similar words for
    words_to_match = list(model.wv.key_to_index.keys())

    # Multiprocess
    batches = create_batches_from_iterable(words_to_match, batch_size=2)
    partial_func = partial(
        get_most_similar_batch,
        word_vectors_path=word_vectors_path,
        topn=5,
    )

    words_most_similar = dict()
    num_workers = multiprocessing.cpu_count()
    with multiprocessing.Pool(num_workers) as pool:
        max_ = len(batches)
        with tqdm.tqdm(total=max_) as pbar:
            # imap required for tqdm to function properly
            for result in pool.imap(partial_func, batches):
                words_most_similar.update(result)
                pbar.update()
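
Because each batch reloads the vectors with `mmap="r"`, the per-batch reload stays cheap for a genuinely large model: gensim stores arrays above its size threshold in separate files, and the OS maps those files into physical memory once and shares the read-only pages across all workers, so resident memory stays close to a single copy of the model regardless of pool size.
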
ZaxR