Doc2Vec training with gensim does not use a multi-core CPU efficiently

Question

I am training Doc2Vec with gensim on a virtual machine with 24 CPU cores and 100 GB of memory, but CPU usage always stays around 200%, no matter how I change the number of cores.

[Screenshots: top and htop output]

The two screenshots above show the CPU usage percentage, which indicates that the CPU is not being used efficiently.

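import multiprocessing
from collections import OrderedDict

import gensim.models.doc2vec
from gensim.models.doc2vec import Doc2Vec

# Use one worker thread per available core
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"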

simple_models = [
    # PV-DBOW plain
    Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0, 
            epochs=20, workers=cores),
    # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
    Doc2Vec(dm=1, vector_size=100, window=10, negative=5, hs=0, min_count=2, sample=0, 
            epochs=20, workers=cores, alpha=0.05, comment='alpha=0.05'),
    # PV-DM w/ concatenation - big, slow, experimental mode
    # window=5 (both sides) approximates paper's apparent 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, vector_size=100, window=5, negative=5, hs=0, min_count=2, sample=0, 
            epochs=20, workers=cores),
]

for model in simple_models:
    model.build_vocab(all_x_w2v)
    print("%s vocabulary scanned & state initialized" % model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

Edit:

I tried the corpus_file parameter instead of documents, and that resolved the problem above. However, I now need to adjust the code and convert all_x_w2v to a file, which all_x_w2v does not support directly.

score 5 · Accepted Answer · answered Aug 16 '19 at 23:37

The Python Global Interpreter Lock ("GIL") and other inter-thread bottlenecks prevent gensim's code from saturating all CPU cores when you use the classic, flexible corpus iterators of Word2Vec/Doc2Vec/etc., where you can supply any re-iterable sequence of texts.

You can improve the throughput a bit with steps like:

- using larger values of negative, size (named vector_size in recent gensim versions), & window
- avoiding any complicated steps (like tokenization) in your iterator; ideally it will just be streaming from a simple on-disk format
- experimenting with different worker counts; the optimal count will vary based on your other parameters & system details, but is often in the 3-12 range (no matter how many more cores you have)
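As a rough sketch of the last two points: the iterator below does nothing but split lines that were tokenized in advance, and a small helper times one training pass per worker count. The names PretokenizedCorpus and time_worker_counts are made up for illustration, the file layout (one space-delimited document per line) is an assumption, and the Doc2Vec parameters are simply copied from the PV-DBOW model in the question.

```python
import time
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, so the iterator can be tried without gensim.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


class PretokenizedCorpus:
    """Re-iterable corpus over a file holding one space-delimited document per line.

    Hypothetical helper: all tokenization is assumed to have happened when the
    file was written, so each pass only reads and splits lines.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass; gensim iterates the corpus once per epoch.
        with open(self.path, encoding="utf-8") as handle:
            for line_no, line in enumerate(handle):
                yield TaggedDocument(words=line.split(), tags=[line_no])


def time_worker_counts(corpus, counts=(3, 6, 12)):
    """Train one epoch per worker count and return {workers: seconds}."""
    from gensim.models.doc2vec import Doc2Vec  # imported lazily; requires gensim

    timings = {}
    for workers in counts:
        model = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2,
                        sample=0, epochs=1, workers=workers)
        model.build_vocab(corpus)
        start = time.perf_counter()
        model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
        timings[workers] = time.perf_counter() - start
    return timings
```

Whichever count gives the shortest time on a representative sample of your data is a reasonable starting point; the best value may shift when you change the other parameters.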

Additionally, recent versions of gensim offer an alternative corpus-specification method: a corpus_file pointer to an already space-delimited, text-per-line file. If you supply your texts this way, multiple threads will each read the raw file in optimized code – and it's possible to achieve much higher CPU utilization. However, in this mode you lose the ability to specify your own document tags, or more than one tag per document. (The documents will just be given unique IDs based on their line-number in the file.)
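For the conversion step mentioned in the question's edit, something like the following may be enough. It is only a sketch: write_line_corpus and train_from_file are hypothetical names, and I'm assuming all_x_w2v is an iterable of TaggedDocument-like objects (anything with a .words list) or plain token lists, with no whitespace or newlines inside the tokens. The Doc2Vec parameters are copied from the PV-DBOW model in the question.

```python
def write_line_corpus(docs, path):
    """Write one space-delimited document per line; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for doc in docs:
            # Accept TaggedDocument-like objects (with .words) or plain token lists.
            words = getattr(doc, "words", doc)
            out.write(" ".join(words) + "\n")
            count += 1
    return count


def train_from_file(path, workers):
    """Train a PV-DBOW model straight from the line-per-document file."""
    from gensim.models.doc2vec import Doc2Vec  # imported lazily; requires gensim

    # Passing corpus_file to the constructor scans the vocabulary and trains in one go.
    return Doc2Vec(corpus_file=path, dm=0, vector_size=100, negative=5, hs=0,
                   min_count=2, sample=0, epochs=20, workers=workers)
```

Usage would be along the lines of write_line_corpus(all_x_w2v, "all_x_w2v.txt") followed by train_from_file("all_x_w2v.txt", workers=cores). Because the documents are identified by line number in this mode, keep the order of all_x_w2v around if you need to map vectors back to your own tags. As far as I recall, gensim.utils.save_as_line_sentence does a similar job for plain token lists.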

See the docs for Doc2Vec, and its parameter corpus_file:

https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec
