
I'm trying to train on multiple "documents" (here, mostly log lines), and Doc2Vec training is taking longer when I specify more than one core (which I have).

My data looks like this:

print(len(train_corpus))

7930196
print(train_corpus[:5])

[TaggedDocument(words=['port', 'ssh'], tags=[0]),
 TaggedDocument(words=['session', 'initialize', 'by', 'client'], tags=[1]),
 TaggedDocument(words=['dfs', 'fsnamesystem', 'block', 'namesystem', 'addstoredblock', 'blockmap', 'update', 'be', 'to', 'blk', 'size'], tags=[2]),
 TaggedDocument(words=['appl', 'selfupdate', 'component', 'amd', 'microsoft', 'windows', 'kernel', 'none', 'elevation', 'lower', 'version', 'revision', 'holder'], tags=[3]),
 TaggedDocument(words=['ramfs', 'tclass', 'blk', 'file'], tags=[4])]

I have 8 cores available:

print(os.cpu_count())

8

I am using gensim 4.1.2 on CentOS 7. Following this approach (stackoverflow.com/a/37190672/130288), it looks like my BLAS library is OpenBLAS, so I set OPENBLAS_NUM_THREADS=1 in my bashrc (and it is visible from Jupyter, using !echo $OPENBLAS_NUM_THREADS).
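One caveat worth checking with a minimal stdlib sketch: `!echo` in Jupyter runs in a fresh subshell, which may re-read bashrc, so it can report a value that the already-running Python kernel itself never inherited:

```python
import os

# Check the interpreter's own environment rather than a subshell's:
# `!echo $OPENBLAS_NUM_THREADS` in Jupyter spawns a new shell, which can
# show '1' even when the running kernel never saw the variable.
print(os.environ.get("OPENBLAS_NUM_THREADS"))  # None means the kernel never inherited it
```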

This is my test code:

import time

from gensim.models.doc2vec import Doc2Vec

dict_time_workers = dict()
for workers in range(1, 9):
    model = Doc2Vec(vector_size=20,
                    min_count=1,
                    workers=workers,
                    epochs=1)
    model.build_vocab(train_corpus, update=False)
    t1 = time.time()
    model.train(train_corpus, epochs=1, total_examples=model.corpus_count)
    dict_time_workers[workers] = time.time() - t1

And the variable dict_time_workers is equal to:

{1: 224.23211407661438, 
2: 273.408652305603, 
3: 313.1667754650116, 
4: 331.1840877532959, 
5: 433.83785605430603,
6: 545.671571969986, 
7: 551.6248495578766, 
8: 548.430994272232}

As you can see, the time taken increases instead of decreasing. The results seem to be the same with a larger epochs parameter. Nothing else is running on my CentOS 7 machine except this.

If I look at what's happening on my threads using htop, I see that the right number of threads is used for each training run. But the more threads are used, the lower each one's utilization (for example, with only one thread, 95% is used; with 2 threads, they both use around 65% of their max; with 6 threads, they sit at 20-25% ...). I suspected an IO issue, but iotop showed me that nothing bad is happening on the disk.

This question now seems related to this post: Not efficiently to use multi-Core CPU for training Doc2vec with gensim.

Naindlac

2 Answers


When getting no benefit from extra cores like that, it's likely that the BLAS library you've got installed is already configured to try to use all cores for every bulk array operation. That means that other attempts to engage more cores, like Gensim's workers specification, just increase the overhead of contention, when each individual worker thread's individual BLAS callouts also try to use 8 threads.

Depending on the BLAS library in use, its own propensity to use more cores can typically be limited by environment variables named something like OPENBLAS_NUM_THREADS and/or MKL_NUM_THREADS.

If you set these to just 1 before your process launches, you may see different, and possibly better, multithreaded behavior.
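A minimal sketch of forcing this from inside Python, which only works if it runs before the BLAS-backed libraries are first imported:

```python
import os

# These must be set BEFORE numpy/gensim (and thus the BLAS shared library)
# are first imported: OpenBLAS and MKL size their thread pools when the
# library loads, so exporting the variables afterwards has no effect on an
# already-running process.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"  # set both when unsure which BLAS is active

# ... only now import numpy / gensim and build the model ...
```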

Note, though: 1 just restores the assumption that every worker-thread only ever engages a single core. Some other mix of BLAS-cores & Gensim-worker-threads might actually achieve the best training throughput & non-contending core-utilization.

And, at least for Gensim workers, the actual thread count value achieving the best throughput will vary based on other model parameters that influence the relative amount of calculation time in highly-parallelizable code-blocks versus highly-contended blocks, especially window, vector_size, & negative. And, there's not really a shortcut to finding the best workers value except via trial-and-error: observing reported training rates in logs over a few minutes of running. (Though: any rate observed in, say, minutes 2-4 of an abbreviated trial run should be representative of the training rate through the whole corpus over multiple epochs.)

(For any system with at least 4 cores, the optimal value with a classic iterable corpus of TaggedDocuments is usually at least 3, no more than the number of cores, but also rarely more than 8-12 threads, due to other inherent sources of contention due to both Gensim's approach to fanning out the work among worker-threads, and the Python 'GIL'.)

Other thoughts:

  • the build_vocab() step is never multi-threaded, so benchmarking alternate workers values will give a truer readout of their effect by only timing the train() step
  • ensuring your iterable corpus does as little redundant work (like say IO & tokenization) on each pass can help limit any bottlenecks around the single manager thread doing each epoch's iteration & batching texts to the workers
  • the alternate corpus_file approach can achieve higher core utilization, up to any number of cores, by assigning each thread its own exclusive range of an input-file. But, it also means (a) your whole corpus must be in one uncompressed space-tokenized plain-text file; (b) your documents only get a single integer tag (their line-number); (c) you may be subject to some small as-yet-diagnosed-and-fixed bug(s). (See project issue #2747.)
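As a hedged sketch of that corpus_file route (the tiny in-memory corpus and file path here are placeholders): each document becomes one space-tokenized line of a plain-text file, and its line number becomes its tag:

```python
import os
import tempfile

# Stand-in for the real corpus: each document is a list of tokens
# (with TaggedDocuments you would join doc.words instead).
docs = [["port", "ssh"],
        ["session", "initialize", "by", "client"]]

corpus_path = os.path.join(tempfile.gettempdir(), "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as f:
    for words in docs:
        f.write(" ".join(words) + "\n")  # one doc per line, space-tokenized

# Then train from the file instead of the iterable; tags are line numbers:
# model = Doc2Vec(corpus_file=corpus_path, vector_size=20, workers=4, epochs=1)
```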
gojomo
  • Thanks for your answer. I had issues finding out which BLAS I have. Using this tutorial on scipy (caam37830.github.io/book/02_linear_algebra/blas_lapack.html), I assumed I have OpenBLAS. So I restarted my Jupyter, starting with !export OPENBLAS_NUM_THREADS=1. But sadly, I get exactly the same results with my code as before. Do you have any other ideas? – Naindlac Oct 28 '22 at 07:32
  • An export from inside a cell may not affect the currently-running Python interpreter. Check by using `os.environ`. (You *might* be able to set the relevant var there in a way that affects the BLAS library, too, if it's set before that library gets loaded - I'm not sure.) – gojomo Oct 28 '22 at 07:53
  • Sorry for the late answer. Put it on my bashrc, and the variable is visible using os.environ. But I still have exactly the same result than before, nothing changed. – Naindlac Oct 31 '22 at 10:50
  • I'd not assume you have OpenBLAS based on any online source; you should check your own system, using approaches like those described at , to be sure. (For example, I'm pretty sure any `conda`-based environment tends to install the often-faster Intel MKL.) You can also set the other `MKL_NUM_THREADS` variable just in case. I'd also again suggest tightening your timing to only evaluate the `train()` that uses multiple workers, for starker results, rather than the single-threaded build-vocab. – gojomo Oct 31 '22 at 15:19
  • Another step that could confirm/refute whether BLAS multithreading is involved: during a 1-workers run, does a tool like `top`/`htop`/etc, show (some periods of) >100% CPU utilization (many cores all highly active)? If so, *something* is effectively using many cores, even though Gensim has only requested one worker thread. Note also: a tiny `vector_size=20` value somewhat limits the potential speedups from BLAS optimizations & threading, as it means relatively-less time in the bulk calculation blocks that gain the most. – gojomo Oct 31 '22 at 15:22
  • If tinkering with the BLAS threads values don't change the observed behavior at all, please extend your question with the latest results for all `workers` values in both the original and explicitly-set THREADS=1 conditions. – gojomo Oct 31 '22 at 15:24
  • 1
    Following your advice, I updated a lot the initial question. – Naindlac Nov 02 '22 at 09:30
  • Without seeing it live, it's hard to be sure how to interpret `htop` info, but the vague pattern you've described – lower utilization per thread with more threads – is roughly what's expected in the classic corpus-iterable mode, even when more threads are helping, because of inherent contention there. I'm still not sure you tried the right env var, given how many reports there are of people being unsure which BLAS library is active. You could also try to explicitly use MKL (such as via a `conda` install) to improve this & other perf issues. (It's typically faster than OpenBLAS.) – gojomo Nov 02 '22 at 16:40
  • Note that given your description of the corpus as being completely object-in-memory, there should be essentially *zero* IO during your `train()` call when using the list iterable corpus. IO would only be a contributor if your `train_corpus` were something other than a list, streaming data from a volume, or if you'd somehow started relying on swapped virtual memory (a big no-no for an algorithm like this). – gojomo Nov 02 '22 at 16:44
  • Given your updates, it remains possible the tiny `vector_size=20` is a major contributor to the slowdown; by making the parallelizable bulk array ops shorter, that forecloses much potential parallelism. If your vocabulary is large enough I'd try a larger `vector_size`. (With 10s of thousands of unique terms, perhaps `vector_size=100`; with hundreds-of-thousands or more, `vector_size=300`.) – gojomo Nov 02 '22 at 16:46

OK, the best way to fully use the cores is to use the corpus_file parameter of Doc2Vec.

Running the same benchmark, the results look like:

{1: 114.58889961242676,
2: 82.8250150680542,
3: 71.52109575271606,
4: 67.1010684967041,
5: 75.96869373321533,
6: 100.68377351760864,
7: 116.7901406288147,
8: 139.53436756134033}

The threads do seem useful; in my case 4 is the best. It's still strange that the "regular" Doc2Vec mode is not that great at parallelizing.

Naindlac
  • Glad the `corpus_file` suggestion helped! If >4 is worse than 4, you might not have 8 true physical cores, but 8 simulated 'hyperthreading' cores on 4 physical cores. In such a case, stuff that could theoretically saturate 8 cores will actually do worse, because instead of 4 cores ideally-saturated by exactly 4 threads facing no io/coordination-blocking, you've got 8 max-intensity threads fighting for the 4 cores. – gojomo Nov 02 '22 at 16:50
  • @gojomo Yep, I think that you're right. I'll stop there since I got what I want, thanks for all the help. Time to tune the model ! – Naindlac Nov 03 '22 at 07:59
  • FYI - here's an answer on a way to check if your true number of physical cores is only 4: . Good luck! – gojomo Nov 03 '22 at 14:30
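One hedged, Linux-only sketch of such a check (the helper name is made up here): count unique (physical id, core id) pairs in /proc/cpuinfo and compare against the logical count from os.cpu_count():

```python
import os

def physical_core_count():
    """Estimate physical cores on Linux by parsing /proc/cpuinfo;
    returns None where that file is unavailable or uninformative."""
    cores = set()
    physical_id = None
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    physical_id = line.split(":")[1].strip()
                elif line.startswith("core id"):
                    # Each (socket, core) pair is one physical core,
                    # however many hyperthreads it exposes.
                    cores.add((physical_id, line.split(":")[1].strip()))
    except OSError:
        return None  # not Linux, or /proc unavailable
    return len(cores) or None

print("logical:", os.cpu_count(), "physical:", physical_core_count())
```

If physical_core_count() reports 4 while os.cpu_count() reports 8, the 8 "cores" are hyperthreads on 4 physical cores, matching the throughput peak at workers=4.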