I have a fairly large script that uses the libraries numpy, scipy, sklearn, and matplotlib. I need to limit its CPU usage to stop it from consuming all the available processing power on my computational cluster. Following this answer, I implemented the following block of code, which is executed as soon as the script is run:
import os
parallel_procs = "4"
os.environ["OMP_NUM_THREADS"] = parallel_procs
os.environ["MKL_NUM_THREADS"] = parallel_procs
os.environ["OPENBLAS_NUM_THREADS"] = parallel_procs
os.environ["VECLIB_MAXIMUM_THREADS"] = parallel_procs
os.environ["NUMEXPR_NUM_THREADS"] = parallel_procs
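One detail worth noting: these variables are only honoured if they are set before numpy/scipy are first imported, because the BLAS/OpenMP runtimes read them once at load time. A minimal sketch of the safe ordering (the loop over variable names is just a compact rewrite of the block above):

```python
import os

# Set the thread caps BEFORE importing any numerical library;
# the BLAS/OpenMP runtimes read these variables at load time.
parallel_procs = "4"
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = parallel_procs

import numpy as np  # imported only after the caps are in place
```

If numpy (or anything that imports it) is loaded earlier, the caps are silently ignored for that process.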
My understanding is that this should limit the number of cores used to 4, but apparently this is not happening. This is what htop shows for my user and that script: there are 16 processes, 4 of which show CPU percentages above 100%. This is an excerpt of the lscpu output:
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
I am also using the multiprocessing library further down in my code, where I set the same number of processes with multiprocessing.Pool(processes=4). Without the block of code shown above, the script insisted on using as many cores as possible, apparently ignoring the multiprocessing limit entirely.
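For context, the Pool usage looks roughly like this (the worker function here is a hypothetical stand-in for the real per-task computation):

```python
import multiprocessing as mp

def work(x):
    # hypothetical worker: stand-in for the real numpy/scipy computation
    return x * x

if __name__ == "__main__":
    # cap the number of worker processes, mirroring the thread caps above
    with mp.Pool(processes=4) as pool:
        results = pool.map(work, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that the two limits multiply: 4 pool workers each allowed 4 BLAS threads can still occupy up to 16 logical CPUs in total.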
My questions are then: what exactly am I limiting when I use the code above, and how should I interpret the htop output?