
I'm trying to run a program external to Python with multithreading using this code:

import multiprocessing.pool
from typing import Callable

def handle_multiprocessing_pool(num_threads: int, partial: Callable, variable: list) -> list:
    # TqdmBar is a custom progress-bar helper defined elsewhere in the script.
    progress_bar = TqdmBar(len(variable))
    with multiprocessing.pool.ThreadPool(num_threads) as pool:
        # Submit one job per case; the callback ticks the progress bar
        # as each job finishes.
        jobs = [
            pool.apply_async(partial, (value,), callback=progress_bar.update_progress_bar)
            for value in variable
        ]
        pool.close()
        # Collect the results in submission order.
        processing_results = [job.get() for job in jobs]
        pool.join()
    return processing_results

The Callable being called here loads an external program (with a C++ back-end), runs it, and then extracts some data. Inside its GUI, the external program has an option to run cases in parallel; each case is assigned to a thread, which is why I assumed multithreading (instead of multiprocessing) would be the best fit.
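To make the setup concrete, here is a minimal sketch of what such a worker roughly does; the executable name, its flags, and the extraction step are placeholders, not the real program:

import subprocess

def run_case(case_file: str) -> str:
    # Launch the external program (hypothetical command and flags),
    # block until it finishes, then extract data from its output.
    result = subprocess.run(
        ["external_solver", "--input", case_file],  # placeholder CLI
        capture_output=True,
        text=True,
        check=True,
    )
    # Placeholder for the real data-extraction step.
    return result.stdout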

The script runs without issues, but I cannot quite manage to utilise the CPU power of our machine efficiently. The machine has 64 cores, each with 2 hardware threads. I will list some of my findings about the CPU utilisation.

  1. When I run the cases from the GUI, it manages to utilise 100% of the CPU power.

  2. When I run the script on 120 threads, it seems like only half of the threads are properly engaged:

[screenshot: CPU utilisation with only about half the threads busy]

  3. The external program allows me to run each case on two threads; however, if I run 60 parallel processes on 2 threads each, the utilisation looks similar.

  4. When I run two similar scripts on 60 threads each, the full CPU power is properly used (a single-script version of this split is sketched after the list):

[screenshot: full CPU utilisation]
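Based on observation 4, the same split can be reproduced from one script by launching two processes, each running its own 60-thread pool on half of the cases. A sketch, assuming handle_multiprocessing_pool and the partial are defined at module level so they can be pickled:

import multiprocessing

def run_half(partial, cases):
    # A separate process means a separate interpreter (and GIL).
    return handle_multiprocessing_pool(60, partial, cases)

def run_split(partial, variable):
    mid = len(variable) // 2
    with multiprocessing.Pool(2) as outer:
        # Two worker processes, each driving its own thread pool.
        parts = outer.starmap(
            run_half, [(partial, variable[:mid]), (partial, variable[mid:])]
        )
    return parts[0] + parts[1]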

I have read about the Global Interpreter Lock in Python, but the multiprocessing package should circumvent this, right? Before test #4, I was assuming that for some reason the processes were each still confined to a single core, so the two threads on each were not able to run concurrently (this seems suggested here: multiprocessing.Pool vs multiprocessing.pool.ThreadPool), but especially the behaviour from #4 above puzzles me.
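For comparison, this is the process-based variant of the pool function that the linked thread contrasts with the ThreadPool; a sketch, assuming the partial and its results are picklable, since separate processes do not share memory:

import multiprocessing
from typing import Callable

def handle_multiprocessing_procs(num_procs: int, partial: Callable, variable: list) -> list:
    # Each worker is a separate process with its own interpreter and GIL,
    # so CPU-bound Python code can run truly in parallel.
    with multiprocessing.Pool(num_procs) as pool:
        jobs = [pool.apply_async(partial, (value,)) for value in variable]
        pool.close()
        results = [job.get() for job in jobs]
        pool.join()
    return results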

I have tried the suggestions from Why does multiprocessing use only a single core after I import numpy?, which unfortunately did not solve the problem.
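One of the fixes suggested there is resetting the process's CPU affinity mask, which some libraries clobber on import; on Linux that looks like this (a sketch):

import os

print(os.sched_getaffinity(0))                  # the set of cores this process may use
os.sched_setaffinity(0, range(os.cpu_count()))  # re-allow all cores (Linux only)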

  • `multiprocessing.pool.ThreadPool` runs on only 1 CPU and is useful only for IO-based parallelism. – vks Jan 27 '23 at 09:45
  • What is your C++ code supposed to do? Does it run BLAS primitives or any parallel stuff? Note that multiprocessing creates processes, not threads, and the former do not operate in shared memory (at least not by default), so data transfer needs to be done, as well as pickling. This generally introduces some significant overhead on large input/output data, but this is the way CPython works... – Jérôme Richard Jan 27 '23 at 10:07
  • Note that CPython threads can sometimes truly run in parallel for computational work, though this is rare. More specifically, the target modules need to release the GIL for this to be true. Numpy does that for parts of its computing functions, but it generally does not scale well unless you work on huge arrays, especially on such a target platform. – Jérôme Richard Jan 27 '23 at 10:09
  • Also note that the AMD TR is a NUMA machine with strong NUMA effects, so you need to care about them on such a machine. If you do not, then accessing data in shared memory can be much slower and not scale at all, since only 1 memory node may work and will likely be saturated (while many are available). Multiprocessing solves this problem unless you manually use shared memory. You can also randomize page access, but this is generally not great. Anyway, this does not explain the CPU utilisation, since a core waiting for a remote node should be marked as active during that time. – Jérôme Richard Jan 27 '23 at 10:14
