Q : "Why is running 5 to 8 in parallel at a time worse than running 4 at a time?"
Well, there are several reasons, and we will start from a static, most easily observable one :
The silicon design itself ( for which a few hardware tricks were used ) does not scale beyond 4. So the last Amdahl's-Law-promised speedup from adding just +1 processor is reached at a count of 4, and any next +1 will not upscale the performance the way observed in the { 2, 3, 4 }-case.
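A minimal arithmetic sketch of that saturation, assuming an illustrative parallel fraction p = 0.95 ( an assumption for the sake of the example, not a measured value ) :

    # Amdahl's Law : S( n ) = 1 / ( ( 1 - p ) + p / n )
    # p ....... parallel fraction of the workload ( ASSUMED 0.95 for illustration )
    # n ....... processes launched; only min( n, 4 ) can run truly in parallel here
    p = 0.95
    for n in range( 1, 9 ):
        n_eff = min( n, 4 )                             # just 4 physical cores exist
        S     = 1.0 / ( ( 1.0 - p ) + p / n_eff )
        print( f"n = {n}: speedup ~ {S:.2f} x" )

The printed speedups grow through the { 2, 3, 4 }-cases and flat-line from n = 5 onwards - exactly the saturation described above.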
This lstopo CPU-topology map helps to start to decode WHY ( shown here for 4 cores, but the logic is the same for your 8-core silicon - run lstopo on your device to see more details in vivo ) :
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Machine (31876MB) │
│ │
│ ┌────────────────────────────────────────────────────────────┐ ┌───────────────────────────┐ │
│ │ Package P#0 │ ├┤╶─┬─────┼┤╶───────┤ PCI 10ae:1F44 │ │
│ │ │ │ │ │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │ │ ┌────────────┐ ┌───────┐ │ │
│ │ │ L3 (8192KB) │ │ │ │ │ renderD128 │ │ card0 │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │ │ └────────────┘ └───────┘ │ │
│ │ │ │ │ │ │
│ │ ┌──────────────────────────┐ ┌──────────────────────────┐ │ │ │ ┌────────────┐ │ │
│ │ │ L2 (2048KB) │ │ L2 (2048KB) │ │ │ │ │ controlD64 │ │ │
│ │ └──────────────────────────┘ └──────────────────────────┘ │ │ │ └────────────┘ │ │
│ │ │ │ └───────────────────────────┘ │
│ │ ┌──────────────────────────┐ ┌──────────────────────────┐ │ │ │
│ │ │ L1i (64KB) │ │ L1i (64KB) │ │ │ ┌───────────────┐ │
│ │ └──────────────────────────┘ └──────────────────────────┘ │ ├─────┼┤╶───────┤ PCI 10bc:8268 │ │
│ │ │ │ │ │ │
│ │ ┌────────────┐┌────────────┐ ┌────────────┐┌────────────┐ │ │ │ ┌────────┐ │ │
│ │ │ L1d (16KB) ││ L1d (16KB) │ │ L1d (16KB) ││ L1d (16KB) │ │ │ │ │ enp2s0 │ │ │
│ │ └────────────┘└────────────┘ └────────────┘└────────────┘ │ │ │ └────────┘ │ │
│ │ │ │ └───────────────┘ │
│ │ ┌────────────┐┌────────────┐ ┌────────────┐┌────────────┐ │ │ │
│ │ │ Core P#0 ││ Core P#1 │ │ Core P#2 ││ Core P#3 │ │ │ ┌──────────────────┐ │
│ │ │ ││ │ │ ││ │ │ ├─────┤ PCI 1002:4790 │ │
│ │ │ ┌────────┐ ││ ┌────────┐ │ │ ┌────────┐ ││ ┌────────┐ │ │ │ │ │ │
│ │ │ │ PU P#0 │ ││ │ PU P#1 │ │ │ │ PU P#2 │ ││ │ PU P#3 │ │ │ │ │ ┌─────┐ ┌─────┐ │ │
│ │ │ └────────┘ ││ └────────┘ │ │ └────────┘ ││ └────────┘ │ │ │ │ │ sr0 │ │ sda │ │ │
│ │ └────────────┘└────────────┘ └────────────┘└────────────┘ │ │ │ └─────┘ └─────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │ └──────────────────┘ │
│ │ │
│ │ ┌───────────────┐ │
│ └─────┤ PCI 1002:479c │ │
│ └───────────────┘ │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
A closer look, like the one from a call to the hwloc tool lstopo-no-graphics -.ascii, shows where mutual processing independence ends - here, at the level of the shared L1-instruction-cache ( the L3 cache is shared too, yet it sits at the top of the hierarchy and at such a size that it bothers large-problem solvers only, not our case ).
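Before right-sizing any pool-of-workers, it is fair to first ask the O/S how many processors the process may actually use - a minimal sketch ( note : os.cpu_count() reports logical PUs, not physical cores, and os.sched_getaffinity() is a Linux-only call ) :

    import os

    logical = os.cpu_count()                          # logical PUs, incl. SMT siblings
    try:
        usable = len( os.sched_getaffinity( 0 ) )     # PUs this process may run on
    except AttributeError:                            # non-Linux platforms
        usable = logical
    print( f"logical PUs: {logical}, usable by this process: {usable}" )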
Next comes a worse, yet still observable, reason WHY it gets even worse with 8 processes :
Q : "Why does running 8 in parallel not twice as fast as running 4 in parallel i.e. why is it not ~3.5s
?"
Because of thermal management.

The more work is loaded onto the CPU-cores, the more heat is produced from driving electrons at ~3.5+ GHz through the silicon maze. Thermal constraints are what prevents any further performance boost in CPU computing power, simply because the Laws of physics, as we know them, do not permit growth beyond some material-defined limits.
So what comes next?
The CPU-design has circumvented not the physics ( that is impossible ), but us, the users - by promising a CPU chip running at ~3.5+ GHz. In fact, the CPU can use this clock-rate only for small amounts of time - until the dissipated heat gets the silicon close to its thermal limits. Then the CPU will either reduce its own clock-rate as an overheating-defensive step ( this reduces the performance, doesn't it? ), or some CPU-micro-architectures may hop ( move a flow of processing ) onto another, free, thus cooler, CPU-core ( which keeps the promise of a higher clock-rate there, at least for some small amount of time, yet also reduces the performance, as the hop does not occur in zero-time and does not happen at zero-costs - cache-losses, re-fetches etc. ).
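Where the O/S permits it, pinning a worker onto a fixed core avoids paying these hop-costs over and over - a minimal sketch, assuming a Linux system ( os.sched_setaffinity() is not available elsewhere ); the squared-sums loop is a hypothetical stand-in workload :

    import os
    import multiprocessing as mp

    def pinned( core_id ):
        # Restrict this worker to a single PU ( Linux-only API ), so the O/S will
        # not hop it elsewhere - caches stay warm, no migration costs are paid.
        os.sched_setaffinity( 0, { core_id } )
        return sum( i * i for i in range( 10**6 ) )   # hypothetical stand-in work

    if __name__ == "__main__":
        with mp.Pool( 4 ) as pool:
            print( pool.map( pinned, range( 4 ) ) )   # ~ one task per core 0..3

( The task-to-core mapping above is illustrative only - a Pool does not guarantee one task per worker process. )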
This picture shows a snapshot of such core-hopping - cores 0-19 got too hot and are under the Thermal Throttling cap, while cores 20-39 can ( at least for now ) run at full speed.
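To take such a snapshot on your own device, the Linux cpufreq counters in sysfs can be read directly - a minimal sketch ( the scaling_cur_freq files are standard Linux cpufreq sysfs entries, yet not every kernel / driver exposes them ) :

    import glob

    # Read the momentary clock-rate of every PU from the Linux cpufreq sysfs tree.
    for path in sorted( glob.glob( "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq" ) ):
        cpu = path.split( "/" )[5]                    # e.g. "cpu0"
        with open( path ) as f:
            khz = int( f.read().strip() )             # value is reported in kHz
        print( f"{cpu}: {khz / 1e6:.2f} GHz" )        # throttled cores read lower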

The Result?
The thermal-constraints are here to stay ( diving the CPU into a pool of liquid nitrogen has been demonstrated for a "popular" magazine show, yet it is not a reasonable option for any sustainable computing, as the mechanical stress of going from a deep-frozen state into a 6+ GHz clock-rate, steam-forming super-heater cracks the body of the CPU and will result in CPU-death from cracks and mechanical fatigue within but a few workload episodes - so a no-go zone, due to negative ROI for any serious project ).
Good cooling and right-sizing of the pool-of-workers, based on in-vivo pre-testing ( as sketched below ), is the only sure bet here.
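A minimal in-vivo pre-testing sketch for such right-sizing - the work() function below is a hypothetical stand-in for one unit of your real CPU-bound workload :

    import time
    import multiprocessing as mp

    def work( _ ):
        return sum( i * i for i in range( 2 * 10**6 ) )   # hypothetical stand-in work

    if __name__ == "__main__":
        TASKS = 16
        for n in range( 1, mp.cpu_count() + 1 ):
            t0 = time.perf_counter()
            with mp.Pool( n ) as pool:
                pool.map( work, range( TASKS ) )
            print( f"{n:2d} workers : {time.perf_counter() - t0:6.2f} s" )
        # pick the pool-size at which the wall-clock time stops improving

The sweet spot will typically sit at or below the count of physical cores, and thermal throttling will show up as growing per-worker times for the larger pools.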
Another architecture :
