
I am using multiprocessing in Python 3.7.

Some articles say that a good choice for the number of processes to use in a Pool is the number of CPU cores.

My AMD Ryzen CPU has 8 cores and can run 16 threads.

So, should the number of processes be 8 or 16?

import multiprocessing as mp
pool = mp.Pool(processes=16)    # since 16 threads are supported?
  • From my experience with multithreading, you are right. However, if the "pieces" of work become too small or you have too many locked variables, more processes will slow you down. – Rennnyyy May 26 '20 at 05:38
  • Have you seen this [post](https://stackoverflow.com/questions/1006289/how-to-find-out-the-number-of-cpus-using-python) that points to `multiprocessing.cpu_count()`? – Balaji Ambresh May 26 '20 at 05:44
  • The traditional recommendation is the number of possible threads plus one. However, more threads don't automatically translate into better efficiency, as it depends on the work being done, the synchronization needed between threads, possible I/O, and many other things. To begin with, you need to add statistics collection to make sure that your current pool is fully used, that all threads are busy continuously without being idle. Only then is it a sign that you might need to increase the number of threads in the pool. – Some programmer dude May 26 '20 at 05:44
  • It depends what you are doing. If it is something with a very high latency, you may be better off having many more processes than CPU cores. – Mark Setchell May 26 '20 at 06:44

1 Answer


Q : "So, should the number of processes be 8 or 16?"


If the to-be-distributed workloads are cache-reuse intensive (not memory-I/O bound), the SpaceDOMAIN constraints rule, as the size of the cacheable data will play the cardinal role in deciding between 8 and 16.

Why?
Because memory-I/O costs are about a thousand times more expensive in the TimeDOMAIN, paying about 300 ~ 400 [ns] per memory-I/O, compared to 0.1 ~ 0.4 [ns] for in-cache data.

How to Make The Decision?
Run a small-scale test before deciding on the production-scale configuration.
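
A minimal sketch of such a small-scale test (the workload, chunk sizes and Pool sizes below are illustrative assumptions, not a recommendation) could look like this:

```python
import multiprocessing as mp
import time

def crunch(chunk):
    # purely in-cache arithmetic on a small list -- stands in for the real, cache-friendly work
    s = 0
    for x in chunk:
        s += x * x
    return s

if __name__ == "__main__":
    data = [list(range(10_000))] * 256          # 256 small, cache-friendly chunks
    for n_procs in (4, 8, 16):
        t0 = time.perf_counter()
        with mp.Pool(processes=n_procs) as pool:
            pool.map(crunch, data)
        dt = time.perf_counter() - t0
        print(f"{n_procs:>2} processes: {dt:.3f} s")
```

Whichever Pool size wins on a representative sample of your real data is the better guide than any generic 8-vs-16 rule.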


If the to-be-distributed workloads depend on network-I/O, or on another remarkable (locally non-singular) source of latency, the TimeDOMAIN may benefit from a latency-masking trick: running 16, 160 or even 1600 threads (not processes in this case).

Why?
Because over-the-network I/O brings so much waiting time: a few [ms] of network-I/O RTT latency is enough time to execute about 1E7 ~ 10,000,000 uops per CPU core, which is quite a lot of work. So smart interleaving, here using latency-masked thread-based concurrent processing, may fit, as threads waiting for the remote "answer" from over-the-network I/O need not fight for the GIL lock: they have nothing to compute until their expected I/O bytes come back, have they?

How to Make The Decision?
Review the code to determine how many over-the-network-I/O fetches and how many roughly cache-footprint-sized reads are in the game (in 2020/Q2+, L1 caches grew to about a few [MB]). For those cases where these operations repeat many times, do not hesitate to spin up one thread per "slow" network-I/O target, as the processing will benefit from the coincidentally created masking of the "long" waiting times, at the cost of only cheap ("fast") and, due to the "many" and "long" waiting times, rather sparse thread switching, or even of the O/S-driven process scheduler mapping full sub-processes onto a free CPU core.
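
As a rough, hypothetical illustration of the latency-masking idea (the URLs, thread count and timeout below are placeholders, not part of the original question), one might let many threads block on remote fetches concurrently:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = ["https://example.com/"] * 64              # hypothetical slow, remote targets

def fetch(url):
    # the thread spends almost all of its time waiting here, not computing
    with urllib.request.urlopen(url, timeout=10) as resp:
        return len(resp.read())

if __name__ == "__main__":
    # far more threads than CPU cores is fine here -- they mostly sleep on network-I/O
    with ThreadPoolExecutor(max_workers=64) as executor:
        sizes = list(executor.map(fetch, URLS))
    print(sum(sizes), "bytes fetched")
```

Because each thread spends nearly all of its time blocked on the network, the GIL is released while it waits, so running far more threads than CPU cores can still pay off in this I/O-latency-dominated case.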


If the to-be-distributed workloads are some mix of the above cases, there is no other way than to experiment on the actual hardware with its local / non-local resources.

Why?
Because there is no rule of thumb to fine-tune the mapping of the workload processing onto the actual CPU-core resources.


Still, one may easily end up paying way more than one ever gets back:
the known trap of achieving a SlowDown, instead of the (just wished-for) SpeedUp.

In all cases, the overhead-strict, resources-aware, workload-atomicity-respecting revision of Amdahl's Law identifies a point of diminishing returns, after which adding more workers (CPU cores) will not improve the wished-for Speedup. Many surprises of getting S << 1 are described in Stack Overflow posts, so one may read as many examples of what not to do (learning by anti-patterns) as one may wish.
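
For intuition only, here is a toy calculation in the spirit of such an overhead-aware speedup model (the parallel fraction and per-worker overhead below are made-up numbers, not measurements):

```python
def speedup(n_workers, parallel_fraction=0.95, overhead_per_worker=0.01):
    # overhead-aware, Amdahl-style model:
    # S(N) = 1 / ( (1 - p) + p / N + o * N )
    serial_part = 1.0 - parallel_fraction
    return 1.0 / (serial_part
                  + parallel_fraction / n_workers
                  + overhead_per_worker * n_workers)

for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:>2} workers -> modelled speedup {speedup(n):.2f}")
# with these toy numbers the speedup peaks and then degrades again --
# adding workers past that sweet spot buys a SlowDown, not a SpeedUp,
# and with large enough overheads S can even fall below 1
```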
