Update

I am running the program below with different values of fibonacci_n (the greater the number, the more CPU time and memory each task requires) under Windows 10 on a machine with 8 logical cores (4 physical cores):
import multiprocessing
import psutil
from timeit import default_timer as timer

def get_fibonacci(n):
    if n <= 1:
        return n
    else:
        return get_fibonacci(n-1) + get_fibonacci(n-2)

if __name__ == '__main__':
    #fibonacci_n = 50
    fibonacci_n = 35
    parallel_tasks = 60

    # Pool sized to all logical cores:
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        start = timer()
        for _ in range(parallel_tasks):
            pool.apply_async(get_fibonacci, args=(fibonacci_n,))
        # wait for all submitted tasks to complete:
        pool.close()
        pool.join()
        end = timer()
        print(end - start)

    # Pool sized to physical cores only:
    with multiprocessing.Pool(psutil.cpu_count(logical=False)) as pool:
        start = timer()
        for _ in range(parallel_tasks):
            pool.apply_async(get_fibonacci, args=(fibonacci_n,))
        # wait for all submitted tasks to complete:
        pool.close()
        pool.join()
        end = timer()
        print(end - start)
Prints (first column: pool sized to all 8 logical cores; second column: pool sized to the 4 physical cores; times rounded to 4 decimal places):

fibonacci_n    8 logical cores    4 physical cores
          5             0.1943              0.1307
         10             0.2011              0.1567
         15             0.2635              0.1192
         20             0.1802              0.2103
         25             0.6708              0.6244
         30             4.2844              5.5540
         35            46.9901             61.3083
For smaller values of fibonacci_n, using all of the logical cores appears to hurt performance, but as fibonacci_n grows larger, using all of the logical cores appears to improve performance.
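As an aside, the benchmark above fires off tasks and discards their results. If you actually need the computed values, apply_async returns an AsyncResult whose get method blocks until that task's value is available. A minimal sketch (task count and argument chosen arbitrarily for illustration):

```python
import multiprocessing

def get_fibonacci(n):
    if n <= 1:
        return n
    return get_fibonacci(n - 1) + get_fibonacci(n - 2)

if __name__ == '__main__':
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        # apply_async returns immediately with an AsyncResult;
        # get() waits for and returns that task's value.
        async_results = [pool.apply_async(get_fibonacci, args=(10,))
                         for _ in range(4)]
        results = [r.get() for r in async_results]
    print(results)  # [55, 55, 55, 55]
```

Calling get() on each result also re-raises any exception the worker function raised, which the fire-and-forget version above silently swallows.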
Running the same program under the Windows Subsystem for Linux (Debian) on the same machine produces the following interesting results:
fibonacci_n    8 logical cores    4 physical cores
          5             0.0079              0.0115
         10             0.0110              0.0115
         15             0.0116              0.0129
         20             0.0456              0.0601
         25             0.4329              0.5055
         30             4.7444              5.7842
         35            51.2360             63.5737
For smaller values of fibonacci_n, Linux clearly outperforms Windows, but for the largest values it seems Windows might have a slight advantage.
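One likely contributor to the Linux advantage at small task sizes is the default process start method: Linux forks worker processes, while Windows must spawn fresh interpreter processes (pool creation itself falls outside the timed region here, but the start method can still influence general platform behavior). A quick diagnostic, not part of the benchmark:

```python
import multiprocessing

if __name__ == '__main__':
    # Typically 'fork' on Linux and 'spawn' on Windows;
    # spawn starts a brand-new interpreter for each worker.
    print(multiprocessing.get_start_method())
```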
I then modified the program to use a single call to Pool.map rather than 60 calls to Pool.apply_async. The big difference here is that the former will by default submit tasks to the multiprocessing task queue in "chunks" whose size is a function of the length of the iterable being passed and the pool size. The net result is that, in general, we can expect fewer cross-address-space writes; that overhead is not insignificant when the worker function itself is not particularly CPU-intensive, which is the case for small values of fibonacci_n.
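For reference, this is roughly how CPython computes the default chunksize for Pool.map when none is supplied (paraphrased from the multiprocessing.pool source; details may vary by Python version):

```python
def default_chunksize(n_tasks, pool_size):
    # Paraphrased from CPython's multiprocessing.pool.Pool._map_async:
    # aim for roughly 4 chunks per worker process.
    chunksize, extra = divmod(n_tasks, pool_size * 4)
    if extra:
        chunksize += 1
    return chunksize

# With 60 tasks and a pool of 8 workers: divmod(60, 32) -> (1, 28),
# so each chunk holds 2 tasks instead of 1.
print(default_chunksize(60, 8))  # 2
```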
The program:

import multiprocessing
import psutil
from functools import partial
from timeit import default_timer as timer

def get_fibonacci(n):
    if n <= 1:
        return n
    else:
        return get_fibonacci(n-1) + get_fibonacci(n-2)

def do_fibonacci(fibonacci_n, index):
    # Adapter so Pool.map, which passes a single argument (the index),
    # can drive get_fibonacci with a fixed n:
    return get_fibonacci(fibonacci_n)

if __name__ == '__main__':
    parallel_tasks = 60
    for fibonacci_n in (5, 10, 15, 20, 25, 30):
        print('fibonacci_n =', fibonacci_n)
        worker = partial(do_fibonacci, fibonacci_n)

        # Pool sized to all logical cores:
        with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
            start = timer()
            pool.map(worker, range(parallel_tasks))
            end = timer()
            print(end - start)
            pool.close()
            pool.join()

        # Pool sized to physical cores only:
        with multiprocessing.Pool(psutil.cpu_count(logical=False)) as pool:
            start = timer()
            pool.map(worker, range(parallel_tasks))
            end = timer()
            print(end - start)
            pool.close()
            pool.join()
        print()
Prints on Windows:

fibonacci_n    8 logical cores    4 physical cores
          5             0.1474              0.1115
         10             0.1639              0.1145
         15             0.1668              0.1169
         20             0.2073              0.1746
         25             0.5368              0.5822
         30             4.1381              5.7480
You can see that all the times are reduced a bit. And on Linux:
fibonacci_n    8 logical cores    4 physical cores
          5             0.0041              0.0020
         10             0.0047              0.0033
         15             0.0070              0.0070
         20             0.0537              0.0428
         25             0.4276              0.5199
         30             4.6192              5.9010
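Since chunking is where map gains its edge, it may also be worth experimenting with an explicit chunksize argument instead of the default. A sketch using the same worker functions as above (pool size and chunk size chosen arbitrarily for illustration):

```python
import multiprocessing
from functools import partial

def get_fibonacci(n):
    if n <= 1:
        return n
    return get_fibonacci(n - 1) + get_fibonacci(n - 2)

def do_fibonacci(fibonacci_n, index):
    return get_fibonacci(fibonacci_n)

if __name__ == '__main__':
    worker = partial(do_fibonacci, 15)
    with multiprocessing.Pool(4) as pool:
        # One chunk of 15 tasks per worker: a single cross-process
        # write per worker instead of one per task.
        results = pool.map(worker, range(60), chunksize=15)
    print(len(results))  # 60
```

Larger chunks reduce queueing overhead but also reduce load-balancing granularity, so the sweet spot depends on how uniform the tasks are.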
But there is a conclusion:
Conclusion
There are many things that contribute to multiprocessing performance: the overhead of creating the pool processes (not included in the above benchmarks), the overhead of submitting tasks and getting results back (which depends on which methods are used and what chunksize argument values are in effect), the platform, etc. It should be clear that for tasks that are not particularly CPU-intensive, the gains realized by parallelism do not compensate for these various overheads. But as tasks become more and more CPU-intensive, using all the logical cores appears to improve performance rather than hinder it. And, no, I cannot explain your results if they differ from these. But it seems logical to me that for tasks that are purely CPU-bound, you cannot gain anything by having a pool size greater than the number of logical cores you have. If, however, you have a mix of CPU and I/O, that is another story.
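To illustrate that last point, here is a hedged sketch in which each task spends most of its time blocked (simulated with time.sleep standing in for I/O; pool sizes and task counts are arbitrary choices, not a calibrated benchmark). With such a workload, a pool larger than the number of logical cores can still pay off, because blocked workers occupy no CPU:

```python
import multiprocessing
import time
from timeit import default_timer as timer

def mixed_task(_):
    time.sleep(0.1)  # simulated I/O wait: holds no CPU
    return sum(i * i for i in range(10_000))  # a little CPU work

if __name__ == '__main__':
    # e.g. a pool matching the logical-core count vs. an oversized pool:
    for pool_size in (8, 32):
        with multiprocessing.Pool(pool_size) as pool:
            start = timer()
            pool.map(mixed_task, range(64))
            # The larger pool typically finishes sooner here because
            # the sleeps overlap across more workers.
            print(pool_size, timer() - start)
```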