Update

I am running the program below with different values of fibonacci_n (the greater the number, the more CPU time and memory each task requires) under Windows 10 on a machine with 8 logical cores (4 physical cores):
import multiprocessing
import psutil
from timeit import default_timer as timer

def get_fibonacci(n):
    if n <= 1:
        return n
    else:
        return get_fibonacci(n-1) + get_fibonacci(n-2)

if __name__ == '__main__':
    #fibonacci_n = 50
    fibonacci_n = 35
    parallel_tasks = 60

    # Pool sized to all logical cores:
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        start = timer()
        for _ in range(parallel_tasks):
            pool.apply_async(get_fibonacci, args=(fibonacci_n,))
        # wait for all submitted tasks to complete:
        pool.close()
        pool.join()
        end = timer()
        print(end - start)

    # Pool sized to physical cores only:
    with multiprocessing.Pool(psutil.cpu_count(logical=False)) as pool:
        start = timer()
        for _ in range(parallel_tasks):
            pool.apply_async(get_fibonacci, args=(fibonacci_n,))
        # wait for all submitted tasks to complete:
        pool.close()
        pool.join()
        end = timer()
        print(end - start)
Prints (first column: pool sized to all 8 logical cores; second column: pool sized to the 4 physical cores; times rounded to 4 decimal places):

fibonacci_n    8 logical cores    4 physical cores
          5             0.1943              0.1307
         10             0.2011              0.1567
         15             0.2635              0.1192
         20             0.1802              0.2103
         25             0.6708              0.6244
         30             4.2844              5.5540
         35            46.9901             61.3083
For smaller values of fibonacci_n, using all of the logical cores appears to hurt performance, but as fibonacci_n grows larger, using all of the logical cores appears to improve performance.
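As an aside, the benchmark above fires off tasks and discards their results. If you actually need the computed values, apply_async returns an AsyncResult whose get method blocks until that task's value is available. A minimal sketch (task count and argument chosen arbitrarily for illustration):

```python
import multiprocessing

def get_fibonacci(n):
    if n <= 1:
        return n
    return get_fibonacci(n - 1) + get_fibonacci(n - 2)

if __name__ == '__main__':
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        # apply_async returns immediately with an AsyncResult;
        # get() waits for and returns that task's value.
        async_results = [pool.apply_async(get_fibonacci, args=(10,))
                         for _ in range(4)]
        results = [r.get() for r in async_results]
    print(results)  # [55, 55, 55, 55]
```

Calling get() on each result also re-raises any exception the worker function raised, which the fire-and-forget version above silently swallows.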
Running the same program under the Windows Subsystem for Linux (Debian) on the same machine produces the following interesting results:
fibonacci_n    8 logical cores    4 physical cores
          5             0.0079              0.0115
         10             0.0110              0.0115
         15             0.0116              0.0129
         20             0.0456              0.0601
         25             0.4329              0.5055
         30             4.7444              5.7842
         35            51.2360             63.5737
For smaller values of fibonacci_n, Linux clearly outperforms Windows, but for the largest values it seems Windows might have a slight advantage.
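One likely contributor to the Linux advantage at small task sizes is the default process start method: Linux forks worker processes, while Windows must spawn fresh interpreter processes (pool creation itself falls outside the timed region here, but the start method can still influence general platform behavior). A quick diagnostic, not part of the benchmark:

```python
import multiprocessing

if __name__ == '__main__':
    # Typically 'fork' on Linux and 'spawn' on Windows;
    # spawn starts a brand-new interpreter for each worker.
    print(multiprocessing.get_start_method())
```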
I then modified the program to use a single call to Pool.map rather than 60 calls to Pool.apply_async. The big difference here is that the former will by default submit tasks to the multiprocessing task queue in "chunks" whose size is a function of the length of the iterable being passed and the pool size. The net result is that, in general, we can expect fewer cross-address-space writes; that overhead is not insignificant when the worker function itself is not particularly CPU-intensive, which is the case for small values of fibonacci_n.
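For reference, this is roughly how CPython computes the default chunksize for Pool.map when none is supplied (paraphrased from the multiprocessing.pool source; details may vary by Python version):

```python
def default_chunksize(n_tasks, pool_size):
    # Paraphrased from CPython's multiprocessing.pool.Pool._map_async:
    # aim for roughly 4 chunks per worker process.
    chunksize, extra = divmod(n_tasks, pool_size * 4)
    if extra:
        chunksize += 1
    return chunksize

# With 60 tasks and a pool of 8 workers: divmod(60, 32) -> (1, 28),
# so each chunk holds 2 tasks instead of 1.
print(default_chunksize(60, 8))  # 2
```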
The program:

import multiprocessing
import psutil
from functools import partial
from timeit import default_timer as timer

def get_fibonacci(n):
    if n <= 1:
        return n
    else:
        return get_fibonacci(n-1) + get_fibonacci(n-2)

def do_fibonacci(fibonacci_n, index):
    # Adapter so Pool.map, which passes a single argument (the index),
    # can drive get_fibonacci with a fixed n:
    return get_fibonacci(fibonacci_n)

if __name__ == '__main__':
    parallel_tasks = 60
    for fibonacci_n in (5, 10, 15, 20, 25, 30):
        print('fibonacci_n =', fibonacci_n)
        worker = partial(do_fibonacci, fibonacci_n)

        # Pool sized to all logical cores:
        with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
            start = timer()
            pool.map(worker, range(parallel_tasks))
            end = timer()
            print(end - start)
            pool.close()
            pool.join()

        # Pool sized to physical cores only:
        with multiprocessing.Pool(psutil.cpu_count(logical=False)) as pool:
            start = timer()
            pool.map(worker, range(parallel_tasks))
            end = timer()
            print(end - start)
            pool.close()
            pool.join()
        print()
Prints on Windows:

fibonacci_n    8 logical cores    4 physical cores
          5             0.1474              0.1115
         10             0.1639              0.1145
         15             0.1668              0.1169
         20             0.2073              0.1746
         25             0.5368              0.5822
         30             4.1381              5.7480
You can see that all the times are reduced a bit. And on Linux:
fibonacci_n    8 logical cores    4 physical cores
          5             0.0041              0.0020
         10             0.0047              0.0033
         15             0.0070              0.0070
         20             0.0537              0.0428
         25             0.4276              0.5199
         30             4.6192              5.9010
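Since chunking is where map gains its edge, it may also be worth experimenting with an explicit chunksize argument instead of the default. A sketch using the same worker functions as above (pool size and chunk size chosen arbitrarily for illustration):

```python
import multiprocessing
from functools import partial

def get_fibonacci(n):
    if n <= 1:
        return n
    return get_fibonacci(n - 1) + get_fibonacci(n - 2)

def do_fibonacci(fibonacci_n, index):
    return get_fibonacci(fibonacci_n)

if __name__ == '__main__':
    worker = partial(do_fibonacci, 15)
    with multiprocessing.Pool(4) as pool:
        # One chunk of 15 tasks per worker: a single cross-process
        # write per worker instead of one per task.
        results = pool.map(worker, range(60), chunksize=15)
    print(len(results))  # 60
```

Larger chunks reduce queueing overhead but also reduce load-balancing granularity, so the sweet spot depends on how uniform the tasks are.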
But there is a conclusion:
Conclusion
There are many things that contribute to multiprocessing performance: the overhead of creating the pool processes (not included in the above benchmarks), the overhead of submitting tasks and getting results back (which depends on which methods are used and what chunksize argument values are in effect), the platform, etc. It should be clear that for tasks that are not particularly CPU-intensive, the gains realized by parallelism do not compensate for these various overheads. But as tasks become more and more CPU-intensive, using all the logical cores appears to improve performance rather than hinder it. And, no, I cannot explain your results if they differ from these. But it seems logical to me that for tasks that are purely CPU-bound, you cannot gain anything by having a pool size greater than the number of logical cores you have. If, however, you have a mix of CPU and I/O, that is another story.
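To illustrate that last point, here is a hedged sketch in which each task spends most of its time blocked (simulated with time.sleep standing in for I/O; pool sizes and task counts are arbitrary choices, not a calibrated benchmark). With such a workload, a pool larger than the number of logical cores can still pay off, because blocked workers occupy no CPU:

```python
import multiprocessing
import time
from timeit import default_timer as timer

def mixed_task(_):
    time.sleep(0.1)  # simulated I/O wait: holds no CPU
    return sum(i * i for i in range(10_000))  # a little CPU work

if __name__ == '__main__':
    # e.g. a pool matching the logical-core count vs. an oversized pool:
    for pool_size in (8, 32):
        with multiprocessing.Pool(pool_size) as pool:
            start = timer()
            pool.map(mixed_task, range(64))
            # The larger pool typically finishes sooner here because
            # the sleeps overlap across more workers.
            print(pool_size, timer() - start)
```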