
I am trying to speed up some heavy simulations by using Python's multiprocessing module on a machine with 24 cores that runs SUSE Linux. From reading the documentation, I understand that this only makes sense if the individual calculations take much longer than the overhead of creating the pool, etc.

What confuses me is that the execution time of some of the individual processes is much longer with multiprocessing than when I just run a single process. In my actual simulations the time increases from around 300 s to up to 1500 s. Interestingly, this gets worse when I use more processes.

The following example illustrates the problem with a slightly shorter dummy loop:

from time import clock, time
import multiprocessing
import os


def simulate(params):
    t1 = clock()
    result = 0
    for i in range(10000):
        for j in range(10000):
            result += i * j
    pid = os.getpid()
    print 'pid: ', pid, ' sim time: ', clock() - t1, 'seconds'
    return result

if __name__ == '__main__':

    for n_procs in [1, 5, 10, 20]:
        print n_procs, ' processes:'
        t1 = time()
        result = multiprocessing.Pool(processes=n_procs).map(simulate, range(20))
        print 'total: ', time() - t1

This produces the following output:

1  processes:
pid:  1872  sim time:  8.1 seconds
pid:  1872  sim time:  7.92 seconds
pid:  1872  sim time:  7.93 seconds
pid:  1872  sim time:  7.89 seconds
pid:  1872  sim time:  7.87 seconds
pid:  1872  sim time:  7.74 seconds
pid:  1872  sim time:  7.83 seconds
pid:  1872  sim time:  7.84 seconds
pid:  1872  sim time:  7.88 seconds
pid:  1872  sim time:  7.82 seconds
pid:  1872  sim time:  8.83 seconds
pid:  1872  sim time:  7.91 seconds
pid:  1872  sim time:  7.97 seconds
pid:  1872  sim time:  7.84 seconds
pid:  1872  sim time:  7.87 seconds
pid:  1872  sim time:  7.91 seconds
pid:  1872  sim time:  7.86 seconds
pid:  1872  sim time:  7.9 seconds
pid:  1872  sim time:  7.96 seconds
pid:  1872  sim time:  7.97 seconds
total:  159.337743998
5  processes:
pid:  1906  sim time:  8.66 seconds
pid:  1907  sim time:  8.74 seconds
pid:  1908  sim time:  8.75 seconds
pid:  1905  sim time:  8.79 seconds
pid:  1909  sim time:  9.52 seconds
pid:  1906  sim time:  7.72 seconds
pid:  1908  sim time:  7.74 seconds
pid:  1907  sim time:  8.26 seconds
pid:  1905  sim time:  8.45 seconds
pid:  1909  sim time:  9.25 seconds
pid:  1908  sim time:  7.48 seconds
pid:  1906  sim time:  8.4 seconds
pid:  1907  sim time:  8.23 seconds
pid:  1905  sim time:  8.33 seconds
pid:  1909  sim time:  8.15 seconds
pid:  1908  sim time:  7.47 seconds
pid:  1906  sim time:  8.19 seconds
pid:  1907  sim time:  8.21 seconds
pid:  1905  sim time:  8.27 seconds
pid:  1909  sim time:  8.1 seconds
total:  35.1368539333
10  processes:
pid:  1918  sim time:  8.79 seconds
pid:  1920  sim time:  8.81 seconds
pid:  1915  sim time:  14.78 seconds
pid:  1916  sim time:  14.78 seconds
pid:  1914  sim time:  14.81 seconds
pid:  1922  sim time:  14.81 seconds
pid:  1913  sim time:  14.98 seconds
pid:  1921  sim time:  14.97 seconds
pid:  1917  sim time:  15.13 seconds
pid:  1919  sim time:  15.13 seconds
pid:  1920  sim time:  8.26 seconds
pid:  1918  sim time:  8.34 seconds
pid:  1915  sim time:  9.03 seconds
pid:  1921  sim time:  9.03 seconds
pid:  1916  sim time:  9.39 seconds
pid:  1913  sim time:  9.27 seconds
pid:  1914  sim time:  12.12 seconds
pid:  1922  sim time:  12.17 seconds
pid:  1917  sim time:  12.15 seconds
pid:  1919  sim time:  12.17 seconds
total:  27.4067809582
20  processes:
pid:  1941  sim time:  8.63 seconds
pid:  1939  sim time:  10.32 seconds
pid:  1931  sim time:  12.35 seconds
pid:  1936  sim time:  12.23 seconds
pid:  1937  sim time:  12.82 seconds
pid:  1942  sim time:  12.73 seconds
pid:  1932  sim time:  13.01 seconds
pid:  1946  sim time:  13.0 seconds
pid:  1945  sim time:  13.74 seconds
pid:  1944  sim time:  14.03 seconds
pid:  1929  sim time:  14.44 seconds
pid:  1943  sim time:  14.75 seconds
pid:  1935  sim time:  14.8 seconds
pid:  1930  sim time:  14.79 seconds
pid:  1927  sim time:  14.85 seconds
pid:  1934  sim time:  14.8 seconds
pid:  1928  sim time:  14.83 seconds
pid:  1940  sim time:  14.88 seconds
pid:  1933  sim time:  15.05 seconds
pid:  1938  sim time:  15.06 seconds
total:  15.1311581135

What I do not understand is why some of the individual processes become so much slower once I use more than a certain number of processes. I should add that nothing else is running on this machine. Is this expected? Am I doing something wrong?

thomas
  • One would expect per-job times to degrade as the number of executors gets close to the total number of cores, depending on the system load average of course. My notebook has 2 cores so I can't really experiment at the moment, but you could even up the benchmarks a bit by creating the pool before you start timing and by setting map's `chunksize` argument to 1. Run `top` before the test and see how the `load average` and memory totals change. – tdelaney Nov 09 '15 at 20:26
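
The adjusted benchmark the comment suggests might look something like this (just a sketch: the pool is created before the timer starts and `chunksize=1` hands out one task at a time, reusing the busy loop from the question):

from time import time
import multiprocessing


def simulate(params):
    # same busy loop as in the question
    result = 0
    for i in range(10000):
        for j in range(10000):
            result += i * j
    return result


if __name__ == '__main__':
    for n_procs in [1, 5, 10, 20]:
        pool = multiprocessing.Pool(processes=n_procs)  # created outside the timed region
        t1 = time()
        pool.map(simulate, range(20), chunksize=1)      # one job per chunk
        print('%2d processes, total: %.2f s' % (n_procs, time() - t1))
        pool.close()
        pool.join()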

3 Answers


Cores are a shared resource, like everything else on a computer.

The OS will usually balance the load, meaning it spreads threads across as many cores as possible.* The guiding metric is per-core load.

So if there are fewer threads than cores, some cores sit idle (a single thread cannot be split across multiple cores).

If there are more threads than cores, the OS assigns several threads to a single core and multitasks between them. Switching from one thread to another on the same core has some cost.

Moving a task from one core to another costs even more (quite significant in terms of both cores' resources), so the OS generally avoids it.
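
If you want to see, or experiment with, where the Linux scheduler is allowed to place your process, the standard library exposes the process's CPU affinity. A minimal sketch, purely for illustration (Linux-only, Python 3.3+):

import os

# The set of logical CPUs this process is currently allowed to run on
# (Linux-only, Python 3.3+).
print('allowed CPUs: %s' % sorted(os.sched_getaffinity(0)))

# Purely for experimentation: pin this process to a single core so the
# scheduler cannot migrate it.
os.sched_setaffinity(0, {0})
print('now pinned to: %s' % sorted(os.sched_getaffinity(0)))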

So, getting back to your story:

Performance rose with the thread count up to the core count, because previously idle cores got new work. The last few cores were already busy with OS work anyway, so they added very little to actual performance.

Overall performance still improved after the thread count passed the core count, simply because the OS can switch away from a thread that gets stuck on a long-running task (like I/O), so another thread can use the CPU time.

Performance would decrease if the thread count significantly exceeded the core count, since too many threads would fight for the same resource (CPU time) and the switching costs would add up to a substantial portion of the CPU cycles. Judging from your listing, that has not happened yet.

As for the seemingly long execution time? It was long! The threads just did not spend all of it working. The OS switched them off and on to maximize CPU usage whenever one of them got stuck on external work (I/O), and then switched some more to spread CPU time evenly across the threads assigned to each core.
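
One way to check how much of that elapsed time a worker actually spent on the CPU versus waiting for its turn is to record wall-clock time and CPU time separately inside the worker. A sketch of such a modified simulate, assuming Linux where time.clock() reports the process's CPU time:

from time import clock, time
import os


def simulate(params):
    wall0, cpu0 = time(), clock()   # wall-clock vs. CPU time (clock() is CPU time on Linux)
    result = 0
    for i in range(10000):
        for j in range(10000):
            result += i * j
    print('pid %d  wall: %.2f s  cpu: %.2f s'
          % (os.getpid(), time() - wall0, clock() - cpu0))
    return result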

* The OS may also optimize for lowest power usage, maximized I/O throughput, etc. Linux in particular is very flexible here, but that is out of scope ;) Read up on the various Linux schedulers if you are interested.

przemo_li

This is the best answer I could come up with after looking through various questions and documentation:

It is pretty widely known that multiprocessing adds some overhead to run-time performance. This can be the result of many different factors, such as allocating memory, initializing each process, waiting for termination, and so on. That explains the increase in time when switching from a single process to parallel processing.
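
You can get a rough feel for how large that fixed overhead is by timing a pool that does essentially no work; a sketch (the noop function is just a placeholder task):

from time import time
import multiprocessing


def noop(x):
    return x


if __name__ == '__main__':
    for n_procs in [1, 5, 10, 20]:
        t1 = time()
        pool = multiprocessing.Pool(processes=n_procs)
        pool.map(noop, range(20))   # trivial tasks: almost everything measured is startup/teardown overhead
        pool.close()
        pool.join()
        print('%2d processes, overhead: %.3f s' % (n_procs, time() - t1))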

The increase in time as the number of processes grows can be explained by the way multiprocessing works. The comment by ali_m in this link was the best explanation I could find for why this happens:

For starters, if your threads share CPU cache you're likely to suffer a lot more cache misses, which can cause a big degradation in performance

This is similar to when you try to run a lot of different programs on your computer at once: everything starts to 'lag' and slow down because your CPU can only handle so many requests at a time.

Another good link I found was this. Although it is a question about SQL Server and queries, the same idea applies (regarding how the overhead grows as the number of processes/queries increases).

This is by no means a complete answer, but it is my rough understanding of why you are getting these results. Conclusion? The results you are getting are both normal and expected for multiprocessing.

R Nar

The answer to this question rather makes the question redundant: it turns out that the machine has only 12 physical cores, each of which accepts two threads.

The output of multiprocessing.cpu_count() is 24. However, lscpu shows that there are only two sockets with six cores each.

This explains why the individual runs become slower above about ten processes.
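
A quick way to compare the logical and physical counts from within Python (the psutil call uses a third-party package and is shown only as an option):

import multiprocessing
import subprocess

# Logical CPUs (hyper-threads included) - this is what Pool sees by default.
print('logical CPUs: %d' % multiprocessing.cpu_count())

# Physical layout as reported by lscpu (Linux).
print(subprocess.check_output(['lscpu']).decode())

# Alternatively, with the third-party psutil package:
# import psutil
# print('physical cores: %d' % psutil.cpu_count(logical=False))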

thomas