
UPDATED WITH SOLUTION

I am having a hard time understanding Pool.

I'd like to run an analysis on 12 independent sets of data at once. The individual analyses do not depend on each other and don't share data, so I expect a near 12x increase in speed if I can run them in parallel.

However, using Pool.map, I get nowhere near that performance. To create a situation where I would expect a near 12x speed-up, I wrote a really simple function consisting of a for loop that just does arithmetic on the loop variable. No results are stored and no data is loaded. I've done this because another thread on here talked about L2 cache limiting performance, so I've tried to pare the problem down to one where there's no data, just pure computation.

import multiprocessing as mp
import time as _tm


NUM_CORE = 12         # set to the number of cores you want to use
NUM_COPIES_2_RUN = 12 # number of times we want to run the function
print("NUM_CORE       %d" % NUM_CORE)
print("NUM_COPIES     %d" % NUM_COPIES_2_RUN)

####################################################
###############################  FUNCTION DEFINITION
####################################################
def run_me(args):
    """
    function to be run NUM_COPIES_2_RUN times  (identical)
    """
    num = args[0]
    tS  = args[1]

    t1 = _tm.time()
    for i in range(5000000):
        v = ((i+i)*(i*3))/100000.

    t2 = _tm.time()
    print("work %(wn)d  %(t2).3f - %(t1).3f  = %(dt).3f" % {"wn" : num, "t1" : (t1-tS), "t2" : (t2-tS), "dt" : (t2-t1)})        

####################################################
##################################  serial execution
####################################################
print("Running %d copies of the same code in serial execution" % NUM_COPIES_2_RUN)
tStart_serial = _tm.time()

for i in range(NUM_COPIES_2_RUN):
    run_me([i, tStart_serial])

tEnd_serial   = _tm.time()

print("total time:  %.3f" % (tEnd_serial - tStart_serial))

####################################################
##############################################  Pool
####################################################
print("Running %d copies of the same code using Pool.map_async" % NUM_COPIES_2_RUN)
tStart_pool   = _tm.time()

pool = mp.Pool(NUM_CORE)
args = []
for n in range(NUM_COPIES_2_RUN):
    args.append([n, tStart_pool])

pool.map_async(run_me, args)
pool.close()
pool.join()

tEnd_pool     = _tm.time()    

print("total time:  %.3f" % (tEnd_pool - tStart_pool))

When I run this on my 16 core Linux machine, I get (param set #1)

NUM_CORE       12
NUM_COPIES     12
Running 12 copies of the same code in serial execution
work 0  0.818 - 0.000  = 0.818
work 1  1.674 - 0.818  = 0.855
work 2  2.499 - 1.674  = 0.826
work 3  3.308 - 2.499  = 0.809
work 4  4.128 - 3.308  = 0.820
work 5  4.937 - 4.128  = 0.809
work 6  5.747 - 4.937  = 0.810
work 7  6.558 - 5.747  = 0.811
work 8  7.368 - 6.558  = 0.810
work 9  8.172 - 7.368  = 0.803
work 10  8.991 - 8.172  = 0.819
work 11  9.799 - 8.991  = 0.808
total time:  9.799
Running 12 copies of the same code using Pool.map
work 1  0.990 - 0.018  = 0.972
work 8  0.991 - 0.019  = 0.972
work 5  0.992 - 0.019  = 0.973
work 7  0.992 - 0.019  = 0.973
work 3  1.886 - 0.019  = 1.867
work 6  1.886 - 0.019  = 1.867
work 4  2.288 - 0.019  = 2.269
work 9  2.290 - 0.019  = 2.270
work 0  2.293 - 0.018  = 2.274
work 11  2.293 - 0.023  = 2.270
work 2  2.294 - 0.019  = 2.275
work 10  2.332 - 0.019  = 2.313
total time:  2.425

When I change parameters (param set #2) and run again, I get

NUM_CORE       12
NUM_COPIES     6
Running 6 copies of the same code in serial execution
work 0  0.798 - 0.000  = 0.798
work 1  1.579 - 0.798  = 0.780
work 2  2.355 - 1.579  = 0.776
work 3  3.131 - 2.355  = 0.776
work 4  3.908 - 3.131  = 0.777
work 5  4.682 - 3.908  = 0.774
total time:  4.682
Running 6 copies of the same code using Pool.map_async
work 1  0.921 - 0.015  = 0.906
work 4  0.922 - 0.015  = 0.907
work 2  0.922 - 0.015  = 0.908
work 5  0.932 - 0.015  = 0.917
work 3  2.099 - 0.015  = 2.085
work 0  2.101 - 0.014  = 2.086
total time:  2.121

Using another set of parameters (param set #3),

NUM_CORE       4
NUM_COPIES     12
Running 12 copies of the same code in serial execution
work 0  0.784 - 0.000  = 0.784
work 1  1.564 - 0.784  = 0.780
work 2  2.342 - 1.564  = 0.778
work 3  3.121 - 2.342  = 0.779
work 4  3.901 - 3.121  = 0.779
work 5  4.682 - 3.901  = 0.782
work 6  5.462 - 4.682  = 0.780
work 7  6.243 - 5.462  = 0.780
work 8  7.024 - 6.243  = 0.781
work 9  7.804 - 7.024  = 0.780
work 10  8.578 - 7.804  = 0.774
work 11  9.360 - 8.578  = 0.782
total time:  9.360
Running 12 copies of the same code using Pool.map_async
work 3  0.862 - 0.006  = 0.856
work 1  0.863 - 0.006  = 0.857
work 5  1.713 - 0.863  = 0.850
work 4  1.713 - 0.863  = 0.851
work 0  2.108 - 0.006  = 2.102
work 2  2.112 - 0.006  = 2.106
work 6  2.586 - 1.713  = 0.873
work 7  2.587 - 1.713  = 0.874
work 8  3.332 - 2.109  = 1.223
work 9  3.333 - 2.113  = 1.220
work 11  3.456 - 2.587  = 0.869
work 10  3.456 - 2.586  = 0.870
total time:  3.513

This has me totally baffled. Especially with parameter set #2, I'm allowing 12 cores for 6 independent worker processes, yet my speed-up is only about 2.2x.

What is going on? I've also tried using map() and map_async(), but there seems to be no difference in performance.
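(For reference, a minimal sketch of the two call styles on the same kind of toy workload -- this is just an illustration, not my actual analysis code. Both dispatch the same tasks to the same worker processes, so near-identical timings are to be expected; the only difference is that map() blocks until everything finishes, while map_async() returns immediately and you block later on .get() or close()/join().)

import multiprocessing as mp
import time

def work(n):
    # same kind of pure-arithmetic loop as run_me() above
    for i in range(5000000):
        v = ((i + i) * (i * 3)) / 100000.
    return n

if __name__ == "__main__":
    with mp.Pool(4) as pool:
        t0 = time.time()
        pool.map(work, range(12))              # blocks until all tasks are done
        print("map        %.3f s" % (time.time() - t0))

        t0 = time.time()
        res = pool.map_async(work, range(12))  # returns immediately
        res.get()                              # block here for the results
        print("map_async  %.3f s" % (time.time() - t0))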


UPDATE:

So there were several things going on here:

1) I had fewer cores than I realized. I thought I had 16 cores, but I actually have only 8 physical cores; they show up as 16 logical cores because hyper-threading is turned on (see the core-count sketch after this list).

2) Even if I only wanted to run, say, 4 independent processes on these 8 physical cores, I was not getting the expected speed-up. I was expecting something like 3.5x in that case, but I would get that much speed-up maybe 10% of the time when I ran the tests above repeatedly. The other times I'd get anywhere from 1.5x to 3.5x, which seemed odd: I had more than enough cores for the calculations, yet most of the time the parallelization appeared to work very sub-optimally. This would make sense if there were lots of other processes on the system, but I am the only user and I had nothing computationally intensive running.

3) It turns out that having hyper-threading turned on causes this apparent under-utilization of my hardware. If I turn off hyper-threading

https://www.golinuxhub.com/2018/01/how-to-disable-or-enable-hyper.html

I get the expected ~3.5x speed-up every time I run the script posted above.
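
(For anyone checking their own machine: a minimal sketch for telling physical from logical cores. multiprocessing.cpu_count() reports logical cores only; the physical count here assumes the third-party psutil package is installed -- otherwise lscpu gives the same information.)

import multiprocessing
import psutil  # third-party package (pip install psutil); assumed available

print("logical cores :", multiprocessing.cpu_count())      # 16 on my machine, because of hyper-threading
print("physical cores:", psutil.cpu_count(logical=False))  # 8 on my machine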

PS) My actual analysis code is written in Python, with the numerically intensive portions written in Cython. It also uses numpy, and my numpy is linked against the Math Kernel Library (MKL), which can itself use multiple cores. In cases like mine, where multiple independent processes are already running in parallel, it doesn't make sense to also let MKL use multiple cores and interrupt work running on other cores, especially since calls to things like dot weren't expensive enough to overcome the overhead of MKL's own threading.

I originally thought that perhaps this was the problem:

Limit number of threads in numpy

and indeed

export MKL_NUM_THREADS=1

did improve performance somewhat, but not as much as I had hoped, which prompted me to ask this question here (and, for simplicity, I avoided using numpy altogether in the example above).
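
(A minimal sketch of doing the same thing from inside the script instead of the shell -- the variables must be set before numpy is first imported, and which variable actually matters depends on the BLAS backend your numpy is linked against, so the extra names below are an assumption for completeness.)

import os

# Must be set before numpy (or anything that imports numpy) is loaded.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-threaded backends
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # in case numpy is linked to OpenBLAS instead of MKL

import numpy as np  # linear algebra calls in this process now run single-threaded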

ken
  • Hi Magnus. mp.cpu_count() does indeed return 16 - but you're right, I seem to remember something about it actually having 8 physical cores, now that you mention it. And I was the one who ordered the machine :) - but that was quite a while ago. When I do an "lscpu", it tells me I have 8 cores per socket and 2 threads per core. Does this mean I actually only have 8 physical cores? – ken Nov 28 '19 at 07:52
  • Sorry for deleting my previous comment: I wrote a new, shorter comment on the answer below. – Wololo Nov 28 '19 at 08:06
  • No worries - thanks. I think the # of cores is definitely lower than what I've been thinking all this time. But I also find that even at 4 procs / 4 CPUs, my results tend to be rather poor, with occasional runs being as good as >3x speed-up. I will have to compare this to OpenMP parallelism to see whether this is a python multiprocessing problem or not. I am doing this to write real-time software that analyzes lots of data coming in. In the actual experimental system, I should have access to real-time computer hardware. Thanks! – ken Nov 28 '19 at 08:14
  • Yeah, test that! I do feel quite confident, however, that this is indeed a result of the OS putting your threads to sleep and awakening them (several times) as it organizes CPU resources. On my i7-9750H (6 phys. cores), four threads give a speed-up of about 3: even though each parallel subrun is almost as fast as each serial subrun, the thread overhead is quite significant. – Wololo Nov 28 '19 at 08:37
  • Also, I wouldn't say that *"this is a python multiprocessing problem"*. Rather, I would say that if speed is a considerable issue for you, you should consider implementing your code in a compiled language instead of Python :) You will be a victim of the OS putting your threads to sleep regardless of language; however, some languages may (or will) handle the threading overhead much better. – Wololo Nov 28 '19 at 08:42
  • The actual code I'm working on is in Python, but all the numerical bits are in Cython (cdef'd and nogil'd), and that part is quite fast. I'd rather not go the full C++ route because numpy offers too many advantages. However, it really seems the main issue I had was the hyperthreading introducing a lot of variability into the performance. In some trials (of the test code I posted) I got nearly the expected amount of speed-up, and in other trials I got maybe 30% of what I expected - which completely baffled me. Turning off HT, I get consistent results. I will shortly summarize in an EDIT to my original post. – ken Dec 01 '19 at 21:10

1 Answer


My guess is you're maxing out the CPU in this for loop:

for i in range(5000000):
    v = ((i+i)*(i*3))/100000.

It seems counter-intuitive that you have 16 cores and it maxes out below that, but what happens when you try a function like time.sleep(1) for each task -- does it take 16s when run serially and 1s when run on each core? If so, it would seem it comes down to CPU limitations or perhaps the internals of the Python Pool library.

Here's an example on my machine using 8 cores -- the most straightforward one I can think of -- which cuts the time down by about 8x:

import time
from multiprocessing import Pool
NUM_TIMES = 8

def func(i):
    time.sleep(1)

# serial
t0=time.time(); [func(i) for i in range(NUM_TIMES)]; print(time.time() - t0)
# 8.020868062973022

# pool.map
t0=time.time(); Pool(NUM_TIMES).map(func, range(NUM_TIMES)); print (time.time() - t0)
# 1.2892770767211914
David542
  • Thank you for the reply. No, sleep() doesn't seem to change the behavior all that much. We're getting a factor of ~6x, whereas in the previous example it was about ~4x, so maybe it's a smidgen better, but still nowhere near the 12x that seems intuitive to me. Any further thoughts? – ken Nov 28 '19 at 07:18
  • NUM_CORE 12  NUM_COPIES 12
    Running 12 copies of the same code in serial execution
    work 0  1.837 - 0.000  = 1.837
    work 1  3.615 - 1.837  = 1.778
    work 2  5.411 - 3.615  = 1.796
    work 3  7.191 - 5.411  = 1.780
    work 4  8.977 - 7.191  = 1.785
    work 5  10.764 - 8.977  = 1.787
    work 6  12.556 - 10.764  = 1.793
    work 7  14.344 - 12.556  = 1.788
    work 8  16.134 - 14.344  = 1.790
    work 9  17.923 - 16.134  = 1.789
    work 10  19.713 - 17.923  = 1.790
    work 11  21.491 - 19.713  = 1.778
    total time:  21.491
    – ken Nov 28 '19 at 07:22
  • Running 12 copies of the same code using Pool.map_async
    work 1 1.991 - 0.020 = 1.972
    work 4 1.997 - 0.020 = 1.977
    work 2 2.193 - 0.020 = 2.173
    work 7 2.200 - 0.021 = 2.179
    work 10 2.821 - 0.021 = 2.800
    work 3 2.955 - 0.020 = 2.935
    work 0 3.012 - 0.020 = 2.993
    work 8 3.093 - 0.021 = 3.072
    work 9 3.234 - 0.021 = 3.213
    work 11 3.260 - 0.021 = 3.239
    work 5 3.276 - 0.020 = 3.256
    work 6 3.278 - 0.020 = 3.258
    total time: 3.328
    – ken Nov 28 '19 at 07:24
  • @ken I just posted a bare bones example -- is that helpful at all? Note I'm using `Pool.map(...)` just for the most basic implementation, but perhaps you may want to use `Process` or `map_async`. – David542 Nov 28 '19 at 07:25
  • Thank you David. I've tried the Process and map_async, and they do not seem very different. However, when I ran your example code, I get approximately an 8x speed up, as you suggested. Is it possible to "max out" a CPU? On a multi-core machine, does this mean that not ALL cores can be running at full-speed simultaneously? Or is this something to do with the load on my system due to other processes perhaps? – ken Nov 28 '19 at 07:31
  • My actual code is Python with lots of functions written in Cython C code that's being run in parallel via Pool. But I'd have thought running plain-old Python, as in my original example, would be a piece of cake. I'm also going to try writing C code using OpenMP and see how that scales for a similar example that is purely CPU-limited. Thank you again. – ken Nov 28 '19 at 07:34
  • @ken cool -- to be honest, I'm not 100% positive about the internals of `Pool`, but it might be a good question to ask on StackOverflow, as I'm sure a lot of people could give good insight into that. – David542 Nov 28 '19 at 07:35
  • I'll post a follow-up for the C-version and report on the performance. Thanks again! – ken Nov 28 '19 at 07:36
  • I agree with @David542. On my Ubuntu Linux machine, with an Intel i7-9750H (6 phys. cores with hyperthreading), running 6 copies on 6 cores gives me a speed-up of about 5 when running the above code. – Wololo Nov 28 '19 at 07:54
  • Hi Magnus - your earlier comment seems to have disappeared - but doing an lscpu, I realize that I have an 8-core CPU with 2 threads per core, and 1 CPU socket occupied. Your comment jogged my memory. This is not really a 16-core system, but an 8-core one. Now my results make a lot more sense. Thank you both! – ken Nov 28 '19 at 07:58
  • @ken, yeah, I deleted it because I wanted to write a shorter comment here: on my CPU, each serial subrun takes about 0.500 secs and each parallel subrun takes about 0.570 secs. When I increase the number of threads to be close to the number of cores, the speed of each of my parallel subruns decreases dramatically ... – Wololo Nov 28 '19 at 08:01
  • Your OS needs CPU resources for lots of stuff. When you have "too many" threads, they will be put to sleep and awoken several times before completion. I suggest using `multiprocessing.cpu_count()` and setting the number of threads below that .. – Wololo Nov 28 '19 at 08:03
  • Actually, when I tried using 4 runs on 4 cores, I only got a speed-up of 1.5x, so I'm not approaching your performance. Perhaps my machine isn't running so hot? I have written OpenMP code before (a long, long time ago) and I seem to remember speed-ups of nearly as many x as I had cores. It could be that something is not right with my python installation. – ken Nov 28 '19 at 08:03
  • What CPU do you have? – Wololo Nov 28 '19 at 08:04
  • What CPU? I don't really know, but it's an Intel inside a Dell desktop. Running 4 runs on 4 cores a second time yielded a speed-up of 3.5x this time. There seems to be quite a bit of variability. – ken Nov 28 '19 at 08:08
  • On Linux, you may type `less /proc/cpuinfo` in a terminal window and look at "model name". When I run too many threads, each thread completes in a random time between 1.2 secs and 0.570 secs. It seems to me that the OS puts them to sleep. – Wololo Nov 28 '19 at 08:11
  • Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz. Hmm, I see. So you're getting similar performance to me, then. Good to know! Now I need to go to sleep. Thanks! – ken Nov 28 '19 at 08:15
  • @ken any luck with your code? Would be excited to hear what ended up happening... – David542 Nov 29 '19 at 19:44
  • Hi David542 - OK, so 1) my "16 core system" is actually 8 cores w/ hyperthreading. 2) My problem was not so much that I don't get the expected speed-up, i.e. a near num_cores-x speed-up. It turns out I CAN get that kind of speed-up - it's just that there was high variability in the results: if I ran a test 10 times, I might get a speed-up like that 2 or 3 times, and other times I'd get anywhere from 1.5x to 3.5x (assuming num_cores = 4). Seems like hyperthreading is the culprit. Turning HT off stabilizes the performance, so I get about 3.5x every time. I'll write up my conclusions shortly. Thanks! – ken Dec 01 '19 at 18:45