UPDATED WITH SOLUTION
I am having a hard time understanding Pool.
I'd like to run an analysis on 12 independent sets of data at once. The individual analyses don't depend on each other and don't share data, so I expect a near-12x speed-up if I can run them in parallel.
However, using Pool.map, I get nowhere near that performance. To create a situation where I should expect a near-12x speed-up, I wrote a really simple function consisting of a for loop that just does arithmetic on the loop variable. No results are stored and no data is loaded. I did this because another thread on here mentioned L2 cache limiting performance, so I pared the problem down to one with no data, just pure computation.
import multiprocessing as mp
import time as _tm

NUM_CORE = 12          # set to the number of cores you want to use
NUM_COPIES_2_RUN = 12  # number of times we want to run the function

print("NUM_CORE %d" % NUM_CORE)
print("NUM_COPIES %d" % NUM_COPIES_2_RUN)

####################################################
############################### FUNCTION DEFINITION
####################################################
def run_me(args):
    """
    function to be run NUM_COPIES_2_RUN times (identical)
    """
    num = args[0]  # worker number, used to label the output
    tS = args[1]   # common start time, so all timestamps share one origin
    t1 = _tm.time()
    for i in range(5000000):
        v = ((i+i)*(i*3))/100000.  # pure arithmetic, nothing loaded or stored
    t2 = _tm.time()
    print("work %(wn)d %(t2).3f - %(t1).3f = %(dt).3f" % {"wn": num, "t1": (t1-tS), "t2": (t2-tS), "dt": (t2-t1)})

####################################################
################################## serial execution
####################################################
print("Running %d copies of the same code in serial execution" % NUM_COPIES_2_RUN)
tStart_serial = _tm.time()
for i in range(NUM_COPIES_2_RUN):
    run_me([i, tStart_serial])
tEnd_serial = _tm.time()
print("total time: %.3f" % (tEnd_serial - tStart_serial))

####################################################
############################################## Pool
####################################################
print("Running %d copies of the same code using Pool.map_async" % NUM_COPIES_2_RUN)
tStart_pool = _tm.time()
pool = mp.Pool(NUM_CORE)
args = []
for n in range(NUM_COPIES_2_RUN):
    args.append([n, tStart_pool])
pool.map_async(run_me, args)
pool.close()  # no more tasks will be submitted
pool.join()   # block until every worker process has finished
tEnd_pool = _tm.time()
print("total time: %.3f" % (tEnd_pool - tStart_pool))
When I run this on my 16-core Linux machine, I get (param set #1)
NUM_CORE 12
NUM_COPIES 12
Running 12 copies of the same code in serial execution
work 0 0.818 - 0.000 = 0.818
work 1 1.674 - 0.818 = 0.855
work 2 2.499 - 1.674 = 0.826
work 3 3.308 - 2.499 = 0.809
work 4 4.128 - 3.308 = 0.820
work 5 4.937 - 4.128 = 0.809
work 6 5.747 - 4.937 = 0.810
work 7 6.558 - 5.747 = 0.811
work 8 7.368 - 6.558 = 0.810
work 9 8.172 - 7.368 = 0.803
work 10 8.991 - 8.172 = 0.819
work 11 9.799 - 8.991 = 0.808
total time: 9.799
Running 12 copies of the same code using Pool.map_async
work 1 0.990 - 0.018 = 0.972
work 8 0.991 - 0.019 = 0.972
work 5 0.992 - 0.019 = 0.973
work 7 0.992 - 0.019 = 0.973
work 3 1.886 - 0.019 = 1.867
work 6 1.886 - 0.019 = 1.867
work 4 2.288 - 0.019 = 2.269
work 9 2.290 - 0.019 = 2.270
work 0 2.293 - 0.018 = 2.274
work 11 2.293 - 0.023 = 2.270
work 2 2.294 - 0.019 = 2.275
work 10 2.332 - 0.019 = 2.313
total time: 2.425
When I change parameters (param set #2) and run again, I get
NUM_CORE 12
NUM_COPIES 6
Running 6 copies of the same code in serial execution
work 0 0.798 - 0.000 = 0.798
work 1 1.579 - 0.798 = 0.780
work 2 2.355 - 1.579 = 0.776
work 3 3.131 - 2.355 = 0.776
work 4 3.908 - 3.131 = 0.777
work 5 4.682 - 3.908 = 0.774
total time: 4.682
Running 6 copies of the same code using Pool.map_async
work 1 0.921 - 0.015 = 0.906
work 4 0.922 - 0.015 = 0.907
work 2 0.922 - 0.015 = 0.908
work 5 0.932 - 0.015 = 0.917
work 3 2.099 - 0.015 = 2.085
work 0 2.101 - 0.014 = 2.086
total time: 2.121
Using another set of parameters (param set #3),
NUM_CORE 4
NUM_COPIES 12
Running 12 copies of the same code in serial execution
work 0 0.784 - 0.000 = 0.784
work 1 1.564 - 0.784 = 0.780
work 2 2.342 - 1.564 = 0.778
work 3 3.121 - 2.342 = 0.779
work 4 3.901 - 3.121 = 0.779
work 5 4.682 - 3.901 = 0.782
work 6 5.462 - 4.682 = 0.780
work 7 6.243 - 5.462 = 0.780
work 8 7.024 - 6.243 = 0.781
work 9 7.804 - 7.024 = 0.780
work 10 8.578 - 7.804 = 0.774
work 11 9.360 - 8.578 = 0.782
total time: 9.360
Running 12 copies of the same code using Pool.map_async
work 3 0.862 - 0.006 = 0.856
work 1 0.863 - 0.006 = 0.857
work 5 1.713 - 0.863 = 0.850
work 4 1.713 - 0.863 = 0.851
work 0 2.108 - 0.006 = 2.102
work 2 2.112 - 0.006 = 2.106
work 6 2.586 - 1.713 = 0.873
work 7 2.587 - 1.713 = 0.874
work 8 3.332 - 2.109 = 1.223
work 9 3.333 - 2.113 = 1.220
work 11 3.456 - 2.587 = 0.869
work 10 3.456 - 2.586 = 0.870
total time: 3.513
This has me totally baffled. Especially with parameter set #2: I'm allowing 12 cores for 6 independent processes, yet my speed-up is only about 2x.
What is going on? I've also tried using map() and map_async(), but there seems to be no difference in performance.
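That part is expected: map() blocks until every result is back, while map_async() returns an AsyncResult immediately, but the close()/join() pair in the script above makes it wait for the workers either way, so the timings should match. A minimal sketch of the two call patterns, using a toy square() function for illustration:

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    with mp.Pool(4) as pool:
        # map() blocks until every task has finished, then returns the results
        results = pool.map(square, range(12))

        # map_async() returns immediately; .get() blocks until done and also
        # re-raises any exception that occurred inside a worker
        async_result = pool.map_async(square, range(12))
        results = async_result.get()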
UPDATE:
So there were several things going on here:
1) I had fewer cores than I realized. I thought I had 16 cores, but I only had 8 physical cores, and 16 logical cores because hyper-threading was turned on. (See the sketch after this list for a quick way to check.)
2) Even if I ran only, say, 4 independent processes on these 8 physical cores, I was not getting the expected speed-up of roughly 3.5x. Running the tests above repeatedly, I'd get that much speed-up maybe 10% of the time; the rest of the time I'd get anywhere from 1.5x to 3.5x. That seemed odd, because I had more than enough cores for the calculations, yet most of the time the parallelization worked very sub-optimally. It would make sense if lots of other processes were running on the system, but I'm the only user and nothing computationally intensive was running.
3) It turns out that having hyper-threading turned on causes this apparent under-utilization of my hardware. If I turn off hyper-threading (https://www.golinuxhub.com/2018/01/how-to-disable-or-enable-hyper.html), I get the expected ~3.5x speed-up every time I run the script posted above.
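For reference, here is a minimal sketch for checking logical vs. physical core counts, and for pinning workers to distinct physical cores as an alternative to disabling hyper-threading in the BIOS. It assumes psutil is installed (pip install psutil) and that, as on many Linux boxes, CPUs 0 through 7 are the physical cores; check /sys/devices/system/cpu/cpu*/topology/thread_siblings_list for your machine's actual layout.

import os
import multiprocessing as mp
import psutil  # third-party, assumed installed

# os.cpu_count() reports *logical* CPUs, i.e. it counts hyper-threads
print("logical CPUs  :", os.cpu_count())
print("physical cores:", psutil.cpu_count(logical=False))

def worker(core_id):
    # restrict this process to one CPU (Linux-only API); 0 = current process
    os.sched_setaffinity(0, {core_id})
    # ... the actual computation goes here ...

if __name__ == "__main__":
    # assumption: cores 0..7 are the physical cores on this machine
    procs = [mp.Process(target=worker, args=(c,)) for c in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()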
PS) My actual analysis code is written in Python, with the numerically intensive portions written in Cython. It also uses numpy, and my numpy is linked against the Math Kernel Library (MKL), which can itself use multiple cores. In cases like mine, where multiple independent processes need to run in parallel, it makes no sense for MKL to use multiple cores and interrupt threads running on other cores, especially since calls like dot weren't expensive enough to overcome the overhead of multi-threading them.
I thought at first that this was the problem (see "Limit number of threads in numpy"). Setting

export MKL_NUM_THREADS=1

did improve performance somewhat, but not as much as I had hoped, which prompted me to ask this question here (and, for simplicity, to avoid numpy altogether in the test script above).
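For completeness, a minimal sketch of the two usual ways to cap MKL/BLAS threading from inside Python. The environment variables must be set before numpy is first imported; threadpoolctl is a third-party package (pip install threadpoolctl) and is assumed installed.

import os

# option 1: environment variables, set before numpy/MKL is first imported
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
import numpy as np

# option 2: threadpoolctl can limit the thread pools of an already-imported
# numpy, scoped to a with-block
from threadpoolctl import threadpool_limits

a = np.random.rand(1000, 1000)
with threadpool_limits(limits=1):
    b = a @ a  # this BLAS call runs single-threaded inside the block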