I need to run a series of independent subprocess calls via multiprocessing. I know in advance that some of these jobs will run orders of magnitude slower than others. In order to make efficient use of processor time, I thought it would make sense to queue the slow jobs first. That way I was hoping to avoid the odd case where one of the long-running jobs gets scheduled last while the remaining processors sit idle. In this particular case, it doesn't matter whether the actual results are yielded in order or not (the individual processes create files which are only used once all processes have completed). The thing that is important is that the slow jobs get scheduled first/preferentially. My first question is: how do I achieve this?
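To illustrate what I have in mind (with a hypothetical estimated_runtime dict standing in for my prior knowledge about which jobs are slow), I would simply sort the commands slowest-first before submitting them. A minimal sketch, with work() as a stand-in for my real subprocess call:

```python
import multiprocessing

def work(cmd):
    # stand-in for subprocess.call(cmd, shell=True)
    return len(cmd)

def schedule_slow_first(cmds, estimated_runtime, procs=2):
    # estimated_runtime is a hypothetical dict: cmd -> expected seconds.
    # Sort so the slowest jobs are handed out first.
    ordered = sorted(cmds, key=lambda c: estimated_runtime[c], reverse=True)
    pool = multiprocessing.Pool(processes=procs)
    # chunksize=1 so each worker grabs exactly one job at a time and a
    # fast job never sits queued behind a slow one in the same chunk
    results = pool.map(work, ordered, chunksize=1)
    pool.close()
    pool.join()
    return ordered, results
```

But I'm not sure sorting plus chunksize=1 is actually the idiomatic way to do this.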
Even after reading the multiprocessing documentation and this excellent post explaining the differences between map(), map_async(), etc., I'm still a bit confused. My second question therefore is: should I use Pool.map() instead of Pool.map_async() in this case, or even something else?
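One alternative I stumbled upon is Pool.imap_unordered(), which yields results as soon as they complete rather than in submission order; since I don't care about result order, maybe that's closer to what I want? A sketch of what I mean (work() is again just a stand-in):

```python
import multiprocessing

def work(x):
    # stand-in for subprocess.call(cmd, shell=True)
    return x * x

def run_unordered(items, procs=2):
    pool = multiprocessing.Pool(processes=procs)
    # imap_unordered yields each result as soon as its worker finishes,
    # in completion order rather than submission order; chunksize=1
    # hands out one task at a time
    results = list(pool.imap_unordered(work, items, chunksize=1))
    pool.close()
    pool.join()
    return results
```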
I tried map() and map_async(), but at first I didn't observe the expected behavior for either. The following example runs processes which don't do much except create a file and sleep.
import subprocess
import multiprocessing
import time
import os
import glob
import sys

NUM_PROCS = 2

def work(cmd):
    return subprocess.call(cmd, shell=True)

def generate_cmds(n=10):
    for i in xrange(n):
        yield "sleep 2 && touch %d.log" % (i)

def main():
    assert len(sys.argv) == 2, ("missing arg: async=[0|1]")
    async = sys.argv[1] == '1'
    results = []
    pool = multiprocessing.Pool(processes=NUM_PROCS)
    if async:
        print "map_async()"
        p = pool.map_async(work, generate_cmds(), callback=results.extend)
        p.wait()  # or p.close() and p.join()
    else:
        print "map()"
        results = pool.map(work, generate_cmds())
    print "results: %s" % (results)
    for f in glob.glob("[0-9]*.log"):
        print f, time.strftime('%H:%M:%S',
                               time.localtime(os.path.getmtime(f)))

if __name__ == "__main__":
    main()
I assumed the timestamps of the created files would match the command order in the case of map(), since all processes take roughly the same amount of time to complete, but they don't (unless I play with chunksize). It seems that every other command ran simultaneously:
$ python subp.py 0
map()
results: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
0.log 14:36:57
1.log 14:36:59
2.log 14:36:57
3.log 14:36:59
4.log 14:37:01
5.log 14:37:03
6.log 14:37:01
7.log 14:37:03
8.log 14:37:05
9.log 14:37:07
I know this can be fixed by using a chunksize of 1, but I don't understand why (could someone explain?). My final question is: does this mean I should use map_async(chunksize=1) for my setting?
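From skimming multiprocessing/pool.py, my best guess is that this has to do with how the default chunksize is computed when none is given (please correct me if I'm misreading the source):

```python
def default_chunksize(ntasks, nprocs):
    # my reading of what Pool.map()/map_async() do in
    # multiprocessing/pool.py when chunksize is None
    chunksize, extra = divmod(ntasks, nprocs * 4)
    if extra:
        chunksize += 1
    return chunksize

print(default_chunksize(10, 2))  # my case: 10 commands, 2 workers -> 2
```

If that's right, my 10 commands on 2 workers get chunked in pairs of consecutive commands, which would at least be consistent with 0.log/2.log and 1.log/3.log sharing timestamps above.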
Thanks, Andreas
PS: I'm using Python 2.7 in case it matters.