
I have many small tasks to do in a for loop, and I want to use concurrency to speed them up. I chose joblib because it is easy to integrate. However, I found that using joblib makes my program run much slower than a plain for loop. Here is the demo code:

import time
import random
from os import path
import tempfile
import numpy as np
import gc
from joblib import Parallel, delayed, load, dump

def func(a, i):
    '''a simple task for demonstration'''
    a[i] = random.random()

def memmap(a):
    '''use memory mapping to prevent memory allocation for each worker'''
    tmp_dir = tempfile.mkdtemp()
    mmap_fn = path.join(tmp_dir, 'a.mmap')
    print('mmap file:', mmap_fn)
    _ = dump(a, mmap_fn)        # dump
    a_mmap = load(mmap_fn, 'r+') # load
    del a
    gc.collect()
    return a_mmap

if __name__ == '__main__':
    N = 10000
    a = np.zeros(N)

    # memory mapping
    a = memmap(a)

    # parfor
    t0 = time.time()
    Parallel(n_jobs=4)(delayed(func)(a, i) for i in range(N))
    t1 = time.time()-t0

    # for 
    t0 = time.time()
    [func(a, i) for i in range(N)]
    t2 = time.time()-t0  

    # joblib time vs for time
    print(t1, t2)

On my laptop (i5-2520M CPU, 4 cores, Win7 64-bit), the running time is 6.464 s for joblib and 0.004 s for the plain for loop.

I passed the array as a memory map to avoid the overhead of reallocating it in each worker. I've read this related post, but it still didn't solve my problem. Why does this happen? Have I missed some guidelines for using joblib correctly?

nn0p
  • If the single-process version takes only 0.004s, I suspect there's not enough work to justify the setup overhead of multiprocessing (upon which joblib sits). Did you try any benchmarks involving larger workloads? For example, a scenario that might take a single process 2 or 3 minutes? – FMc Jun 24 '14 at 09:28
  • @FMc Actually, I tested some heavier work in each iteration, and that makes the running time of joblib much closer to the `for` loop. However, the problem I face is the opposite: the tasks are tiny, and I want to find a simple way to speed them up. I considered using threading, but Python has the GIL. – nn0p Jun 24 '14 at 10:23
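
On the threading point above: joblib can also run tasks in threads instead of processes, which avoids the process start-up and memmap dance, but for tiny pure-Python tasks the GIL keeps them effectively serialized, so it rarely helps here. A minimal sketch of the idea (editor's illustration, not from the thread; `backend='threading'` is a standard joblib argument):

import random
import time
from joblib import Parallel, delayed

def func(a, i):
    '''the same tiny task as in the question'''
    a[i] = random.random()

if __name__ == '__main__':
    N = 10000
    a = [0.0] * N  # threads share memory, so no memmap is needed
    t0 = time.time()
    # Thread start-up is cheap compared to spawning processes, but the GIL
    # lets only one of these pure-Python tasks run at a time.
    Parallel(n_jobs=4, backend='threading')(delayed(func)(a, i) for i in range(N))
    print(time.time() - t0)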

1 Answer


"Many small tasks" are not a good fit for joblib. The coarser the task granularity, the less overhead joblib causes and the more benefit you will have from it. With tiny tasks, the cost of setting up worker processes and communicating data to them will outweigh any any benefit from parallelization.

Fred Foo
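
To make the granularity point concrete, one common fix is to hand each worker a whole chunk of the loop rather than one element at a time, so the per-task overhead is amortized. The sketch below is an editorial illustration, not part of the original answer; the `run_chunk` helper and the chunk count are assumptions, and newer joblib releases can also batch tiny tasks automatically via the `batch_size` argument of `Parallel`:

import numpy as np
from joblib import Parallel, delayed

def run_chunk(start, stop):
    '''do a whole slice of the tiny tasks in one worker call'''
    # Each worker builds its own small result array; the parent process
    # stitches the pieces together instead of sharing a writable memmap.
    return np.random.random(stop - start)

if __name__ == '__main__':
    N = 10000
    n_chunks = 4
    bounds = np.linspace(0, N, n_chunks + 1, dtype=int)

    # One delayed call per chunk, not one per element.
    parts = Parallel(n_jobs=4)(
        delayed(run_chunk)(bounds[k], bounds[k + 1]) for k in range(n_chunks)
    )
    a = np.concatenate(parts)
    assert a.shape == (N,)
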
  • Thanks, larsmans. Yes, a large number of tiny tasks creates too much worker-scheduling overhead. Is there any other method or module to make the process faster? Most of the time my CPU utilization is only 10%~20%, which wastes my laptop's computational ability. I hope it doesn't involve rewriting too much of my code. – nn0p Jun 24 '14 at 10:17
  • I was having this problem so I tried a demo with a small number of tasks:

    def compute(upperBound):
        output = []
        for i in range(int(upperBound)):
            output.append(sqrt(i**2))
        return output

    if __name__ == "__main__":
        upperBound = 1e7
        numerOfRuns = 6
        #output = Parallel(n_jobs=1)(delayed(compute)(upperBound) for i in range(numerOfRuns)) # 19 seconds
        output = Parallel(n_jobs=2)(delayed(compute)(upperBound) for i in range(numerOfRuns)) # 2 minutes!

    Can anyone explain this? – David Doria Jul 31 '14 at 13:52
  • @larsmans Sorry about that - this should help: https://gist.github.com/daviddoria/a8a50e63e1483ea9eab2 – David Doria Jul 31 '14 at 14:21
  • @DavidDoria Probably communication overhead. Returning large structures isn't a very good idea either with joblib (although for NumPy arrays it's pretty well optimized). – Fred Foo Jul 31 '14 at 15:07
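
On that last point about return values: shipping a huge Python list back from each worker means serializing millions of separate float objects, whereas a NumPy array travels as one contiguous buffer. A rough sketch of the difference (editor's illustration, not from the thread; the function names are made up):

import numpy as np
from joblib import Parallel, delayed

def compute_list(upper_bound):
    # Large Python list: every element is pickled individually on the way back.
    return [(i ** 2) ** 0.5 for i in range(int(upper_bound))]

def compute_array(upper_bound):
    # NumPy array: one contiguous buffer, far cheaper to serialize.
    x = np.arange(int(upper_bound), dtype=float)
    return np.sqrt(x ** 2)

if __name__ == '__main__':
    upper_bound = 1e6
    runs = 6
    results = Parallel(n_jobs=2)(
        delayed(compute_array)(upper_bound) for _ in range(runs)
    )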