This post says that "if the body of your loop is simple, the interpreter overhead of the loop itself can be a substantial amount of the overhead", and gives this example to illustrate Parallel:
import numpy as np
from joblib import Parallel, delayed

def convolve_random(size):
    '''Convolve two random arrays of length "size".'''
    return np.convolve(np.random.random_sample(size), np.random.random_sample(size))

%timeit convolve_random(40000)
1 loops, best of 3: 904 ms per loop

%timeit [convolve_random(40000 + i*1000) for i in xrange(8)]
1 loops, best of 3: 8.69 s per loop

# In parallel, with 8 jobs
%timeit Parallel(n_jobs=8)(delayed(convolve_random)(40000 + i*1000) for i in xrange(8))
1 loops, best of 3: 2.88 s per loop
In this case, is there a way to estimate the Python interpreter overhead of the loop itself?
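For example, would timing the same two looping constructs with a do-nothing stand-in for convolve_random be a reasonable baseline, so that almost everything measured is the loop or dispatch cost rather than the convolution work? Something along these lines (this is only a sketch, not from the post; the names noop and n_calls are mine, purely for illustration):

import timeit
from joblib import Parallel, delayed

def noop(size):
    # Stand-in for convolve_random that does no real work.
    return None

n_calls = 8

# Interpreter overhead of the plain list-comprehension loop: typically
# microseconds for 8 calls, i.e. negligible next to ~0.9 s per convolution.
plain = timeit.timeit(
    lambda: [noop(40000 + i * 1000) for i in range(n_calls)],
    number=1000) / 1000

# The same no-op through joblib.Parallel: this instead captures the cost of
# spawning workers, pickling arguments and collecting results.
parallel = timeit.timeit(
    lambda: Parallel(n_jobs=8)(delayed(noop)(40000 + i * 1000) for i in range(n_calls)),
    number=10) / 10

print("plain loop overhead per run   : %.6f s" % plain)
print("Parallel dispatch cost per run: %.3f s" % parallel)

Subtracting these baselines from the 8.69 s and 2.88 s figures above would then give a rough idea of how much of each timing is loop/dispatch overhead rather than convolution work, if that approach is sound.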