
When I try to parallelize a for loop with dask, it ends up executing slower than the plain version. I'm basically following the introductory example from the dask tutorial, but for some reason I get no speedup on my end. What am I doing wrong?

In [1]: import numpy as np
   ...: from dask import delayed, compute
   ...: import dask.multiprocessing

In [2]: a10e4 = np.random.rand(10000, 11).astype(np.float16)
   ...: b10e4 = np.random.rand(10000, 11).astype(np.float16)

In [3]: def subtract(a, b):
   ...:     return a - b

In [4]: %%timeit
   ...: results = [subtract(a10e4, b10e4[index]) for index in range(len(b10e4))]
1 loop, best of 3: 10.6 s per loop

In [5]: %%timeit
   ...: values = [delayed(subtract)(a10e4, b10e4[index]) for index in range(len(b10e4))]
   ...: resultsDask = compute(*values, get=dask.multiprocessing.get)
1 loop, best of 3: 14.4 s per loop
mistakeNot

1 Answer


Two issues:

  1. Dask introduces about a millisecond of overhead per task. You'll want to ensure that your computations take significantly longer than that.
  2. When using the multiprocessing scheduler, data gets serialized between processes, which can be quite expensive. See http://dask.pydata.org/en/latest/setup.html. A sketch addressing both points follows this list.
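
As a rough sketch of both points (not code from the thread): grouping many row subtractions into one delayed call amortizes the ~1 ms per-task overhead, and a threaded scheduler avoids inter-process serialization for NumPy work, which releases the GIL during the subtraction. The subtract_block helper and the block_size value are illustrative choices; note that newer dask versions select the scheduler with scheduler=... rather than get=....

    import numpy as np
    from dask import delayed, compute

    a10e4 = np.random.rand(10000, 11).astype(np.float16)
    b10e4 = np.random.rand(10000, 11).astype(np.float16)

    def subtract_block(a, b_block):
        # One task covers a whole block of rows, so the ~1 ms per-task
        # overhead is paid once per block instead of once per row.
        return [a - row for row in b_block]

    block_size = 1000  # illustrative; tune so each task runs well over 1 ms
    values = [delayed(subtract_block)(a10e4, b10e4[i:i + block_size])
              for i in range(0, len(b10e4), block_size)]
    blocks = compute(*values, scheduler="threads")  # threads: no pickling of the arrays
    results = [r for block in blocks for r in block]

This produces the same list of 10,000 difference arrays as the original loop, but with only ten tasks instead of ten thousand.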
MRocklin
  • I see. Well, there are 10,000 subtractions going on, each taking ~1 ms. I was thinking that dask splits up the for loop and sends ~1250 subtractions to each of the separate CPUs (8 cores), but that might be me misunderstanding it. Is it splitting up individual subtraction operations for the CPUs to process? That would explain why there is no speedup – mistakeNot Feb 12 '18 at 15:53
  • It does send roughly 1250 subtractions to each CPU, but there is overhead to each operation: every delayed call costs you about 1 ms. You're encouraged to group things together. For a simple operation like this you'd use something like dask.array, dask.dataframe or dask.bag, which group things for you. – MRocklin Feb 12 '18 at 21:20
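
To illustrate the dask.array suggestion from the comment above (a sketch under assumed chunk sizes, not code from the thread): the explicit Python loop can be written as one broadcast subtraction over chunked arrays, so the chunking does the grouping for you.

    import numpy as np
    import dask.array as da

    a10e4 = np.random.rand(10000, 11).astype(np.float16)
    b10e4 = np.random.rand(10000, 11).astype(np.float16)

    # Coarse chunks: each chunk becomes one task, amortizing the per-task overhead.
    da_a = da.from_array(a10e4, chunks=(2000, 11))
    da_b = da.from_array(b10e4, chunks=(2000, 11))

    # Broadcasting replaces the loop: result[i] is equivalent to a10e4 - b10e4[index].
    result = da_a[None, :, :] - da_b[:, None, :]   # lazy, shape (10000, 10000, 11)

    # The full result is roughly 2 GB in float16, so materialize only what is needed.
    first_rows = result[:100].compute()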