
I am trying to find the correct syntax for using a for loop with dask delayed. I have found several tutorials and other questions, but none fits my case, which is extremely basic.

First, is this the correct way to run a for-loop in parallel?

%%time

from dask import delayed

list_names = ['a', 'b', 'c', 'd']
keep_return = []

@delayed
def loop_dummy(target):
    for i in range(1000000000):   # busy loop to simulate work
        pass
    print('passed value is:' + target)
    return 1


for i in list_names:
    c = loop_dummy(i)             # builds a lazy task, nothing runs yet
    keep_return.append(c)


total = delayed(sum)(keep_return)
total.compute()

This produced

passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 53s

If I run this in serial,

%%time

list_names = ['a', 'b', 'c', 'd']
keep_return = []


def loop_dummy(target):
    for i in range(1000000000):
        pass
    print('passed value is:' + target)
    return 1


for i in list_names:
    c = loop_dummy(i)
    keep_return.append(c)

it is actually faster.

passed value is:a
passed value is:b
passed value is:c
passed value is:d
Wall time: 1min 49s

I have seen examples stating that Dask adds a small amount of overhead, but this computation seems to run long enough to justify that overhead, no?

My actual for loop involves heavier computation where I build a model for various targets.
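
Roughly, the pattern I am aiming for looks like this (fit_model and the target names here are just placeholders, not my real code):

from dask import delayed
import dask

@delayed
def fit_model(target):
    # placeholder for the real, heavy model-building work for one target
    return 'model_for_' + target

tasks = [fit_model(t) for t in ['target_a', 'target_b', 'target_c']]
models = dask.compute(*tasks)   # tuple with one result per target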


1 Answer


This computation

for i in range(...):
    pass

is bound by the global interpreter lock (GIL). You will want to use the multiprocessing or dask.distributed Dask backends rather than the default threading backend. I recommend the following:

total.compute(scheduler='multiprocessing')
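
Putting that together, a minimal sketch of the example from the question on the multiprocessing scheduler could look like this (the delayed-building code is unchanged, only the compute call differs):

from dask import delayed

@delayed
def loop_dummy(target):
    for i in range(1000000000):   # pure-Python loop, holds the GIL
        pass
    print('passed value is:' + target)
    return 1

keep_return = [loop_dummy(name) for name in ['a', 'b', 'c', 'd']]
total = delayed(sum)(keep_return)

# run the tasks in separate processes so the GIL is not the bottleneck;
# a standalone script on Windows may also need the usual
# "if __name__ == '__main__':" multiprocessing guard
total.compute(scheduler='multiprocessing')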

However, if your actual computation is mostly NumPy/Pandas/Scikit-Learn/other numeric package code, then the default threading backend is probably the right choice.

More information about choosing between schedulers is available here: http://dask.pydata.org/en/latest/scheduling.html
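
For example, instead of passing scheduler= to every compute call, the choice can also be set globally through Dask's configuration (a small sketch; assumes a Dask version that has dask.config.set):

import dask

# use processes for all subsequent .compute() calls in this session
dask.config.set(scheduler='multiprocessing')

total.compute()   # now runs on the multiprocessing scheduler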

  • Thank you. If I am using those libraries (which I am) and leave the scheduler as is, will Dask provide any benefit? Also curious: given my objective of training many models on different independent target variables, is there a better way using Dask? Happy to create an additional question if needed. – B_Miner Jun 30 '18 at 00:28
  • Yes, if you are using mostly numeric libraries then the threaded scheduler is a good choice. If you aren't using those libraries then Dask is still useful, but you should use the multiprocessing scheduler. I recommend reading the documentation page linked to above. If you are looking at training many models then you may want to look at http://dask-ml.readthedocs.io/en/latest/ – MRocklin Jun 30 '18 at 00:50
  • Thanks, I'll look. Just to confirm: Dask in my loop should still be faster than serial, even using the default scheduler (i.e. leaving it as is)? – B_Miner Jun 30 '18 at 00:57
  • That depends on your computation. Parallel programming is not always faster. – MRocklin Jun 30 '18 at 11:30