I'm trying to see whether Dask would be a suitable addition to my project and wrote some very simple test cases to look into its performance. However, Dask is taking a relatively long time simply to perform the lazy initialization (i.e. building the delayed objects, before any computation happens).
import dask
import dask.multiprocessing
from dask import compute, delayed

@delayed
def normd(st):
    return st.lower().replace(',', '')

@delayed
def add_vald(v):
    return v + 5

def norm(st):
    return st.lower().replace(',', '')

def add_val(v):
    return v + 5

test_list = list(range(1000))
test_list1 = ["AeBe,oF,221e"] * 1000
%timeit rlist = [add_val(y) for y in test_list]
#124 µs ± 7.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit rlist = [norm(y) for y in test_list1]
#392 µs ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit rlist = [add_vald(y) for y in test_list]
#19.1 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
rlist = [add_vald(y) for y in test_list]
%timeit rlist1 = compute(*rlist, get=dask.multiprocessing.get)
#892 ms ± 36.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit rlist = [normd(y) for y in test_list1]
#18.7 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
rlist = [normd(y) for y in test_list1]
%timeit rlist1 = compute(*rlist, get=dask.multiprocessing.get)
#912 ms ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
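For context, the restructuring I expect I'd eventually need is to batch many tiny operations into one task each, so per-task overhead is amortized. Here is a plain-Python sketch of that shape (no Dask involved; with Dask, each `add_val_chunk` call would be wrapped in `delayed`, and `chunk_size` is an arbitrary choice of mine):

```python
def add_val(v):
    return v + 5

def add_val_chunk(chunk):
    # One task processes a whole slice of the input,
    # instead of one task per element.
    return [add_val(v) for v in chunk]

def chunked(seq, chunk_size):
    # Yield successive slices of length chunk_size.
    for i in range(0, len(seq), chunk_size):
        yield seq[i:i + chunk_size]

test_list = list(range(1000))

# 10 chunk-level calls instead of 1000 element-level calls.
results = [r for chunk in chunked(test_list, 100)
           for r in add_val_chunk(chunk)]
```

Even so, my question below is about why the per-element version above is this slow to set up in the first place.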
I have looked into Dask For Loop In Parallel and parallel dask for loop slower than regular loop?. I also tried increasing the size to 1 million items: while the regular loop takes about a second, the Dask loop never finishes. After waiting half an hour for the lazy initialization of add_vald alone, I killed it.
I'm not sure what's going wrong here and would greatly appreciate any insight you can offer. Thanks!