
I have the following Python script, in which I create a dask dataframe from an existing pandas dataframe. I'm using the multiprocessing scheduler, since my functions are pure Python. The scheduler creates 8 processes (one per partition), but they run sequentially, one at a time.

import dask
import dask.multiprocessing
import dask.dataframe as ddf

dask_data = ddf.from_pandas(data, npartitions=8)

# Parse each comma-separated string column into a list of ints.
dask_data = dask_data.assign(
    images_array_1=dask_data.images_array_1.apply(lambda x: [] if x == "" else [int(el) for el in x.split(',')], name='images_array_1'),
    images_array_2=dask_data.images_array_2.apply(lambda x: [] if x == "" else [int(el) for el in x.split(',')], name='images_array_2')
)
dask_data.compute(get=dask.multiprocessing.get)

I'm using dask only to parallelize the computation; my dataset is small enough to fit in main memory.

Is it possible to run every process in parallel?

Muller20
  • How do you know that they're running only one at a time? – MRocklin Jun 29 '16 at 15:22
  • @MRocklin Windows task manager shows only one process with CPU usage above 0% – Muller20 Jun 29 '16 at 17:01
  • I suspect that your computation is bound by moving the dataframes between processes, and not by actual computations. You could try the [distributed](http://distributed.readthedocs.org) scheduler on a single machine which would handle data locality a bit better. However, as is usually the case when using `apply` the real solution may be to find some algorithm within Pandas to do the work for you at C speeds rather than Python speeds. A single core running fast code may be much faster than eight cores running slow Python code. – MRocklin Jun 29 '16 at 22:20
  • Ok thanks, that's it. Is there any way to allow memory mapping of dataframe partitions? joblib does it and for some reason I assumed that dask also did it. – Muller20 Jun 30 '16 at 20:54
  • Possible duplicate of [python dask DataFrame, support for (trivially parallelizable) row apply?](http://stackoverflow.com/questions/31361721/python-dask-dataframe-support-for-trivially-parallelizable-row-apply) – Rocketq Aug 08 '16 at 12:18
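
A minimal sketch of the single-machine distributed scheduler suggested in MRocklin's comment above, assuming the `distributed` package is installed. `Client()` with no arguments starts a local scheduler plus worker processes and becomes the default for later `compute()` calls; the `meta` tuples are illustrative and describe the output column for newer dask versions, which take `meta` rather than the older `name` keyword.

from dask.distributed import Client
import dask.dataframe as ddf

client = Client()  # local scheduler + one worker process per core

def parse(x):
    # Same conversion as in the question: "" -> [], "1,2" -> [1, 2]
    return [] if x == "" else [int(el) for el in x.split(',')]

dask_data = ddf.from_pandas(data, npartitions=8)
dask_data = dask_data.assign(
    images_array_1=dask_data.images_array_1.apply(parse, meta=('images_array_1', 'object')),
    images_array_2=dask_data.images_array_2.apply(parse, meta=('images_array_2', 'object')),
)
result = dask_data.compute()  # runs on the local workers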

1 Answer


You need to do the apply inside map_partitions to be able to run it in parallel.
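
A minimal sketch of that suggestion, reusing the parsing logic from the question; `parse_partition` is a hypothetical helper name. `map_partitions` hands each task a whole pandas DataFrame, so the row-wise apply happens inside each partition and the partitions themselves can be scheduled in parallel.

def parse(x):
    return [] if x == "" else [int(el) for el in x.split(',')]

def parse_partition(df):
    # df is a plain pandas DataFrame holding one partition
    return df.assign(
        images_array_1=df.images_array_1.apply(parse),
        images_array_2=df.images_array_2.apply(parse),
    )

dask_data = dask_data.map_partitions(parse_partition)
dask_data.compute(get=dask.multiprocessing.get)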

user1880062