
I have the following Python script, in which I create a dask dataframe from an existing pandas dataframe. I'm using the multiprocessing scheduler, since my functions are pure Python. The scheduler creates 8 processes (one per partition), but they run sequentially, one at a time.

import dask
import dask.multiprocessing
import dask.dataframe as ddf

dask_data = ddf.from_pandas(data, npartitions=8)

# Parse each comma-separated string column into a list of ints.
dask_data = dask_data.assign(
    images_array_1=dask_data.images_array_1.apply(lambda x: [] if x == "" else [int(el) for el in x.split(',')], name='images_array_1'),
    images_array_2=dask_data.images_array_2.apply(lambda x: [] if x == "" else [int(el) for el in x.split(',')], name='images_array_2')
)
dask_data.compute(get=dask.multiprocessing.get)

I'm using dask only to parallelize the computation; my dataset is small enough to fit in main memory.

Is it possible to run every process in parallel?

Muller20
  • How do you know that they're running only one at a time? – MRocklin Jun 29 '16 at 15:22
  • @MRocklin Windows task manager shows only one process with CPU usage above 0% – Muller20 Jun 29 '16 at 17:01
  • I suspect that your computation is bound by moving the dataframes between processes, and not by actual computations. You could try the [distributed](http://distributed.readthedocs.org) scheduler on a single machine which would handle data locality a bit better. However, as is usually the case when using `apply` the real solution may be to find some algorithm within Pandas to do the work for you at C speeds rather than Python speeds. A single core running fast code may be much faster than eight cores running slow Python code. – MRocklin Jun 29 '16 at 22:20
  • Ok thanks, that's it. Is there any way to allow memory mapping of dataframe partitions? joblib does it and for some reason I assumed that dask also did it. – Muller20 Jun 30 '16 at 20:54
  • Possible duplicate of [python dask DataFrame, support for (trivially parallelizable) row apply?](http://stackoverflow.com/questions/31361721/python-dask-dataframe-support-for-trivially-parallelizable-row-apply) – Rocketq Aug 08 '16 at 12:18
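
A minimal sketch of the single-machine distributed scheduler suggested in MRocklin's comment above, assuming the `distributed` package is installed. `Client()` with no arguments starts a local scheduler plus worker processes and becomes the default for later `compute()` calls; the `meta` tuples are illustrative and describe the output column for newer dask versions, which take `meta` rather than the older `name` keyword.

from dask.distributed import Client
import dask.dataframe as ddf

client = Client()  # local scheduler + one worker process per core

def parse(x):
    # Same conversion as in the question: "" -> [], "1,2" -> [1, 2]
    return [] if x == "" else [int(el) for el in x.split(',')]

dask_data = ddf.from_pandas(data, npartitions=8)
dask_data = dask_data.assign(
    images_array_1=dask_data.images_array_1.apply(parse, meta=('images_array_1', 'object')),
    images_array_2=dask_data.images_array_2.apply(parse, meta=('images_array_2', 'object')),
)
result = dask_data.compute()  # runs on the local workers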

1 Answer


You need to do the apply inside map_partitions to be able to run it in parallel.
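
A minimal sketch of that suggestion, reusing the parsing logic from the question; `parse_partition` is a hypothetical helper name. `map_partitions` hands each task a whole pandas DataFrame, so the row-wise apply happens inside each partition and the partitions themselves can be scheduled in parallel.

def parse(x):
    return [] if x == "" else [int(el) for el in x.split(',')]

def parse_partition(df):
    # df is a plain pandas DataFrame holding one partition
    return df.assign(
        images_array_1=df.images_array_1.apply(parse),
        images_array_2=df.images_array_2.apply(parse),
    )

dask_data = dask_data.map_partitions(parse_partition)
dask_data.compute(get=dask.multiprocessing.get)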

user1880062