I have the following Python script, in which I create a Dask DataFrame from an existing pandas DataFrame. I'm using the multiprocessing scheduler, since my functions are pure Python. The scheduler creates 8 processes (one per partition), but they run sequentially, one at a time.
import dask.dataframe as ddf

dask_data = ddf.from_pandas(data, npartitions=8)
# Parse the comma-separated strings into lists of ints.
dask_data = dask_data.assign(
    images_array_1=dask_data.images_array_1.apply(
        lambda x: [] if x == "" else [int(el) for el in x.split(',')], meta=('images_array_1', 'object')),
    images_array_2=dask_data.images_array_2.apply(
        lambda x: [] if x == "" else [int(el) for el in x.split(',')], meta=('images_array_2', 'object')),
)
dask_data.compute(scheduler='processes')
I'm using Dask only to parallelize the computation; the dataset is small enough to fit in main memory. How can I get the processes to actually run in parallel?
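For reference, here is a minimal, self-contained sketch of how one could check whether the partitions run concurrently. The DataFrame contents are made up, and the sleep stands in for my real per-row work: with 8 one-second tasks across 8 partitions, the whole computation should take roughly one second of wall time (given enough cores) instead of eight.

import time
import dask.dataframe as ddf
import pandas as pd

def parse(x):
    time.sleep(1)  # stand-in for the pure-Python per-row work
    return [] if x == "" else [int(el) for el in x.split(',')]

if __name__ == '__main__':  # guard needed on platforms that spawn processes
    toy = pd.DataFrame({'images_array_1': ['1,2,3', ''] * 4})  # 8 rows of made-up data
    toy_ddf = ddf.from_pandas(toy, npartitions=8)
    start = time.time()
    toy_ddf.images_array_1.apply(parse, meta=('images_array_1', 'object')).compute(scheduler='processes')
    print('elapsed: %.1fs' % (time.time() - start))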