
I would like to use dask to parallelize a number-crunching task.

This task utilizes only one of the cores in my computer.

As a result of that task I would like to add an entry to a DataFrame via shared_df.loc[len(shared_df)] = [x, 'y']. This DataFrame should be populated by all the (four) parallel workers / threads in my computer.

How do I have to set up dask to perform this?

  • It looks to me like the same question you asked in this [comment](https://stackoverflow.com/questions/53320422/how-to-use-pandas-dataframe-in-shared-memory-during-multiprocessing?noredirect=1#comment93542449_53320422). Have a look at my comment for a toy example. Otherwise please share a [mcve](https://stackoverflow.com/help/mcve) for this particular problem. It's not clear to me what `[x, 'y']` are. – rpanai Nov 16 '18 at 14:50

1 Answer


The right way to do something like this, in rough outline:

  • make a function that, for a given argument, returns a data-frame holding some part of the total data

  • wrap this function in dask.delayed, make a list of calls (one per input argument), and build a dask dataframe with dd.from_delayed

  • if you really need the index to be sorted, or partitioned along different lines than the chunking you applied in the previous step, you may want to call set_index

Please read the docstrings and examples for each of these steps!

mdurant