
I'm using

    dask.compute(*delayeds, scheduler='processes', num_workers=4)

to run computations in parallel.

However, I hit a problem when retrieving the computation results: the returned object is larger than 4 GB. The default pickle protocol used by multiprocessing is 3, and 4 GB is its per-object limit.
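For what it's worth, the limit is easy to reproduce in isolation (a minimal sketch; note it allocates a 4 GiB buffer, so it needs plenty of free RAM):

    import pickle

    big = bytes(2 ** 32)             # exactly 4 GiB of zeros; needs ample free RAM
    pickle.dumps(big, protocol=3)    # OverflowError: cannot serialize a bytes
                                     # object larger than 4 GiB
    pickle.dumps(big, protocol=4)    # fine: protocol 4 supports large objects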

I'd like to know if it's possible to change the protocol to 4.

I found a hint in "How to change the serialization method used by the multiprocessing module?", but it does not seem to work on Windows.
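For reference, the approach from that question boils down to swapping multiprocessing's reducer for one that pins protocol 4. A rough, untested sketch of that pattern (on Windows, worker processes are spawned fresh and re-import the default reducer, which is presumably why the patch does not reach them):

    import multiprocessing as mp
    from multiprocessing import reduction

    class ForkingPickler4(reduction.ForkingPickler):
        @classmethod
        def dumps(cls, obj, protocol=4):
            # pin protocol 4 instead of the pre-3.8 default of 3
            return super().dumps(obj, protocol)

    def dump4(obj, file, protocol=4):
        ForkingPickler4(file, protocol).dump(obj)

    class Pickle4Reducer(reduction.AbstractReducer):
        ForkingPickler = ForkingPickler4
        register = ForkingPickler4.register
        dump = staticmethod(dump4)

    ctx = mp.get_context()
    ctx.reducer = Pickle4Reducer()   # must run before any pool/queue is created

Note also that Python 3.8 changed the default pickle protocol to 5, which (like 4) supports objects over 4 GiB, so upgrading Python may sidestep the problem entirely.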

Thanks

xavart

1 Answer


Before answering the specific question, a couple of notes:

  • you should probably not use multiprocessing, but rather the distributed scheduler, which is more modern and capable, and has a well-defined, pluggable serialisation protocol (see the sketch after this list)

  • sending 4 GB to/from workers seems very much like an anti-pattern; is there no way you can load the data directly in the workers, and write or aggregate the results before retrieval, avoiding the serialisation problem altogether?
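A minimal sketch of both suggestions (assuming dask.distributed is installed; `heavy_computation`, `parts`, and the output paths are illustrative placeholders, not your actual code):

    import pickle
    import dask
    from dask.distributed import Client

    @dask.delayed
    def compute_and_save(part, path):
        # hypothetical per-partition work; `heavy_computation` is a placeholder
        result = heavy_computation(part)
        with open(path, 'wb') as f:
            pickle.dump(result, f, protocol=4)  # large result stays on disk
        return path                             # only a short string is sent back

    if __name__ == '__main__':                  # guard required for spawned workers
        client = Client(n_workers=4)            # local cluster with 4 worker processes
        tasks = [compute_and_save(p, f'out_{i}.pkl')
                 for i, p in enumerate(parts)]  # `parts` is your input partitioning
        # once a Client exists, dask.compute uses the distributed scheduler
        paths = dask.compute(*tasks)

The point of the design is that the multi-gigabyte objects never cross a process boundary: each worker writes its own result and only a filename travels back to the parent.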

mdurant
  • Actually my code worked well for my problem, and using the dask multiprocessing scheduler was the least intrusive way to get a speedup almost proportional to the number of processes. It worked well, but we tried to find the limits of the code as the problem grows. We know we will hit a RAM limit at some point, but we hit this serialisation problem first, since I have 64 GB of RAM. I thought the easiest and fastest way to increase the problem size was to get past this serialisation limit by moving from pickle protocol 3 to 4. – xavart Oct 19 '19 at 09:08
  • I strongly encourage you to consider avoiding the serialisation, which must be costing you significant time and CPU as well as memory – mdurant Oct 19 '19 at 16:05