Pathos, Dask, futures, which one to use for parallel cluster application?

Question

I am confused here. I have a CPU-bound application, so I implemented process-based parallelisation to get around the GIL.

I first tried multiprocessing and futures, but I ran into a pickling issue, so I switched to pathos, which uses dill as a pickle replacement.
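
A minimal sketch of this kind of pickling failure, using only the standard library. The failing code itself is not shown in the question, so the lambda here is an assumption chosen for illustration, and `can_pickle` is a hypothetical helper name.

```python
import math
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Functions are pickled by reference (module name + attribute name),
# so an importable function such as math.sqrt serializes fine.
print(can_pickle(math.sqrt))        # True

# A lambda has no importable name, so the lookup fails. This is the kind
# of object that multiprocessing cannot send to a worker process.
print(can_pickle(lambda x: x * x))  # False
```

dill serializes the function body itself rather than a reference to it, which is why the switch to pathos makes objects like this lambda usable in worker processes.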

Everything is working, but I am wondering whether I am using the most "future proof" solution. I have also seen dask, but I don't know whether it will work with classes that cause pickling issues (see Python: (Pathos) Multiprocessing vs. class methods). From the docs, it uses futures, so I am assuming that it won't do the job.

Secondly, I would like to use two servers at the same time. I have seen that this is possible with pathos (and also dask), but I don't understand exactly how it works. This answer https://stackoverflow.com/a/26948258/6522112 only shows how to use one server. How about using two or more? I can't find any example of that, although it seems possible according to the package info.

Thanks for your help!

Here is a great introduction to setting up a distributed cluster that can be used as a dask scheduler in different computing environments: https://www.youtube.com/watch?v=uQro_CaP9Fo. However, the API has changed over time, so I recommend having a look at https://distributed.readthedocs.io/en/latest/quickstart.html. How to use the distributed scheduler in dask is described here: http://dask.pydata.org/en/latest/scheduler-overview.html — Arco Bast, Oct 21 '16 at 20:51
I went down exactly the same path (from multiprocessing to dask), so my experience might be of use. So far, I try to use dask as much as possible, since it provides more than plain multiprocessing. If this functionality is relevant for your application, give it a try. However, I sometimes ran into strange performance issues that I was not able to track down, despite the very responsive community around dask. I am a beginner in Python, and you might be much faster than me at debugging those problems. In the rare cases where dask does not work as expected, I currently fall back on pathos. — Arco Bast, Oct 21 '16 at 21:28
You may also want to do some reading on `gevent`, `tornado`, and `asyncio`. See http://stackoverflow.com/questions/40166757/speeding-up-urlib-urlretrieve/40187270#40187270. — boardrider, Oct 22 '16 at 01:00
@ArcoBast: so you're saying that `dask` would be the way to go. I'll look at the docs. — tupui, Oct 22 '16 at 09:47
So did you reach a conclusion yet? I also heard that someone tried protobuf for pickling, so maybe that's another solution? — user2189731, Mar 16 '18 at 05:52
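
The two-server setup asked about in the question can be sketched with `dask.distributed`, under these assumptions: one scheduler process is running (started with `dask-scheduler`, or `dask scheduler` in recent releases), and each server runs one or more workers pointed at the scheduler's address (`dask-worker`, or `dask worker`). The host name `scheduler-host` and the helper names `run_on_cluster` and `main` are hypothetical. 8786 is the scheduler's default port.

```python
def run_on_cluster(client, func, items):
    """Apply func to every item on the cluster and return the results in order."""
    futures = client.map(func, items)  # one future per item, spread over the workers
    return client.gather(futures)      # block until all results have arrived


def main(scheduler_address="tcp://scheduler-host:8786"):
    # Assumption: dask.distributed is installed and a scheduler is reachable
    # at this (hypothetical) address, with workers from both servers attached.
    from dask.distributed import Client

    client = Client(scheduler_address)
    try:
        return run_on_cluster(client, lambda x: x * x, range(10))
    finally:
        client.close()
```

The client code does not change with the number of servers: adding a second machine only means starting more workers against the same scheduler address. On the pickling concern, `dask.distributed` serializes functions with cloudpickle, which handles lambdas and interactively defined functions, so the futures-style interface does not imply the limits of the standard `pickle` module. Whether it covers every class that dill handles would need to be tested against the actual application.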

0 Answers