I had the following code running serially to request a bunch of urls:

# takes 5 hours
data = []
for url in urls:
    data.append(request_url(url))

I then converted it to run in parallel using multiprocessing:

# takes 50 minutes
from multiprocessing import Pool

p = Pool(20)
data = p.map(request_url, urls)

If I wanted to improve the speed even further, how would I spread this process across multiple servers? What would be a good method for doing this?

David542
    There is as yet no "plain vanilla standard" for how best to distribute work among multiple servers. If you totally trust your servers and network, `RabbitMQ` may offer one approach; if you don't, something more solid such as `MapReduce`, see e.g. https://github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/3-MapReduce-for-Python, or its popular dialect `Hadoop`, may free you from substantial "housekeeping work". But it's important that you consider and express what constraints, exactly, you need to satisfy! – Alex Martelli Jan 03 '15 at 03:17
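
For concreteness, here is a rough sketch of the queue-based option using RabbitMQ through the pika client. The broker host, queue name, and the reuse of request_url are placeholders, and collecting the results back to one place is left out:

# Sketch only: assumes a RabbitMQ broker at "broker.example.com" and
# `pip install pika`. One machine enqueues the URLs; every worker server
# runs work() to pull URLs off the queue and fetch them.
import pika

def push_urls(urls):
    # producer: runs on one machine and enqueues every URL
    conn = pika.BlockingConnection(pika.ConnectionParameters(host='broker.example.com'))
    ch = conn.channel()
    ch.queue_declare(queue='urls')
    for url in urls:
        ch.basic_publish(exchange='', routing_key='urls', body=url)
    conn.close()

def work():
    # consumer: runs on each worker server
    conn = pika.BlockingConnection(pika.ConnectionParameters(host='broker.example.com'))
    ch = conn.channel()
    ch.queue_declare(queue='urls')

    def handle(channel, method, properties, body):
        request_url(body.decode())

    ch.basic_consume(queue='urls', on_message_callback=handle, auto_ack=True)
    ch.start_consuming()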

1 Answer


You could use pathos (and its sister package pyina) to help you figure out exactly how you want to distribute the work in parallel.

pathos provides a unified API for parallel processing across threading, multiprocessing, and sockets. The API provides Pool objects with blocking, non-blocking iterative, and asynchronous map and pipe methods. pyina extends this API to MPI and to schedulers like Torque and SLURM. You can generally nest these constructs, giving you heterogeneous and hierarchical parallel and distributed computing.
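
Roughly, that API looks like the sketch below (assuming pathos is installed; the pool size and the reuse of request_url and urls from the question are placeholders, and the import path can differ slightly between pathos versions):

# Sketch of the pathos Pool API (pip install pathos)
from pathos.pools import ProcessPool

pool = ProcessPool(nodes=20)

# blocking map: returns once every result is ready
data = pool.map(request_url, urls)

# non-blocking iterative map: yields results one at a time
for result in pool.imap(request_url, urls):
    print(result)

# asynchronous map: returns a handle immediately; fetch results later
handle = pool.amap(request_url, urls)
data = handle.get()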

You shouldn't need to modify your code at all to use pathos (and pyina).
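
For the multi-server case, something along these lines should be close to a drop-in change (a sketch only: the addresses are placeholders, and each remote host is assumed to be running a parallel python worker, i.e. a ppserver, that the pool can connect to):

# Sketch: same map, but spread over remote pp servers via pathos'
# parallel-python-backed pool. The addresses below are placeholders.
from pathos.pools import ParallelPool

pool = ParallelPool()
pool.ncpus = 20                                     # local worker processes
pool.servers = ('10.0.0.2:5653', '10.0.0.3:5653')   # remote ppserver addresses
data = pool.map(request_url, urls)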

There are a few examples of this on SO, including "Python Multiprocessing with Distributed Cluster Using Pathos" and "Python Multiprocessing with Distributed Cluster", and in the examples directories of pathos, pyina, and mystic, found here: https://github.com/uqfoundation

Mike McKerns
  • I should also mention that `IPython parallel` and `scoop` are both coming along (catching up?) in their hierarchical parallel computing, and may soon provide methods equivalent to what I mention above. The examples I give use `pathos.pp` (parallel python), while `IPython` favors `zmq`. I've also used `rpyc` before as an alternative, but didn't find it as lightweight as `pp`. – Mike McKerns Jan 06 '15 at 03:02