4

Given a large list (1,000+) of completely independent objects that each need to be manipulated through some expensive function (~5 minutes each), what is the best way to distribute the work over other cores? Theoretically, I could just cut the list into equal parts, serialize the data with cPickle (takes a few seconds), and launch a new Python process for each chunk--and it may just come to that if I intend to use multiple computers--but this feels like more of a hack than anything. Surely there is a more integrated way to do this using a multiprocessing library? Am I over-thinking this?

Thanks.

SkyNT
  • 718
  • 1
  • 8
  • 19

2 Answers

6

This sounds like a good use case for a multiprocessing.Pool; depending on exactly what you're doing, it could be as simple as

import multiprocessing

pool = multiprocessing.Pool(num_procs)             # num_procs = number of worker processes
results = pool.map(the_function, list_of_objects)  # blocks until every object is processed
pool.close()
pool.join()

This will pickle each object in the list independently. If that's a problem, there are various ways around it (each with its own trade-offs, and I don't know whether any of them work on Windows), but since your computation times are fairly long, the pickling overhead is probably irrelevant.
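As one illustration (my own untested sketch, relying on Unix fork semantics): keep the data in a module-level global defined before the Pool is created so the workers inherit it, and pass only indices through the task queue.

import multiprocessing

# Hypothetical loader for your data; on Unix the forked workers inherit
# this global for free, so only the small index is pickled per task.
list_of_objects = load_objects()

def work_on_index(i):
    return the_function(list_of_objects[i])

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    results = pool.map(work_on_index, range(len(list_of_objects)))
    pool.close()
    pool.join()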

Since you're running this for 5 minutes × 1,000 items ≈ 3.5 days divided by the number of cores, you probably want to save partial results along the way and print out some progress information. The easiest thing is probably to have the function you call save its results to a file or database or whatever; if that's not practical, you could also use apply_async in a loop and handle the results as they come in.
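For example, something along these lines (untested sketch; the callback and file name are just illustrative):

import multiprocessing

def save_result(result):
    # runs in the main process as each job finishes; appending to a file
    # means partial progress survives a crash
    with open('partial_results.txt', 'a') as f:
        f.write(repr(result) + '\n')

if __name__ == '__main__':
    pool = multiprocessing.Pool(num_procs)
    for obj in list_of_objects:
        pool.apply_async(the_function, (obj,), callback=save_result)
    pool.close()
    pool.join()   # wait for all the workers to finish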

You could also look into something like joblib to handle this for you; I'm not very familiar with it but it seems like it's approaching the same problem.
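From a quick look at its docs, the usage would be roughly this (I haven't verified it myself):

from joblib import Parallel, delayed

# n_jobs=-1 uses all available cores; verbose prints progress as batches finish
results = Parallel(n_jobs=-1, verbose=5)(
    delayed(the_function)(obj) for obj in list_of_objects)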

Danica
  • 28,423
  • 6
  • 90
  • 122
  • I second the recommendation for `multiprocessing.Pool()`. It's a simple way to put all the cores to work for you, and it has worked well for me on several occasions. – steveha Feb 06 '13 at 07:36
  • Could someone clarify what is returned in `results`? Will this be a list too? Say I am originally running `result[i] = the_function(list_of_objects[i])` in a simple loop through `list_of_objects` – SkyNT Feb 06 '13 at 07:56
  • Yes, it'll be a list, with the same contents as in your regular loop. – Danica Feb 06 '13 at 13:57
1

If you want to run the job on a single computer, use multiprocessing.Pool() as suggested by @Dougal in his answer.

If you would like to get multiple computers working on the problem, Python can do that too. I did a Google search for "python parallel processing" and found this:

Parallel Processing in Python

One of the answers recommends "mincemeat", a map/reduce solution in a single 377-line Python source file!

https://github.com/michaelfairley/mincemeatpy

I'll bet, with a little bit of work, you could use multiprocessing.Pool() to spin up a set of mincemeat clients, if you wanted to use multiple cores on multiple computers.

EDIT: I did some more research tonight, and it looks like Celery would be a good choice. Celery already handles running multiple workers per machine for you.

http://www.celeryproject.org/

Celery was recommended here:

https://stackoverflow.com/questions/8232194/pros-and-cons-of-celery-vs-disco-vs-hadoop-vs-other-distributed-computing-packag
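For a taste of what that looks like, here is a minimal sketch based on the Celery tutorial (the module name and broker URL are just placeholders, and it assumes a message broker such as Redis or RabbitMQ is already running; each worker machine starts a Celery worker process pointed at this module):

# tasks.py -- hypothetical module name
from celery import Celery

app = Celery('tasks',
             broker='redis://localhost:6379/0',    # placeholder broker URL
             backend='redis://localhost:6379/0')   # stores results so the driver can fetch them

@app.task
def the_function(obj):
    # placeholder body -- the real ~5 minute computation goes here
    return obj

# driver script, run on any machine:
#   from tasks import the_function
#   handles = [the_function.delay(obj) for obj in list_of_objects]
#   results = [h.get() for h in handles]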

steveha
  • 74,789
  • 21
  • 92
  • 117