
I have a list of objects, clusters, which I compare against each other using itertools.combinations and map():

likelihoods = map(do_comparison, itertools.combinations(clusters, 2))

To speed this up I use multiple processes instead:

from multiprocessing import Pool
pool = Pool(6)
likelihoods = pool.map_async(do_comparison, itertools.combinations(clusters, 2)).get() 

For small lists this works great. However, with 16700 objects in clusters (139,436,650 combinations), pool.map_async() uses huge amounts of memory and my PC quickly runs out of it, while the plain map() version has no memory problems at all.

My PC runs out of memory before the worker processes are even started, so my guess is that the pool is still dividing the data into chunks for the different processes up front. I therefore tried chunksize=1, so that only a small part has to be prepared at a time, but this did not work.
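
The exact call is not shown above, but the attempt presumably looked roughly like this (chunksize is an optional keyword argument of map_async):

likelihoods = pool.map_async(do_comparison, itertools.combinations(clusters, 2), chunksize=1).get()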

Are there other methods to let map_async() use less memory?
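
For reference, one direction that might keep memory bounded: as far as I can tell, Pool.map_async() turns an iterable that has no __len__ into a list before splitting it into chunks, whereas Pool.imap_unordered() pulls items from the generator only as the workers consume them. A sketch of that approach, assuming clusters and do_comparison are defined as above (untested at this scale):

from multiprocessing import Pool
import itertools

pool = Pool(6)
likelihoods = []
# imap_unordered draws pairs from the generator as workers become free,
# so the ~139 million tuples are never built up as one big list
for result in pool.imap_unordered(do_comparison,
                                  itertools.combinations(clusters, 2),
                                  chunksize=1000):
    likelihoods.append(result)
pool.close()
pool.join()

The results still end up in a single list here, just as in the original code; the difference is only on the input side.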

Niek de Klein
