
I just got to Python and I am still in the steep phase of the learning curve, so thank you in advance for any comments.

I have a big for loop to run (big in the sense of many iterations), for example:

for i in range(10000):
    for j in range(10000):
        f((i, j))

I thought that parallelizing it would be a common question, and after hours of searching on Google I arrived at a solution using the "multiprocessing" module, as follows:

from multiprocessing import Pool

pool = Pool()
x = pool.map(f, [(i, j) for i in range(10000) for j in range(10000)])

This works when the loop is small. However, it is really slow if the loop is large, or sometimes a memory error occurs if the loops are too big. It seems that Python generates the list of arguments first and then feeds the list to the function "f", even when using xrange. Is that correct?

So this parallelization does not work for me because I do not really need to store all arguments in a list. Is there a better way to do this? I appreciate any suggestions or references. Thank you.

user2727768
  • In short: it doesn't make any sense, because Python uses the Global Interpreter Lock, which prevents multiple threads from running concurrently. You might still have some benefit if you use Jython or IronPython. http://wiki.python.org/moin/GlobalInterpreterLock – Giulio Franco Aug 29 '13 at 19:26
  • @GiulioFranco He's using `multiprocessing` though, not `threading`, so the GIL doesn't come into play. – Silas Ray Aug 29 '13 at 19:27

1 Answer


It seems that Python generates the list of arguments first and then feeds the list to the function "f", even when using xrange. Is that correct?

Yes, because you're using a list comprehension, which explicitly asks it to generate that list.

(Note that xrange isn't really relevant here, because you only have two ranges at a time, each 10K long; compared to the 100M of the argument list, that's nothing.)

If you want it to generate the values on the fly as needed, instead of all 100M at once, you want to use a generator expression instead of a list comprehension, which is almost always just a matter of turning the brackets into parentheses:

x = pool.map(f, ((i, j) for i in range(10000) for j in range(10000)))

However, as you can see from the source, map will ultimately just make a list if you give it a generator, so in this case, that won't solve anything. (The docs don't explicitly say this, but it's hard to see how it could pick a good chunksize to chop the iterable into if it didn't have a length…).
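To illustrate, this is roughly what Pool.map/map_async do before dispatching any work (a paraphrase of CPython's multiprocessing/pool.py, not the verbatim source; the details vary between Python versions):

# paraphrased sketch of Pool.map_async, not the actual source
if not hasattr(iterable, '__len__'):
    iterable = list(iterable)   # a generator gets materialized into a full list right here
if chunksize is None:
    chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
    if extra:
        chunksize += 1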

And, even if that weren't true, you'd still just run into the same problem again with the results, because pool.map returns a list.

To solve both problems, you can use pool.imap instead. It consumes the iterable lazily, and returns a lazy iterator of results.

One thing to note is that imap does not guess at the best chunksize if you don't pass one, but just defaults to 1, so you may need a bit of thought or trial and error to optimize it.
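For example (the chunksize of 100 below is just an illustrative guess, not a recommendation; the right value depends on how expensive each call to f is):

# reusing the pool and f from above; f takes a single (i, j) tuple
args = ((i, j) for i in range(10000) for j in range(10000))
for result in pool.imap(f, args, 100):   # 100 is the chunksize
    pass  # handle each result here as it arrives, in argument order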

Also, imap will still queue up some results as they come in, so it can feed them back to you in the same order as the arguments. In pathological cases, it could end up queuing up (poolsize-1)/poolsize of your results, although in practice this is incredibly rare. If you want to solve this, use imap_unordered. If you need to know the ordering, just pass the indexes back and forth with the args and results:

args = ((i, j) for i in range(10000) for j in range(10000))

def indexed_f(indexed_arg):
    # imap_unordered passes each item as a single argument, so unpack it here
    index, (i, j) = indexed_arg
    return index, f((i, j))

results = pool.imap_unordered(indexed_f, enumerate(args))
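You then just iterate over results as they arrive; the index tells you which arguments each value came from (handle below is a placeholder for whatever you actually want to do with each result):

for index, value in results:
    handle(index, value)  # handle is a placeholder, not a real function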

However, I notice that in your original code, you're not doing anything at all with the results of f((i, j)). In that case, why even bother gathering up the results at all? You can just go back to the loop:

for i in range(10000):
    for j in range(10000):
        # note the extra nesting: apply_async's second argument is the
        # argument tuple, and f takes a single (i, j) tuple
        pool.apply_async(f, ((i, j),))
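With apply_async alone, the usual way to block until everything submitted has finished is to shut the pool down and wait for it, but then the pool is gone for good:

pool.close()   # no further tasks can be submitted
pool.join()    # blocks until every worker process has finished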

However, imap_unordered may still be worth using, because it provides a very easy way to block until all of the tasks are done, while still leaving the pool itself running for later use:

from collections import deque

def consume(iterator):
    # a zero-length deque exhausts the iterator without storing anything
    deque(iterator, maxlen=0)

x = pool.imap_unordered(f, ((i, j) for i in range(10000) for j in range(10000)))
consume(x)
abarnert
  • I would still use `xrange` just for the sake of consistency here, and in the event that the `n` in `range(n)` grows at some later point. – Anorov Aug 29 '13 at 19:33
  • @Anorov: Consistency with _what_? I'd probably use it too… but it isn't going to make much difference. (If n grows, n**2 is going to grow even faster.) – abarnert Aug 29 '13 at 19:35
  • @abarnert Consistency in that the entire expression is lazy. And you're right in that it would make little difference, but I think it's good to be in the habit of using `xrange` for Python 2.x. – Anorov Aug 29 '13 at 19:36
  • @Anorov: Agreed that it's a good habit to have, and that's exactly why I'd use it. I was just explaining why it isn't actually relevant to the problem the OP is trying to solve. – abarnert Aug 29 '13 at 19:36
  • 1
    @Anorov, thank you! I tried generator instead of list comprehension. It seems that python would still produce a sequence of arguments first and feed it to the function. So I think imap is the ways to go. Also I am sorry for not being clear in the post that I would like to have the results of f(i,j), so both imap_async and apply_sync would work with your trick of passing index as arguments. Thank you again for your help. – user2727768 Aug 30 '13 at 01:55
  • @user2727768: You're right; `map` and `map_async` actually _do_ make a list of anything without a `__len__` method. (Which makes sense; how else could they automatically pick a `chunksize` for you?) Good catch; let me edit the answer. – abarnert Aug 30 '13 at 17:31
  • @abarnert, a continuing question: when I use apply_async as you suggested with a callback function, only 5 out of 16 cores are being used. Any ideas? Thank you in advance – user2727768 Sep 18 '13 at 04:04
  • @user2727768: Good question. It's possible `multiprocessing.cpu_count()` is only returning 4 instead of 16 for some reason (so a default pool only has 4 processes, plus the main process); call it and check. It's also possible that you've some (implicit or explicit) locks somewhere, or that the overhead of pickling/copying/unpickling tasks is the majority of your work, but neither one seems plausible for your example. Or your callback function could be so slow that the main process can't even send the next tasks to the pool fast enough. Without more code, it's hard to be sure. – abarnert Sep 18 '13 at 17:19
  • @abarnert thank you for your comments. I really appreciate it. I saw 16 worker processes so it should not be a problem of cpu lock. The callback sort of calculates the histogram of values returned by the function. There are only 50 bins, so I thought it would be fast for the callback to decide which bin to put the values in. As you said, it is complicated and hard to see without code, so I started a post. Thank you. – user2727768 Sep 18 '13 at 17:50
  • @user2727768: I assume you mean "should not be a problem of cpu _count_", not "_lock_", right? If so, that sounds like a good analysis. One more thing I forgot to suggest: Is the main process burning a lot of CPU? If not, that would rule out the callback blocking the apply loop, and various other similar possibilities. – abarnert Sep 18 '13 at 18:13
  • @abarnert: yes, I meant "count" :) I think you are getting close. The main process uses 100% cpu, while the other 4 use 25% each. Wait, after watching it for a couple of minutes, the main one decreases to 70% and the others stay almost the same. I am more confused... – user2727768 Sep 18 '13 at 20:38