Eliminating overhead in multiprocessing with pool

Question

I am currently in a situation where I have parallelized code called repeatedly and try to reduce the overhead associated with the multiprocessing. So, consider the following example, which deliberately contains no "expensive" computations:

import multiprocessing as mp
def f(x):
    # toy function
    return x*x

if __name__ == '__main__':
    for x in range(500):
        pool = mp.Pool(processes=2) 
        print(pool.map(f, range(x, x + 50)))
        pool.close()
        pool.join()  # necessary?

This code takes 53 seconds compared to 0.04 seconds for the sequential approach.

First question: do I really need to call pool.join() in this case when only pool.map() is ever used? I cannot find any negative effects from omitting it and the runtime would drop to 4.8 seconds. (I understand that omitting pool.close() is not possible, as we would be leaking threads then.)

Now, while this would be a nice improvement, as a first answer I would probably get "well, don't create the pool in the loop in the first place". Ok, no problem, but the parallelized code actually lives in an instance method, so I would use:

class MyObject:
    def __init__(self):
        self.pool = mp.Pool(processes=2)
    def function(self, x):
        print(self.pool.map(f, range(x, x + 50)))

if __name__ == '__main__':
    my_object = MyObject()
    for x in range(500):
        my_object.function(x)

This would be my favorite solution as it runs in excellent 0.4 seconds.

Second question: should I call pool.close()/pool.join() somewhere explicitly (e.g. in the destructor of MyObject) or is the current code sufficient? (If it matters: it is ok to assume there are only a few long-lived instances of MyObject in my project.)

You don't need to call `pool.join()` since it blocks until all the processes it started to process the iterable argument have finished...and since you won't be calling it, there's no need for the `pool.close()` either. — martineau, Oct 31 '17 at 23:31
pool.close() is necessary, otherwise I get a "too many open files" exception (on Linux) — sourceror, Oct 31 '17 at 23:38
Good to know—and you've answered part of you own question. — martineau, Oct 31 '17 at 23:40

Nir Alfasi · Answer 1 · 2017-10-31T23:39:14.677

0

Of course it takes a long time: you keep allocating a new pool and destroying it for every x.

It will run much faster if instead you do:

if __name__ == '__main__':
    pool = mp.Pool(processes=2) # allocate the pool only once
    for x in range(500):
        print(pool.map(f, range(x, x + 50)))

    pool.close() # close it only after all the requests are submitted 
    pool.join() # wait for the last worker to finish

Try that and you'll see it now works much faster.

Here are links to the docs for join and close:

Once close is called you can't submit more tasks to the pool, and join waits till the last worker finished its job. They should be called in that order (first close then join).

edited Oct 31 '17 at 23:39

answered Oct 31 '17 at 23:32

Nir Alfasi

53,191
11
86
129

The question rather pertains to the aspects of using a pool as an instance attribute, of which I could not find an example anywhere. Otherwise it would be a duplicate of (https://stackoverflow.com/questions/20387510/proper-way-to-use-multiprocessor-pool-in-a-nested-loop). – sourceror Oct 31 '17 at 23:57
Well, I already anticipated your answer in the second part of my post. My object-oriented approach has the same runtime as your solution, but the question is, am I supposed to put a pool.close() or pool.join() somewhere? – sourceror Nov 01 '17 at 13:50
@sourceror the last sentence in the answer - answers that exact question. Second, it's not just about runtime - it's about a simple mistake that once corrected the first approach should have about the same runtime as the second. – Nir Alfasi Nov 01 '17 at 15:57

score 0 · Answer 2 · answered Nov 01 '17 at 00:30

Well, actually you could pass already allocated pool as argument to your object:

class MyObject:
    def __init__(self, pool):
        self.pool = pool

    def function(self, x):
        print(self.pool.map(f, range(x, x + 50)))


if __name__ == '__main__':
    with mp.Pool(2) as pool:
        my_object = MyObject(pool)
        my_second_object = MyObject(pool)

        for x in range(500):
            my_object.function(x)
            my_second_object.function(x)

        pool.close()

I can not find a reason why it might be necessary to use different pools in different objects

That's true, although I would prefer to allow the end users who will be creating instances of MyObject to abstract away from this. — sourceror, Nov 01 '17 at 13:39

Eliminating overhead in multiprocessing with pool

2 Answers2