Python's parallel processing

Question

I am in the following setting: I have a method that takes an objective function f as input. As a subroutine of that method, I want to evaluate f on a small set of points. Since f is expensive to evaluate, I considered doing that in parallel. All the online examples I tried hang, even for trivial functions like squaring on sets of 5 points. They use the multiprocessing library, and I don't know what I am doing wrong. I am also not sure how to encapsulate the __name__ == "__main__" statement in my method. (Since it is part of a module, I guess I should use the module name instead of "__main__"?)

The code I have been using looks like this:

from multiprocessing.pool import Pool
from multiprocessing import cpu_count

x = [1,2,3,4,5]
num_cores = cpu_count()
def f(x):
    return x**2

if __name__ == "__main__":
    pool = Pool(num_cores)
    y = list(pool.map(f, x))
    pool.join()
    print(y)

When I execute this code in Spyder, it takes a bloody long time to finish.

So my main questions are: What am I doing wrong in this code? How can I encapsulate the __name__ statement when this code is part of a bigger method? Is it even worth parallelizing this? (One function evaluation can take multiple minutes, and in serial this adds up to a total runtime of hours.)

score 1 · Answer 1 · answered Sep 21 '18 at 08:58

According to the documentation:

close()

Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.

terminate()

Stops the worker processes immediately without completing outstanding work. When the pool object is garbage collected terminate() will be called immediately.

join()

Wait for the worker processes to exit. One must call close() or terminate() before using join().

So you should add a call to close() before join():

from multiprocessing.pool import Pool
from multiprocessing import cpu_count

x = [1,2,3,4,5]

def f(x):
    return x**2

if __name__ == "__main__":
    pool = Pool()
    y = list(pool.map(f, x))
    pool.close()
    pool.join()
    print(y)

You can call Pool without any argument and it will use cpu_count() by default:

If processes is None then the number returned by cpu_count() is used

About the if __name__ == "__main__" guard, read more information here.

So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want code that creates child processes to run only in the main program - so that should be protected by __name__ == '__main__'
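A minimal sketch of how this could look when the parallel evaluation lives inside a function of a module. The names evaluate_parallel and square are hypothetical and only serve as illustration. The function itself contains no __name__ check; the guard stays in the script that starts the program. This assumes f is defined at the top level of a module, so that it can be pickled and sent to the worker processes.

```python
from multiprocessing import Pool


def square(x):
    # Stand-in for the expensive objective function (hypothetical).
    # It is defined at module top level so worker processes can find it by name.
    return x ** 2


def evaluate_parallel(f, points, processes=None):
    """Evaluate f on every point using a pool of worker processes.

    processes=None lets Pool pick its default number of workers.
    """
    # Leaving the with-block terminates the pool; map() has already
    # collected all results at that point, so nothing is lost.
    with Pool(processes) as pool:
        return pool.map(f, points)


if __name__ == "__main__":
    # Only the entry-point script needs the guard, not evaluate_parallel().
    print(evaluate_parallel(square, [1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```

If evaluate_parallel is imported from another file, that importing script would carry the guard around its own top-level call instead.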

answered Sep 21 '18 at 08:58

Corentin Limier


Thanks for the explanation. It still just calculates forever without coming to an end, even though I call close() first. My question on the __name__ part was more of this nature: if I have the code I posted above inside another method that is part of a module, do I replace "__main__" with the module name? – Sep 21 '18 at 09:04
You should not put if __name__ == "__main__" inside any method. Be sure to understand why you need this line sometimes and where you should put it (https://docs.python.org/3/library/__main__.html) – Corentin Limier Sep 21 '18 at 09:12
Are you using Windows? – Sep 21 '18 at 09:14
My question is not about me considering to put __name__ == "__main__" inside a function but rather __name__ == . – Sep 21 '18 at 09:15
How do you launch your script? It works on Windows with Python 2.7 when I write my program into a file and execute it inside a Windows console. – Corentin Limier Sep 21 '18 at 09:25

score 0 · Answer 2 · answered Sep 21 '18 at 08:55

You might want to look into the chunksize argument of the map function that you are using.

On a large enough input list, a lot of your time is spent simply communicating the arguments to and from the separate parallel processes.

One symptom of this problem is that when you use something like htop all cores are firing but at < 100%.
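As a rough sketch of the chunksize idea, assuming a large list of cheap evaluations: map() accepts a chunksize argument that groups several items into one task, so fewer messages travel between the main process and the workers. The helper name map_in_chunks and the values 100_000 and 5_000 are made up for illustration and are not tuned recommendations.

```python
from multiprocessing import Pool


def square(x):
    # Cheap stand-in function; for work this small, communication dominates.
    return x ** 2


def map_in_chunks(f, items, chunksize, processes=None):
    # Each task handed to a worker carries `chunksize` items instead of one,
    # which reduces the number of round trips between processes.
    with Pool(processes) as pool:
        return pool.map(f, items, chunksize=chunksize)


if __name__ == "__main__":
    values = list(range(100_000))
    results = map_in_chunks(square, values, chunksize=5_000)
    print(results[:5])  # [0, 1, 4, 9, 16]
```

For a handful of points where each evaluation takes minutes, as in the question, the communication cost is likely negligible and chunksize should matter far less.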
