
For the following script (Python 3.6, Windows, Anaconda), I noticed that the library imports run once per worker process, and print('Hello') is also executed that many times.

I thought new processes would only be invoked for the func1 calls, not for the whole program. The actual func1 is a heavy CPU-bound task that will be executed millions of times.

Is this the right choice of framework for such a task?

import datetime
import pandas as pd
import numpy as np
from concurrent.futures import ProcessPoolExecutor

print("Hello")

def func1(x):
    return x


if __name__ == '__main__':
    print(datetime.datetime.now())    
    print('test start')

    with ProcessPoolExecutor() as executor:
        results = executor.map(func1, np.arange(1,1000))
        for r in results:
            print(r)

    print('test end')
    print(datetime.datetime.now())
casbby

1 Answer


concurrent.futures.ProcessPoolExecutor uses the multiprocessing module to do its multiprocessing.

And, as explained in the Programming guidelines, this means you have to protect any top-level code you don't want to run in every process in your __main__ block:

Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).

... one should protect the “entry point” of the program by using if __name__ == '__main__':

Notice that this is only necessary if using the spawn or forkserver start methods. But if you're on Windows, spawn is the default. And, at any rate, it never hurts to do this, and usually makes the code clearer, so it's worth doing anyway.

You probably don't want to protect your imports this way. After all, the cost of calling import pandas as pd once per core may seem nontrivial, but that only happens at startup, and the cost of running a heavy CPU-bound function millions of times will completely swamp it. (If not, you probably didn't want to use multiprocessing in the first place…) And usually, the same goes for your def and class statements (especially if they're not capturing any closure variables or anything). It's only setup code that's incorrect to run multiple times (like that print('hello') in your example) that needs to be protected.
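To make the distinction concrete, here is a minimal sketch (the prints are hypothetical, added for illustration): the unguarded top-level print re-runs in every worker under spawn, the def is harmless to re-run, and only the guarded block runs once.

```python
import os

# Under the spawn start method, this line runs once in the parent
# and again in every worker process that re-imports the module.
print(f"module imported in pid {os.getpid()}")

def func1(x):
    # re-running this def in each worker is harmless: it just
    # (re)creates the function object, with no side effects
    return x * x

if __name__ == '__main__':
    # this block runs only in the parent process
    print("parent-only setup goes here")
```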


The examples in the concurrent.futures doc (and in PEP 3148) all handle this by using the "main function" idiom:

def main():
    # all of your top-level code goes here

if __name__ == '__main__':
    main()

This has the added benefit of turning your top-level globals into locals, to make sure you don't accidentally share them (which can especially be a problem with multiprocessing, where they get actually shared with fork, but copied with spawn, so the same code may work when testing on one platform, but then fail when deployed on the other).
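Applied to the script from the question, the idiom might look like this (a sketch: func1 is still just a placeholder, and range replaces np.arange to keep the example dependency-free):

```python
import datetime
from concurrent.futures import ProcessPoolExecutor

def func1(x):
    # placeholder for the real CPU-bound work
    return x

def main():
    # former top-level code: these names are now locals of main(),
    # so they cannot be accidentally shared with worker processes
    print(datetime.datetime.now())
    print('test start')
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(func1, range(1, 1000)))
    print('test end')
    print(datetime.datetime.now())
    return results

if __name__ == '__main__':
    main()
```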


If you want to know why this happens:

With the fork start method, multiprocessing creates each new child process by cloning the parent Python interpreter and then just starting the pool-servicing function up right where you (or concurrent.futures) created the pool. So, top-level code doesn't get re-run.

With the spawn start method, multiprocessing creates each new child process by starting a clean new Python interpreter, importing your code, and then starting the pool-servicing function. So, top-level code gets re-run as part of the import.
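You can check which start method is in effect with multiprocessing.get_start_method, and pin one via a context object (which, on Python 3.7+, can be passed to ProcessPoolExecutor as mp_context); a quick sketch:

```python
import multiprocessing as mp

# report the platform default start method:
# 'spawn' on Windows (and macOS since Python 3.8), 'fork' on Linux
print(mp.get_start_method())

# a context object pins a specific start method; on Python 3.7+ it can
# be passed along as ProcessPoolExecutor(mp_context=ctx)
ctx = mp.get_context('spawn')
print(ctx.get_start_method())
```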

abarnert
  • the last paragraph is exactly what I was after. Best lesson on the difference between exploratory-style Python programming and production-grade programming. After reading the programming guidelines, I painfully realized the argument object passed to func1 needs to be picklable. It really makes me think the object needs to be a pure data object, without all the object methods. – casbby Jun 13 '18 at 00:02
  • @casbby If your object is picklable in principle but not out of the box, and it seems like a pain to write the pickler hooks for it, you should definitely take a look at [`dill`](https://pypi.org/project/dill/) to see if it automatically handles it for you. – abarnert Jun 13 '18 at 00:45
  • @casbby Also, the docs for `multiprocessing` are just huge, and the way they're organized is not friendly for first-time readers—there's a great overview, then a more detailed but less readable overview without enough links to the reference, then a bunch of reference info, then programming guidelines after the reference… You really need to sit down and read the whole thing at least once, and it's not fun the first time. – abarnert Jun 13 '18 at 00:47