
I was running a simple multiprocessing example in my IPython interpreter (IPython 7.9.0, Python 3.8.0) on my MacBook and ran into a strange error. Here's what I typed:

In [1]: from concurrent.futures import ProcessPoolExecutor

In [2]: executor=ProcessPoolExecutor(max_workers=1)

In [3]: def func():
   ...:     print('Hello')

In [4]: future=executor.submit(func)

However, I received the following error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)                                   
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/process.py", line 233, in _process_worker
    call_item = call_queue.get(block=True)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'func' on <module '__main__' (built-in)>

Furthermore, trying to submit the job again gave me a different error:

In [5]: future=executor.submit(func)
---------------------------------------------------------------------------
BrokenProcessPool                         Traceback (most recent call last)
<ipython-input-5-42bad1a6fe80> in <module>
----> 1 future=executor.submit(func)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/process.py in submit(*args, **kwargs)
    627         with self._shutdown_lock:
    628             if self._broken:
--> 629                 raise BrokenProcessPool(self._broken)
    630             if self._shutdown_thread:
    631                 raise RuntimeError('cannot schedule new futures after shutdown')

BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore

As a sanity check, I typed almost the same code into a Python file and ran it from the command line (python3 test.py). It worked fine.

Why does IPython have an issue with my test?

EDIT:

Here's the Python file that worked fine.

from concurrent.futures import ProcessPoolExecutor as Executor

def func():
    print('Hello')

if __name__ == '__main__':
    with Executor(1) as executor:
        future = executor.submit(func)
        print(future.result())
Daniel Walker
  • What is your environment? I ran your code in IPython (7.14) on Ubuntu and it worked fine. I know of multiprocessing issues related to Windows, but I don't have a Windows machine to test on. If you are running on Windows, please add that to the question, as it might be relevant. – Hannu May 21 '20 at 08:20
  • I'm running on a MacBook. I've added that to the OP. – Daniel Walker May 21 '20 at 15:22
  • I just upgraded IPython to 7.14 and ran it again. Same error. – Daniel Walker May 21 '20 at 15:32

2 Answers


TL;DR:

import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

# create child processes using 'fork' context
executor = ProcessPoolExecutor(max_workers=1, mp_context=mp.get_context('fork'))

This is in fact caused by Python 3.8 on macOS switching to the "spawn" method for creating child processes, as opposed to "fork", which was the default prior to 3.8. Here are some essential differences:

Fork:

  • Clones the data and code of the parent process, thereby inheriting the state of the parent program.
  • Any modifications the child process makes to the inherited variables do not reflect back on the state of those variables in the parent process. The states essentially fork from this point (copy-on-write); see the sketch after this list.
  • All the libraries imported in the parent process are available in the child process's context. This also makes this method fast, since child processes don't have to re-import libraries (code) and variables (data).
  • This comes with some downsides, especially with respect to forking multithreaded programs.
  • Some libraries with C backends, like TensorFlow and OpenCV, are not fork-safe and can cause the child process to hang in a non-deterministic manner.
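
To make the copy-on-write point concrete, here is a minimal sketch (my own, not from the answer; the names are illustrative): the child's modification of an inherited variable never reaches the parent.

import multiprocessing as mp

counter = 0  # inherited by the forked child as a copy-on-write snapshot

def bump():
    global counter
    counter += 1
    print('child sees', counter)          # child sees 1

if __name__ == '__main__':
    ctx = mp.get_context('fork')          # 'fork' is POSIX-only
    p = ctx.Process(target=bump)
    p.start()
    p.join()
    print('parent still sees', counter)   # parent still sees 0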

Spawn:

  • Creates a fresh interpreter for the child process without inheriting code or data.
  • Only the necessary data/arguments are sent to the child process, which means variables, thread locks, file descriptors, etc. are not automatically available to the child process -- this avoids hard-to-catch bugs.
  • This method also comes with some drawbacks: since data/arguments need to be sent to the child process, they must also be picklable. Some objects with internal locks/mutexes, like Queues, are not picklable, and pickling heavier objects like data frames and large numpy arrays is expensive (see the sketch after this list).
  • Unpickling objects in the child process re-imports the associated libraries, if any. This again adds to the time.
  • Since the parent code is not cloned into the child process, you need to use the if __name__ == '__main__' guard while creating a child process. Not doing so makes the child process unable to import code from the parent process (now running as main). This is also why your program works when used with the guard.
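
To make the pickling constraint and the guard concrete, here is a minimal sketch (mine, not the answer's; square is an illustrative name):

import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def square(x):      # module-level functions pickle by reference, so this works
    return x * x

if __name__ == '__main__':  # required under 'spawn': the child re-imports this file
    ctx = mp.get_context('spawn')
    with ProcessPoolExecutor(max_workers=1, mp_context=ctx) as executor:
        print(executor.submit(square, 3).result())   # 9
        # executor.submit(lambda x: x * x, 3).result()
        # would fail: lambdas cannot be pickled for the worker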

If you are mindful that fork can have unpredictable effects, caused either by your own program or by an imported non-fork-safe library, you can either:

  • (a) globally set the multiprocessing context to use the 'fork' method:
import multiprocessing as mp

mp.set_start_method("fork")

Note that this sets the context globally; neither you nor any imported library will be able to change it once it is set.
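
One related caveat (my addition, based on the documented API): set_start_method raises a RuntimeError if the start method has already been fixed, e.g. by an earlier call; the documented force=True flag overrides it.

import multiprocessing as mp

try:
    mp.set_start_method("fork")
except RuntimeError:
    # the method was already set elsewhere; override it (use with care)
    mp.set_start_method("fork", force=True)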

  • (b) locally set the context by using multiprocessing's get_context method:
import multiprocessing as mp
mp_fork = mp.get_context('fork')

# mp_fork has all the attributes of mp so you can do:
mp_fork.Process(...)  
mp_fork.Pool(...)

# using the local context will not change global behaviour;
# this creates a child process using the global context
# (on macOS the default is fork in < 3.8 and spawn in 3.8+)
mp.Process(...)

# most multiprocessing-based functionality, like ProcessPoolExecutor,
# also takes a context as an argument:
from concurrent.futures import ProcessPoolExecutor
executor = ProcessPoolExecutor(max_workers=1, mp_context=mp_fork)
sanygeek

OK, I finally found out what is going on. The problem is macOS: by default, it uses the "spawn" method to create subprocesses. This is explained at https://docs.python.org/3/library/multiprocessing.html, which also describes how to change the method to fork (though it states that fork is unsafe on macOS).

With the spawn method, a new Python interpreter is started and your code is fed to it. It then tries to locate your function under __main__, but in this case there is no main module, because there is no program, just interactively interpreted commands.
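
You can see why the lookup fails: pickle serializes a function as a reference (module name plus qualified name), not as code. A quick illustration (my own, not from the answer):

import pickle

def func():
    print('Hello')

print(func.__module__)                             # __main__
payload = pickle.dumps(func)
print(b'__main__' in payload, b'func' in payload)  # True True

# The spawned child unpickles this reference by importing __main__ and
# looking up 'func' there. In IPython, __main__ is the interpreter's
# built-in module, 'func' is not found, and you get the AttributeError.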

If you change the start method to fork, your code runs (but note the caveat about this being unsafe):

In [1]: import multiprocessing as mp                                                                                     

In [2]: mp.set_start_method("fork")                                                                                      

In [3]: def func():
   ...:     print("foo")
   ...:

In [4]: from concurrent.futures import ProcessPoolExecutor                                                               

In [5]: executor=ProcessPoolExecutor(max_workers=1)                                                               

In [6]: future=executor.submit(func)                                                                                     

foo
In [7]:  

I am not sure how helpful this answer is, given the caveat, but it explains why the behaviour differs when you do have a program (your other attempt) and why it worked fine on Ubuntu, which uses "fork" by default.

Hannu
  • A-ha! That did it! Thanks! – Daniel Walker May 21 '20 at 15:44
  • Just thinking about this a bit more... when using the fork method, the order of things is (probably) important. If you define your `func()` after the subprocess is created, the subprocess might not have inherited it and you would get an exception again. In my example, I intuitively did what I have always done with multiprocessing and pools for this reason: define the function first and declare the pool later. But I realised that in your OP you have a different order. It may or may not work; I'm not sure whether ProcessPoolExecutor launches the process at pool declaration or at job submission. – Hannu May 22 '20 at 12:03