Multiprocessing isn't exactly a simple library, but once you're familiar with how it works, it's pretty easy to poke around and figure it out.
You usually want to start with context.py. This is where all the useful classes get bound depending on your OS and on the "context" you have active. There are four basic contexts: Fork, ForkServer, and Spawn for posix, plus a separate Spawn for windows. Each of these has its own "Popen" class (instantiated at start()) that launches a new process and handles the implementation differences.
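As a minimal sketch of how that looks from user code (the worker function and its argument are just illustrative), the start-method name you pass to get_context() selects which context, and therefore which Popen implementation, gets used:

```python
import multiprocessing as mp

def work(name):
    print(f"hello from {name}")

if __name__ == "__main__":
    # "fork" and "forkserver" are posix-only; "spawn" works everywhere.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=work, args=("child",))
    p.start()   # this is where the context's Popen implementation is created
    p.join()
```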
popen_fork.py
Creating a process literally calls os.fork(), and then in the child arranges to run BaseProcess._bootstrap(), which sets up some cleanup and then calls self.run() to execute the code you give it. No pickling occurs to start a process this way because the entire memory space gets copied (with some exceptions; see fork(2)).
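A rough illustration of that consequence, assuming a posix system (the module-level dict and worker function are placeholders): a fork child sees the parent's memory as it was at fork() time, with no pickling involved.

```python
import multiprocessing as mp

STATE = {"configured": False}   # hypothetical module-level state

def child():
    # Under "fork" this prints True: the child got a copy of the parent's
    # memory as it was at fork() time, with no pickling involved.
    # Under "spawn" it would print False, because the child re-imports
    # __main__ and the guarded code below never runs there.
    print(STATE["configured"])

if __name__ == "__main__":
    STATE["configured"] = True
    ctx = mp.get_context("fork")   # posix only
    p = ctx.Process(target=child)
    p.start()
    p.join()
```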
popen_spawn_xxxxx.py
I am most familiar with windows, but I assume the win32 and posix versions operate in a very similar manner. A new python process is created with a crafted command-line string that includes a pair of pipe handles to read from and write to. The new process imports the __main__ module (generally the script named in sys.argv[0]) in order to have access to all the needed references. Then it executes a small bootstrap function (from the command string) that reads and un-pickles a Process object from the pipe it was created with. Once it has the Process instance (a new object which is a copy, not just a reference to the original), it again arranges to call _bootstrap().
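A sketch of what that implies for user code (the worker and its payload are made up for illustration): everything attached to the Process object has to survive a round trip through pickle, and because the child re-imports __main__, the __main__ guard really matters under "spawn".

```python
import multiprocessing as mp

def work(payload):
    print("child got", payload)

# Under "spawn" the child re-imports this module, so anything outside the
# __main__ guard runs again in the child process.
if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    # The Process object (including its target and args) is pickled and
    # written to the child's pipe, so everything here must be picklable;
    # passing an open file handle as an argument, say, would fail at start().
    p = ctx.Process(target=work, args=({"answer": 42},))
    p.start()
    p.join()
```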
popen_forkserver.py
The first time a new process is created with the "forkserver" context, a new process is "spawn"ed running a simple server (listening on a pipe) which handles new-process requests. Subsequent process requests all go to the same server (based on import mechanics and a module-level global for the server instance). New processes are then "fork"ed from that server in order to save the time of spinning up a new python instance. These new processes, however, can't share any of the same (as in the same object, not a copy) Process objects, because the python process they were forked from was itself "spawn"ed. Therefore the Process instance is pickled and sent much like with "spawn". The benefits of this method include:

- The process doing the forking is single threaded, which avoids deadlocks.
- The cost of spinning up a new python interpreter is only paid once.
- The memory consumption of the interpreter, and of any modules imported by __main__, can largely be shared because "fork" generally uses copy-on-write memory pages.
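A rough usage sketch (posix only; set_forkserver_preload is optional, and "numpy" stands in for some hypothetical heavy dependency, skipped if not installed):

```python
import multiprocessing as mp

def work(n):
    print("result:", n * n)

if __name__ == "__main__":
    ctx = mp.get_context("forkserver")   # posix only
    # Optionally ask the server process to import heavy modules up front so
    # every forked child inherits them via copy-on-write pages.
    ctx.set_forkserver_preload(["numpy"])
    workers = [ctx.Process(target=work, args=(i,)) for i in range(3)]
    for p in workers:
        p.start()   # the first start() spawns the server; later ones just fork from it
    for p in workers:
        p.join()
```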
In all cases, once the split has occurred, you should consider the memory spaces totally separate, and the only communication between them is via pipes or shared memory. Locks and Semaphores are handled by an extension library (written in C), but are basically named semaphores managed by the OS. Queue, Pipe, and multiprocessing.Manager use pickling to synchronize the data they transfer (or, in the Manager's case, changes to the proxy objects it returns). The newer multiprocessing.shared_memory uses a memory-mapped file or buffer to share data (managed by the OS, like semaphores).
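A small sketch contrasting the two transports (names and sizes here are arbitrary): queue contents are pickled and pushed through a pipe, while shared_memory is written in place.

```python
import multiprocessing as mp
from multiprocessing import shared_memory

def producer(q, shm_name):
    q.put({"status": "ok"})                       # pickled, sent over a pipe
    shm = shared_memory.SharedMemory(name=shm_name)
    shm.buf[0] = 255                              # written in place, no pickling
    shm.close()

if __name__ == "__main__":
    q = mp.Queue()
    shm = shared_memory.SharedMemory(create=True, size=16)
    p = mp.Process(target=producer, args=(q, shm.name))
    p.start()
    print(q.get())        # {'status': 'ok'}
    p.join()
    print(shm.buf[0])     # 255
    shm.close()
    shm.unlink()
```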
To address your concern:
the code may have a bug and an object which is supposed to be read-only is inadvertently modified, leading to its pickling to be transferred to other processes.

This only really applies to multiprocessing.Manager proxy objects, because everything else either requires you to be very intentional about sending and receiving data, or uses some transfer mechanism other than pickling.
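For example, a Manager proxy that was meant to be read-only will happily accept a stray write, and that change is pickled, shipped to the manager process, and becomes visible to every other process holding a proxy (the dict contents below are made up):

```python
import multiprocessing as mp

def worker(settings):
    # This was meant to be a read-only lookup, but the stray write below is
    # pickled, sent to the manager process, and seen by everyone else.
    settings["oops"] = True

if __name__ == "__main__":
    with mp.Manager() as manager:
        settings = manager.dict({"mode": "fast"})   # hypothetical config
        p = mp.Process(target=worker, args=(settings,))
        p.start()
        p.join()
        print(dict(settings))   # {'mode': 'fast', 'oops': True}
```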