6

As explained in the thread What is being pickled when I call multiprocessing.Process? there are circumstances where multiprocessing requires little to no data to be transferred via pickling. For example, on Unix systems, the interpreter uses fork() to create the processes, and objects which already exist when multiprocessing starts can be accessed by each process without pickling.

However, I'm trying to consider scenarios beyond "here's how it's supposed to work". For example, the code may have a bug and an object which is supposed to be read-only is inadvertently modified, leading to it being pickled and transferred to other processes.

Is there some way to determine what, or at least how much, is pickled during multiprocessing? The information doesn't necessarily have to be in real-time, but it would be helpful if there were a way to get some statistics (e.g., number of objects pickled) which might give a hint as to why something is taking longer to run than intended because of unexpected pickling overhead.

I'm looking for a solution internal to the Python environment. Process tracing (e.g. Linux strace), network snooping, generalized IPC statistics, and similar solutions that might be used to count the number of bytes moving between processes aren't going to be specific enough to identify object pickling versus other types of communication.


Updated: Disappointingly, there appears to be no way to gather pickling statistics short of hacking up the module and/or interpreter sources. However, @Aaron explains this and clarified a few minor points, so I have accepted the answer.

sj95126
  • 6,520
  • 2
  • 15
  • 34

1 Answer

6

Multiprocessing isn't exactly a simple library, but once you're familiar with how it works, it's pretty easy to poke around and figure it out.

You usually want to start with context.py. This is where all the useful classes get bound depending on OS, and... well... the "context" you have active. There are 4 basic contexts: Fork, ForkServer, and Spawn for posix; and a separate Spawn for windows. Each of these in turn has its own "Popen" (invoked from start()) to launch a new process, handling the separate implementations.
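For example, you can ask for a specific context and see which start methods your platform supports (class name and method list shown as observed on Linux; they vary by OS):

```python
import multiprocessing as mp

# The context object exposes the same API (Process, Queue, Lock, ...)
# as the top-level multiprocessing module, but pinned to one start method.
ctx = mp.get_context("spawn")
print(type(ctx).__name__)          # SpawnContext
print(mp.get_all_start_methods())  # ['fork', 'spawn', 'forkserver'] on Linux
```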

popen_fork.py

creating a process literally calls os.fork(), and then in the child arranges to run BaseProcess._bootstrap(), which sets up some cleanup stuff then calls self.run() to execute the code you give it. No pickling occurs to start a process this way because the entire memory space gets copied (with some exceptions; see: fork(2)).
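A quick sketch of that inheritance: an object created before the fork is visible in the child without any pickling (posix only, since it relies on the "fork" start method):

```python
import multiprocessing as mp

data = {"numbers": list(range(10))}  # exists before the fork

def worker(q):
    # Under "fork" the child gets a copy-on-write view of `data`;
    # `data` itself was never pickled. (The int result we put on the
    # queue *is* pickled, since queues always serialize.)
    q.put(sum(data["numbers"]))

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # posix only
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # 45
    p.join()
```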

popen_spawn_xxxxx.py

I am most familiar with windows, but I assume both the win32 and posix versions operate in a very similar manner. A new python process is created with a simple crafted command line string, including a pair of pipe handles to read/write from/to. The new process will import the __main__ module (generally equal to sys.argv[0]) in order to have access to all the needed references. Then it will execute a simple bootstrap function (from the command string) that attempts to read and un-pickle a Process object from the pipe it was created with. Once it has the Process instance (a new object which is a copy; not just a reference to the original), it will again arrange to call _bootstrap().
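This re-import of __main__ is why spawn-based code needs the if __name__ == "__main__": guard: an unguarded Process(...).start() at module level would recurse. A minimal sketch:

```python
import multiprocessing as mp
import os

def worker():
    # Runs in the freshly spawned interpreter, which re-imported this
    # module and then un-pickled the Process object from its pipe.
    print("child pid:", os.getpid())

if __name__ == "__main__":  # guard: the spawned child re-imports __main__
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=worker)  # target and args must be picklable
    p.start()
    p.join()
    print(p.exitcode)  # 0
```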

popen_forkserver.py

The first time a new process is created with the "forkserver" context, a new process will be "spawn"ed running a simple server (listening on a pipe) which handles new process requests. Subsequent process requests all go to the same server (based on import mechanics and a module-level global for the server instance). New processes are then "fork"ed from that server in order to save the time of spinning up a new python instance. These new processes, however, can't have any of the same (as in same object, not a copy) Process objects, because the python process they were forked from was itself "spawn"ed. Therefore the Process instance is pickled and sent much like with "spawn". The benefits of this method include:
  • The process doing the forking is single threaded, avoiding deadlocks.
  • The cost of spinning up a new python interpreter is only paid once.
  • The memory consumption of the interpreter, and any modules imported by __main__, can largely be shared due to "fork" generally using copy-on-write memory pages.
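Using it looks the same as any other context; you can also preload modules into the server so every forked worker inherits them via copy-on-write instead of re-importing. A sketch (posix only; set_forkserver_preload is a documented but rarely used knob):

```python
import multiprocessing as mp

def worker(x, q):
    q.put(x * 2)

if __name__ == "__main__":
    ctx = mp.get_context("forkserver")  # posix only
    # Modules preloaded in the server process are inherited by every
    # forked worker via copy-on-write pages.
    ctx.set_forkserver_preload(["json"])
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(21, q))  # Process gets pickled to the server
    p.start()
    print(q.get())  # 42
    p.join()
```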


In all cases, once the split has occurred, you should consider the memory spaces totally separate, and the only communication between them is via pipes or shared memory. Locks and Semaphores are handled by an extension library (written in C), but are basically named semaphores managed by the OS. Queues, Pipes, and multiprocessing.Managers use pickling to synchronize changes to the proxy objects they return. The new-ish multiprocessing.shared_memory uses a memory-mapped file or buffer to share data (managed by the OS like semaphores).
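The shared_memory route moves raw bytes with no pickling at all; a minimal single-process sketch (attaching by name, exactly as a second process would):

```python
from multiprocessing import shared_memory

# Create a small shared block, then attach to it "from elsewhere" by
# name, as another process would. Raw bytes move; nothing is pickled.
shm = shared_memory.SharedMemory(create=True, size=4)
shm.buf[0] = 42

other = shared_memory.SharedMemory(name=shm.name)  # attach by name
print(other.buf[0])  # 42

other.close()
shm.close()
shm.unlink()  # release the OS resource once everyone is done
```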

To address your concern:

the code may have a bug and an object which is supposed to read-only is inadvertently modified, leading to its pickling to be transferred to other processes.

This only really applies to multiprocessing.Manager proxy objects, as everything else requires you to be very intentional about sending and receiving data, or uses a transfer mechanism other than pickling.
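If you do want rough numbers without rebuilding anything, about the closest pure-Python option is monkeypatching the private ForkingPickler class that CPython's queues, connections, and spawn startup funnel through. This is a fragile sketch against an implementation detail, not a supported API, and it only counts pickling done in the process where the patch is applied:

```python
from multiprocessing.reduction import ForkingPickler

# Running totals for everything serialized via ForkingPickler.dumps in
# *this* process (Queue.put, Connection.send, spawn startup, ...).
pickle_stats = {"calls": 0, "bytes": 0}

_original_dumps = ForkingPickler.dumps.__func__  # unwrap the classmethod

def _counting_dumps(cls, obj, protocol=None):
    buf = _original_dumps(cls, obj, protocol)
    pickle_stats["calls"] += 1
    pickle_stats["bytes"] += len(buf)
    return buf

ForkingPickler.dumps = classmethod(_counting_dumps)

# Anything pickled through the multiprocessing machinery is now counted:
ForkingPickler.dumps(list(range(1000)))
print(pickle_stats["calls"], pickle_stats["bytes"] > 0)  # 1 True
```

Children started with "spawn" or "forkserver" won't inherit the patch, and unpickling on the receiving side isn't counted, so treat the totals as a one-sided estimate.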

Aaron
  • 10,133
  • 1
  • 24
  • 40
  • regarding my last little bit. I mostly avoid managers for just this reason. It is much more clear to always explicitly send and receive data from other processes. For this I almost always use a Queue. For instances when I have a large array of data (image frames, etc...) that would be very inefficient to send and receive all the time, I do sometimes use shared memory, but I'm careful to control access so parts of the array are not written to concurrently. – Aaron Aug 29 '21 at 00:40
  • I appreciate the thoroughness, but as I said, I'm trying to quantify what **was** pickled, not intuitively understand what **should** be pickled. Let's say I'm using a queue to pass objects around. The objects already exist, and shouldn't be modified, so the pickling should be minimal. Now, oops, I modified an object by mistake, and put it in the queue, so the modified object has to be pickled. I'm hoping I can use a measurement of the amounts pickled to catch that mistake. – sj95126 Aug 31 '21 at 18:13
  • I'm OK with the answer being "that's not possible" if there really is no way to get that information out of the ```pickle``` module. – sj95126 Aug 31 '21 at 18:16
  • Every time something gets put into a queue it is serialized then deserialized using pickle. It doesn't matter if it has been changed or not. If you want, you could re-write then re-compile the `_pickle` module to log to a file, but that's pretty involved, and could break things. – Aaron Aug 31 '21 at 22:49
  • "Let's say I'm using a queue to pass objects around." therefore it gets pickled every single time. `queue.get` always returns a new object which is effectively a copy of what was `put` onto the queue. It is never the "same" object because the separate processes cannot access each-other's memory space. Otherwise you would send a pointer to a common object. – Aaron Aug 31 '21 at 22:50
  • When you say "queue.get always returns a new object" do you mean that the "new object" is the value stored in the queue, or the entire object the value points to? If I have a large object (say, a bytearray) that exists before ```fork()```, all processes can see it after ```fork()```. Surely it doesn't need to pickle the entire object to pass the pointer around. The article linked at the top says that doesn't happen, at least in the case of functions. Or is it the case that ```queue``` has its own behavior? – sj95126 Sep 01 '21 at 13:55
  • Again, I don't want this to be about "method A does this, and method B does that". I want to know how many objects/bytes are being pickled, regardless of how or why or whether it's expected or whether the coder doesn't know the right way to do things. If that's not possible, then OK. – sj95126 Sep 01 '21 at 13:57
  • "Surely it doesn't need to pickle the entire object to pass the pointer around." That's exactly it... It is not possible to pass the pointer from one process to another, so it has to copy the entire object and all its data. After "fork" two copies will exist that are separate from each other. If you modify one the other won't be affected. They point to two different spaces in memory now. I guess I'm not understanding why you're worried about Pickling happening accidentally... If you call `Queue.put / get` it's pickled. If it uses a `Manager` it's pickled. Otherwise it's not. 100% of the time. – Aaron Sep 01 '21 at 22:41
  • I guess the answer to your question could be "If you're using queues, count the number of calls to `Queue.put / get`." – Aaron Sep 01 '21 at 22:46
  • I guess I'm just looking for a second level of confirmation. If you have sufficiently complex code, it's always *possible* to have made a mistake. Maybe you sent the wrong (very large) object through a queue. Or you added to the queue when you meant to add to a local list. The effects might not be obvious in the code behavior, but your code is taking longer to run than it seems like it should. If I could get a stat that says "you pickled 10GB of data" I'd at least have a clue where things went wrong (or where it didn't). That's all I'm trying to do. – sj95126 Sep 01 '21 at 22:59
  • It's no different than using a profiler. Sometimes knowing how your code is supposed to work isn't enough. – sj95126 Sep 01 '21 at 23:00
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/236662/discussion-between-aaron-and-sj95126). – Aaron Sep 01 '21 at 23:28