Why is the function passed to Pool.map pickled when multiprocessing uses fork as a starting method?

Question

On Linux, the multiprocessing module uses fork as the default start method for a new process. Why is it then necessary to pickle the function passed to map? As far as I understand, all the state of the process is cloned, including the functions. I can see why pickling is necessary if spawn is used, but not for fork.

I suspect that IPC is just the lowest common denominator that can target all three starting methods. Taking advantage of the inherited address space provided by `fork` would require a greater amount of specialized code than just assuming that pickling is still necessary. — chepner, Oct 18 '21 at 16:07
@chepner I suspect that's probably the main reason, but given that many of the problems with multiprocessing involve pickle, I would be very happy to have a multiprocessing flag preventing pickling unnecessarily. Otherwise, many of us are forced to use dill or pathos for example. — Jorge E. Cardona, Oct 18 '21 at 16:13
I'm not a core developer, so I don't want to speak for them, but I suspect that the *possibility* of using a 3rd-party module makes any change to the `multiprocessing` module a low-priority feature request. — chepner, Oct 18 '21 at 16:22
I asked a [recent question](https://stackoverflow.com/questions/68806714/determining-exactly-what-is-pickled-during-python-multiprocessing) that was more or less targeted at this very problem. The answer ultimately seems to be "you should understand how multiprocessing uses pickling and you can't measure it" which didn't really help. There does seem to be some non-obvious pickling going on in certain cases. — sj95126, Oct 18 '21 at 16:30
Perhaps there is an advantage to always pickling, so that code either works or crashes in the same way across platforms? — wim, Oct 18 '21 at 16:32
Who says that the function you’re calling existed at the time of the fork? — Davis Herring, Oct 18 '21 at 18:08
@DavisHerring Sure, that can happen. But I am more concerned with the simpler case where a "nice" function is defined as a lambda or as a nested function even before the Pool is constructed. — Jorge E. Cardona, Oct 20 '21 at 10:16

score 2 · Answer 1 · answered Oct 18 '21 at 22:19

Job methods like .map() don't start new processes, so exploiting fork at this point is not an option. Pool uses IPC to pass tasks to worker processes that are already running, and this always requires serialization (pickling). There seems to be a deeper misunderstanding about what pickling involves here, though.

For job methods like .map(), pickling your function just means that its module and qualified name are sent as strings. During unpickling, the receiving process basically looks that name up in its own global scope to get a reference to the function again.
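To make the point above concrete, here is a minimal sketch of what pickling a module-level function produces. The function `square` is a hypothetical example, not something from the question; the only library calls used are `pickle.dumps` and `pickle.loads`.

```python
import pickle


def square(x):
    """Hypothetical top-level function standing in for the one given to map()."""
    return x * x


# Pickling a function stores a reference (module name + qualified name),
# not the function's bytecode.
payload = pickle.dumps(square)

# Unpickling looks the name up again and returns the existing function object.
restored = pickle.loads(payload)

if __name__ == "__main__":
    print(len(payload), payload)   # a short byte string containing b"square"
    print(restored is square)
```

The payload stays small no matter how large the function body is, because only the names travel between processes.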

There is a difference between spawn and fork, but it materializes as soon as the worker processes boot up (which starts when the Pool is initialized). With the spawn context, a new worker has to build all reachable global objects from scratch; with fork they are already there. So with fork your function is cloned once during boot-up, which saves a little time.

When you send jobs later, unpickling the sent function in the worker, with any context, just means re-referencing the function from global scope. That's why the function needs to exist before you instantiate the pool and the workers are launched, even with the spawn context.

So the inconvenience of not being able to pickle local or unnamed functions (lambdas) is rooted in the problem of regaining a reference to your already existing function inside the worker processes. Whether spawn or fork was used to set up the worker processes beforehand makes no difference at this point.
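A small sketch of why lambdas and nested functions are the problem: they have no importable global name, so pickle cannot store a reference to them. The helpers `try_pickle` and `make_adder` are hypothetical names for illustration. The exact exception type differs between CPython versions, so the sketch catches both `pickle.PicklingError` and `AttributeError`.

```python
import pickle


def make_adder(n):
    """Return a nested (local) function; it has no global name to look up."""
    def add(x):
        return x + n
    return add


def try_pickle(obj):
    """Return "ok" if obj can be pickled, else the name of the exception raised."""
    try:
        pickle.dumps(obj)
        return "ok"
    except (pickle.PicklingError, AttributeError) as exc:
        return type(exc).__name__


if __name__ == "__main__":
    print(try_pickle(len))              # built-in, reachable by name
    print(try_pickle(lambda x: x + 1))  # no usable name
    print(try_pickle(make_adder(1)))    # local object
```

None of this involves multiprocessing at all, which is the point: the failure comes from name lookup in pickle, not from the start method.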

Darkonaut

Sure, a `Pool` is created first, but I can imagine passing that pool a set of functions to keep track of, which are later used in a map without pickling their names. There are ways to implement this pool/map scenario from scratch without pickling the function, only the data. Given that many problems in multiprocessing are related to pickle, it seems that Unix users are suffering unnecessarily because of the "uniform API across platforms" argument. – Jorge E. Cardona Oct 19 '21 at 14:27
@JorgeE.Cardona What kind of "suffering" are you talking about here, specifically? The qualified name is just a short string (< 100 bytes) and the pool needs some way to locate the worker's callable within the forked process anyway. Yes you could create a way to do that pre-fork, in theory, but it would be a different API than pool/map. – wim Oct 20 '21 at 17:49
@JorgeE.Cardona Which code would call these functions then? Workers immediately need to run some code after boot up or they shut down. They currently run a `worker()`-function which manages in- and outqueue and awaits a user-function to run together with some arguments as a task. What you describe would require exchanging this function with your custom implementation in the source code. It would result in hard coding your specific business logic, the rules for which other functions need to be called and when. Needless to say it would cease being general purpose code you'd seek in a stdlib. – Darkonaut Oct 20 '21 at 17:51
@JorgeE.Cardona ...It wouldn't even tackle major obstacles people run into, because you still separate code from data, so the only benefit left would be being able to pass functions without a global reference. If you intend to employ your own special worker-function, I don't see a way around using raw `Process` and some multiprocessing-queues, something `Pool` otherwise is managing for you. – Darkonaut Oct 20 '21 at 17:52
@Darkonaut I guess using the word "suffering" is unnecessary; I am referring to the many requests for help online about using lambdas or callables not defined at the top level of a module. I understand the requirements given how and when Pools are created before map is called. But I am also aware that for Linux users in particular, an implementation based on simple low-level mechanisms (os.fork, os.pipe, struct, and pickle for data) is possible, allowing the use of very generic callables. – Jorge E. Cardona Oct 25 '21 at 14:14
@JorgeE.Cardona Note the "suffering" question came from wim (not me). You don't even have to go low-level to exploit fork somewhat (even for one-way data); it's "just" that you can't put that code in the stdlib and have it be general purpose. – Darkonaut Oct 25 '21 at 14:54
@Darkonaut Oh, I didn't notice that. I'm sorry. On another note, after spending some time implementing a `map` based on os.fork and low-level mechanisms, I realized I can implement what I was trying to describe with just the initializer and initargs of Pool. I can pass the callable as an initarg to the initializer of the pool workers, store it as an attribute of a top-level function, pass that function to map, and use the attribute set during initialization. Sadly it ends up being 10x slower than my other implementation. – Jorge E. Cardona Oct 25 '21 at 15:36
@wim I guess using the word "suffering" is unnecessary; I am referring to the many requests for help online about using lambdas or callables not defined at the top level of a module. I can achieve what I am referring to with the initializer and initargs of Pool; those are not pickled. – Jorge E. Cardona Oct 25 '21 at 15:38
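The `initializer`/`initargs` workaround described in the last comments could look roughly like the sketch below. This is not the commenter's actual code. The names `fork_map`, `_init_worker`, `_call_task` and `_task` are hypothetical, and the callable is stored in a module-level variable rather than as a function attribute. It assumes a POSIX system where the fork start method is available. Under fork the `initargs` are inherited by the child instead of being pickled, which is why a lambda works here; under spawn the same call would fail.

```python
import multiprocessing as mp

_task = None  # per-worker slot, filled in by the initializer


def _init_worker(func):
    """Runs once in every worker; stores the inherited callable."""
    global _task
    _task = func


def _call_task(x):
    """Top-level trampoline: this is the only function pickled (by name)."""
    return _task(x)


def fork_map(func, items, processes=2):
    """Map an arbitrary callable over items using a fork-based Pool."""
    ctx = mp.get_context("fork")  # assumption: POSIX only
    with ctx.Pool(processes, initializer=_init_worker, initargs=(func,)) as pool:
        return pool.map(_call_task, items)


if __name__ == "__main__":
    print(fork_map(lambda x: x * x, range(5)))  # [0, 1, 4, 9, 16]
```

The items and results still go through pickle; only the callable avoids it. Each pool created this way is tied to one callable, which is the loss of generality discussed in the comments above.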