
I was doing multiprocessing in Python and hit a pickling error. Which makes me wonder: why do we need to pickle objects in order to do multiprocessing? Isn't fork() enough?

Edit: I kind of get why we need pickle for interprocess communication, but that's just for the data you want to transfer, right? Why does the multiprocessing module also try to pickle things like functions?

Matthew Smith
Gaaaaaaaa
    Possible duplicate of [Why does python multiprocessing pickle objects to pass objects between processes?](https://stackoverflow.com/questions/24563475/why-does-python-multiprocessing-pickle-objects-to-pass-objects-between-processes) – Alex Taylor Oct 01 '18 at 23:44
  • Strongly related to [Why does python multiprocessing pickle objects to pass objects between processes?](https://stackoverflow.com/questions/24563475/why-does-python-multiprocessing-pickle-objects-to-pass-objects-between-processes), but the question should better be "Why is pickle needed for multiprocessing module in python **instead of using os.fork()**?" -- or not? – colidyre Oct 01 '18 at 23:51
  • Pickle cannot serialise functions. Well, you can pickle a function, but what really happens is that the pickle only contains the function's name, which must be resolvable to the same function on the receiving side. – Lie Ryan Oct 02 '18 at 03:31

1 Answer


Which makes me wonder why do we need to pickle the object in order to do multiprocessing?

We don't need pickle specifically, but we do need to communicate between processes, and pickle happens to be a very convenient, fast, and general serialization method for Python. Serialization is one way to communicate between processes; memory sharing is the other. Unlike memory sharing, serialization works even when the processes aren't on the same machine. For example, PySpark uses serialization very heavily to communicate between executors (which are typically different machines).
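To make the serialization path concrete, here's a minimal sketch of what happens when an object is handed to another process: it's turned into bytes that can cross a pipe or socket, then reconstructed on the other side (the payload here is made up for illustration):

```python
import pickle

# A plain data object we might want to send to another process.
payload = {"task": "resize", "sizes": [128, 256], "quality": 0.9}

# pickle turns the object into bytes that can travel over a pipe,
# socket, or multiprocessing.Queue...
wire_bytes = pickle.dumps(payload)

# ...and the receiving process rebuilds an equivalent object.
received = pickle.loads(wire_bytes)
print(received == payload)  # True
```

`multiprocessing.Queue` and `Pipe` do essentially this for you behind the scenes, which is why anything you put on them has to be picklable.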

Addendum: There are also issues with the GIL (Global Interpreter Lock) when sharing memory in Python (see comments below for detail).

isn't fork() enough?

Not if you want your processes to communicate and share data after they've forked. fork() clones the current memory space, but changes in one process won't be reflected in another after the fork (unless we explicitly share data, of course).
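A quick POSIX-only sketch (`os.fork()` isn't available on Windows) showing that after the fork, each process has its own independent copy of memory:

```python
import os

counter = 0  # lives in this process's memory

pid = os.fork()  # child receives a copy of the memory space
if pid == 0:
    # Child: increments its own copy of counter, then exits.
    counter += 1
    os._exit(0)

os.waitpid(pid, 0)
# Parent: the child's increment is invisible here.
print(counter)  # still 0
```

So fork() alone gets you concurrent processes, but any results the children compute still have to be communicated back somehow, and that's where serialization comes in.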

I kind of get why we need pickle to do interprocess communication, but that is just for the data you want to transfer right? why does the multiprocessing module also try to pickle stuff like functions etc?

  1. Sometimes complex objects (i.e. the "other stuff"? not totally clear what you meant here) contain the data you want to manipulate, so we'll definitely want to be able to send that "other stuff".

  2. Being able to send a function to another process is incredibly useful. You can create a bunch of child processes and then send them all a function to execute concurrently that you define later in your program. This is essentially the crux of PySpark (again a bit off topic, since PySpark isn't multiprocessing, but it feels strangely relevant).

  3. There are some functional purists (mostly the LISP people) who argue that code and data are the same thing. So for some, it's not much of a line to draw.
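As a minimal sketch of point 2, here's `multiprocessing.Pool` shipping a function to worker processes. Note the function must be defined at module top level, because (as mentioned in the comments) pickle records only its qualified name, which the worker resolves on its side; the `square` helper is just an example:

```python
from multiprocessing import Pool

def square(x):
    # Top-level function: pickle sends it by qualified name,
    # so a worker process can look up and call the same function.
    # A lambda here would raise a PicklingError instead.
    return x * x

if __name__ == "__main__":
    with Pool(2) as pool:
        # Both the function reference and the arguments are pickled
        # on the way to the workers; results are pickled on the way back.
        print(pool.map(square, [1, 2, 3, 4]))  # [1, 4, 9, 16]
```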

Matt Messersmith
    The other drawback of memory sharing is it needs synchronisation. Python has a GIL that protects multiple threads from accessing the same object (including interpreter states like refcounts). multiprocessing uses serialisation because it uses message passing (e.g. Queue) as the primary form of IPC to avoid needing a multiprocess lock. – Lie Ryan Oct 02 '18 at 03:36