4

I'm playing with Python multiprocessing module to have a (read-only) array shared among multiple processes. My goal is to use multiprocessing.Array to allocate the data and then have my code forked (forkserver) so that each worker can read straight from the array to do their job.

While reading the Programming guidelines I got a bit confused.

It is first said:

Avoid shared state

As far as possible one should try to avoid shifting large amounts of data between processes.

It is probably best to stick to using queues or pipes for communication between processes rather than using the lower level synchronization primitives.

And then, a couple of lines below:

Better to inherit than pickle/unpickle

When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.

As far as I understand, queues and pipes pickle objects. If so, aren't those two guidelines conflicting?

Thanks.

Brandt
  • 5,058
  • 3
  • 28
  • 46

1 Answers1

2

The second guideline is the one relevant to your use case.

The first is reminding you that this isn't threading where you manipulate shared data structures with locks (or atomic operations). If you use Manager.dict() (which is actually SyncManager.dict) for everything, every read and write has to access the manager's process, and you also need the synchronization typical of threaded programs (which itself may come at a higher cost from being cross-process).

The second guideline suggests inheriting shared, read-only objects via fork; in the forkserver case, this means you have to create such objects before the call to set_start_method, since all workers are children of a process created at that time.

The reports on the usability of such sharing are mixed at best, but if you can use a small number of any of the C-like array types (like numpy or the standard array module), you should see good performance (because the majority of pages will never be written to deal with reference counts). Note that you do not need multiprocessing.Array here (though it may work fine), since you do not need writes in one concurrent process to be visible in another.

Davis Herring
  • 36,443
  • 4
  • 48
  • 76
  • In other words, the second guideline is for *reading* (better to inherit data with parent processes than pickle data with queues or pipes), while the first guideline is for *writing* (better to pickle data with queues or pipes than share data with shared memory or manager processes), i.e. **inherit** (read-only) > **pickle** (read–write) > **share** (read–write), right? – Géry Ogam Feb 08 '22 at 17:51
  • 1
    Pretty much: if something is read-only, there's no point in paying to serialize it, and if something is read-write, it's best to have it spend almost all of its time local to one process. If you can use shared memory for large amounts of data with only a few writes (that need IPC synchronization), that's a separate option to consider. – Davis Herring Feb 08 '22 at 19:25
  • Doesn’t inherit > pickle contradict the paragraph ‘Explicitly pass resources to child processes’ in the [Programming guidelines](https://docs.python.org/3/library/multiprocessing.html#programming-guidelines) of the `multiprocessing` module documentation which advocates the opposite? – Géry Ogam Feb 08 '22 at 20:56
  • @Maggyero: Locks aren’t read-only, and it can of course mean something very different to share an IPC handle than to have a separate one. I think each guideline just applies in different circumstances. – Davis Herring Feb 09 '22 at 00:25
  • I see. You said ‘The second guideline suggests inheriting shared, read-only objects via `fork`’ Don’t you think that the second guideline is instead referring to the start methods `spawn` and `forkserver` and by ‘inherit’ also includes the recreated parent’s Python objects outside of `if __name__ == '__main__':` statements by the execution of the parent’s script as `'__mp_main__'`? – Géry Ogam Feb 10 '22 at 17:28
  • 1
    @Maggyero: I can’t prove otherwise, but if that’s what it means by “inherit”, it needs to be reworded! – Davis Herring Feb 10 '22 at 22:55