If you are using the CPython (or PyPy) implementation of Python, then the global interpreter lock (GIL) will prevent more than one thread from executing Python bytecode at a time.
So if you are using such an implementation, you'll need to use multiple processes instead of multiple threads to take advantage of your 32 processors.
You could use the standard library's multiprocessing or concurrent.futures modules to spawn the worker processes. There are also many third-party options. Doug Hellmann's tutorial is a great introduction to the multiprocessing module.
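For example, here is a minimal sketch of a worker pool using concurrent.futures; do_work and tasks are hypothetical stand-ins for your own function and inputs:

    import concurrent.futures

    def do_work(item):
        # Placeholder for your CPU-bound work on one item.
        return item * item

    if __name__ == "__main__":
        tasks = range(100)
        # max_workers defaults to the machine's processor count,
        # so this would use all 32 processors.
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = list(executor.map(do_work, tasks))
        print(results[:5])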
Since you only need read-only access to the data structure, if you assign the complex data structure to a global variable before you spawn the processes, then all the processes will have access to this global variable.
When a worker process is created, the globals from the calling module are available to it. On Linux, where fork gives copy-on-write semantics, the spawned processes share the very same pages of memory as the parent, so no extra memory is required. A page is copied to a new location only when a process modifies it.
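A sketch of that approach, assuming a Linux host where multiprocessing forks its workers by default; big_structure and lookup are hypothetical names:

    import multiprocessing

    # Built at module level, before the pool is created, so fork()'d
    # workers inherit it copy-on-write instead of receiving a copy.
    big_structure = {i: i * i for i in range(1_000_000)}

    def lookup(key):
        # Reads the inherited global; only `key` and the result are
        # pickled between processes, not the structure itself.
        return big_structure[key]

    if __name__ == "__main__":
        with multiprocessing.Pool() as pool:
            print(pool.map(lookup, [1, 2, 3]))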
On Windows, since there is no fork, each spawned process starts a fresh Python interpreter and re-imports the calling module, so each process requires memory for its own separate copy of the huge data structure. There must be some other way to share data structures on Windows, but I'm unaware of the details. (Edit: POSH may be a solution to the shared-memory problem, but I haven't tried it myself.)
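One practical consequence of that re-import is that on Windows the code that launches the workers must sit behind an if __name__ == "__main__" guard, or each freshly spawned process would try to spawn a pool of its own:

    import multiprocessing

    def worker(n):
        return n + 1

    # This block is not re-run when the module is re-imported by a
    # spawned child, which is what stops the recursion on Windows.
    if __name__ == "__main__":
        with multiprocessing.Pool() as pool:
            print(pool.map(worker, range(4)))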