
Do child processes spawned via multiprocessing share objects created earlier in the program?

I have the following setup:

import glob
import marshal
from multiprocessing import Pool

def do_some_processing(filename):
    for line in open(filename):
        if line.split(',')[0] in big_lookup_object:
            # something here
            pass

if __name__ == '__main__':
    big_lookup_object = marshal.load(open('file.bin', 'rb'))
    pool = Pool(processes=4)
    print pool.map(do_some_processing, glob.glob('*.data'))

I'm loading some big object into memory, then creating a pool of workers that need to make use of that big object. The big object is accessed read-only, I don't need to pass modifications of it between processes.

My question is: is the big object loaded into shared memory, as it would be if I spawned a process in unix/c, or does each process load its own copy of the big object?

Update: to clarify further - big_lookup_object is a shared lookup object. I don't need to split that up and process it separately. I need to keep a single copy of it. The work that I need to split up is reading lots of other large files and looking up the items in those files against the lookup object.

Further update: a database is a fine solution, memcached might be a better solution, and a file on disk (shelve or dbm) might be even better. In this question I was particularly interested in an in-memory solution. For the final solution I'll be using Hadoop, but I wanted to see if I can have a local in-memory version as well.

Parand
  • your code as written will call `marshal.load` for parent and for each child (each process imports the module). – jfs Mar 19 '09 at 00:08
  • You're right, corrected. – Parand Mar 19 '09 at 00:33
  • For "local in-memory" and if you'd like to avoid copying the following might be useful http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes – jfs Mar 19 '09 at 02:45
  • Share? No. A spawned process (e.g. via fork or exec) *is an exact duplicate of the calling process*... but in different memory. For one process to talk to another, you need *interprocess communication* (IPC), reading/writing to some *shared* memory location. – ron Nov 08 '18 at 18:54
  • You can use `functools.partial`; you would have to make `big_lookup_object` an argument of `do_some_processing`. It's also useful when you want to pass a lambda to `Pool.map()` or plain Python `map`. – chess Mar 25 '21 at 15:44

8 Answers

60

Do child processes spawned via multiprocessing share objects created earlier in the program?

No for Python < 3.8, yes for Python ≥ 3.8.

Processes have independent memory space.
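The "yes for Python ≥ 3.8" presumably refers to the multiprocessing.shared_memory module added in 3.8, which shares a raw block of bytes (not arbitrary Python objects) between processes by name. A minimal sketch:

#!/usr/bin/env python3
# multiprocessing.shared_memory (Python 3.8+): share a raw block of bytes by name.
from multiprocessing import Process, shared_memory

def worker(name):
    shm = shared_memory.SharedMemory(name=name)   # attach to the existing block
    print(bytes(shm.buf[:5]))                     # read it without copying the whole block
    shm.close()

if __name__ == '__main__':
    shm = shared_memory.SharedMemory(create=True, size=1024)
    shm.buf[:5] = b'hello'
    p = Process(target=worker, args=(shm.name,))
    p.start()
    p.join()
    shm.close()
    shm.unlink()   # release the block once every process is done with it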

Solution 1

To make best use of a large structure with lots of workers, do this.

  1. Write each worker as a "filter" – it reads intermediate results from stdin, does its work, and writes intermediate results to stdout.

  2. Connect all the workers as a pipeline:

    process1 <source | process2 | process3 | ... | processn >result
    

Each process reads, does work and writes.

This is remarkably efficient since all processes are running concurrently. The writes and reads pass directly through shared buffers between the processes.
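Such a filter stage might look like the following minimal sketch, assuming comma-separated records and a hypothetical transform() step standing in for the real work:

#!/usr/bin/env python3
# Minimal filter stage: read records from stdin, do the work, write results to stdout.
import sys

def transform(fields):
    return fields  # placeholder for the real per-record work

def main():
    for line in sys.stdin:
        fields = line.rstrip('\n').split(',')
        result = transform(fields)
        sys.stdout.write(','.join(result) + '\n')

if __name__ == '__main__':
    main()

Each stage of the process1 <source | process2 | ... pipeline is just such a script.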


Solution 2

In some cases, you have a more complex structure – often a fan-out structure. In this case you have a parent with multiple children.

  1. Parent opens source data. Parent forks a number of children.

  2. Parent reads source, farms parts of the source out to each concurrently running child.

  3. When the parent reaches the end of the source, it closes the pipes. Each child gets end-of-file and finishes normally.

The child parts are pleasant to write because each child simply reads sys.stdin.

The parent has a little bit of fancy footwork in spawning all the children and retaining the pipes properly, but it's not too bad.

Fan-in is the opposite structure. A number of independently running processes need to interleave their inputs into a common process. The collector is not as easy to write, since it has to read from many sources.

Reading from many named pipes is often done using the select module to see which pipes have pending input.
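A minimal sketch of the fan-out side, assuming a hypothetical worker.py child script that reads sys.stdin like the filter above (the fan-in/select collector is left out):

#!/usr/bin/env python3
# Fan-out sketch: the parent deals source lines round-robin to N children over pipes.
# 'worker.py' and 'source.data' are hypothetical names.
import subprocess
import sys

N = 4
children = [subprocess.Popen([sys.executable, 'worker.py'],
                             stdin=subprocess.PIPE)
            for _ in range(N)]

with open('source.data', 'rb') as source:
    for i, line in enumerate(source):
        children[i % N].stdin.write(line)   # a slow child applies backpressure via its pipe

for child in children:
    child.stdin.close()   # child sees EOF and finishes normally
for child in children:
    child.wait()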


Solution 3

Shared lookup is the definition of a database.

Solution 3A – load a database. Let the workers process the data in the database.

Solution 3B – create a very simple server using werkzeug (or similar) to provide a WSGI application that responds to HTTP GET, so the workers can query the server.
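For solution 3A, a minimal sketch using SQLite (my choice of database); the table layout, file name and sample keys are made up for illustration:

#!/usr/bin/env python3
# Load the lookup data into SQLite once, then let every worker open its own
# connection and query it.
import sqlite3
from multiprocessing import Pool

def build_db(items, path='lookup.db'):
    con = sqlite3.connect(path)
    con.execute('CREATE TABLE IF NOT EXISTS lookup (key TEXT PRIMARY KEY)')
    con.executemany('INSERT OR IGNORE INTO lookup VALUES (?)', ((k,) for k in items))
    con.commit()
    con.close()

def worker(key, path='lookup.db'):
    con = sqlite3.connect(path)
    found = con.execute('SELECT 1 FROM lookup WHERE key = ?', (key,)).fetchone()
    con.close()
    return key, found is not None

if __name__ == '__main__':
    build_db(['a', 'b', 'c'])
    with Pool(processes=4) as pool:
        print(pool.map(worker, ['a', 'x', 'c']))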


Solution 4

Shared filesystem object. Unix offers shared-memory objects. These are just files that are mapped to memory, so that paging I/O is done instead of more conventional buffered reads.

You can do this from a Python context in several ways

  1. Write a startup program that (1) breaks your original gigantic object into smaller objects, and (2) starts workers, each with a smaller object. The smaller objects could be pickled Python objects to save a tiny bit of file reading time.

  2. Write a startup program that reads your original gigantic object and writes a page-structured, byte-coded file, using seek operations to ensure that individual sections are easy to find with simple seeks. This is what a database engine does – break the data into pages and make each page easy to locate via a seek.

Spawn workers with access to this large page-structured file. Each worker can seek to the relevant parts and do its work there.
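A minimal sketch of the memory-mapped variant: every worker maps the same file read-only, and the OS keeps a single copy of its pages in the page cache. The file name and offsets are made up for illustration:

#!/usr/bin/env python3
# Each worker maps the same file read-only; the OS shares the underlying pages
# between processes instead of duplicating them.
import mmap
from multiprocessing import Pool

def worker(offset, path='big_lookup.bin'):
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        chunk = mm[offset:offset + 16]   # seek/slice straight into the mapping
        mm.close()
    return offset, chunk

if __name__ == '__main__':
    with open('big_lookup.bin', 'wb') as f:   # stand-in for the real page-structured file
        f.write(b'\0' * (1 << 20))
    with Pool(processes=4) as pool:
        print(pool.map(worker, [0, 4096, 8192, 12288]))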

S.Lott
  • My processes aren't really filters; they're all the same, just processing different pieces of data. – Parand Mar 18 '09 at 20:20
  • They can often be structured as filters. They read their piece of data, do their work, and write their result for later processing. – S.Lott Mar 18 '09 at 20:33
  • I like your solution, but what happens with the blocking I/O? What if the parent blocks reading/writing from/to one of its children? Select does notify you that you can write, but it doesn't say how much. Same for reading. – Cristian Ciupitu Mar 18 '09 at 21:39
  • These are separate processes -- parents and children do not interfere with each other. Each byte produced at one end of a pipe is immediately available at the other end to be consumed -- a pipe is a shared buffer. Not sure what your question means in this context. – S.Lott Mar 18 '09 at 23:16
  • I can verify what S.Lott said. I needed the same operations done on a single file. So the first worker ran its function on every line with number % 2 == 0 and saved it to a file, and sent the other lines to the next piped process (which was the same script). Runtime went down by half. It's a little hacky, but the overhead is much lighter than map/pool in the multiprocessing module. – Vince Nov 30 '09 at 21:55
  • @Vince: There's a limit to that scaling, but you can try dividing the file 8 ways and see if the time goes to 1/8th. Often, it does. – S.Lott Dec 01 '09 at 00:39
  • Does solution 1 still hold for real-time data ? Won't the overhead be a problem ? – Pe Dro Oct 11 '20 at 09:23
40

Do child processes spawned via multiprocessing share objects created earlier in the program?

It depends. For global read-only variables it can often be considered shared (apart from the memory consumed); otherwise it should not be.

multiprocessing's documentation says:

Better to inherit than pickle/unpickle

On Windows many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.

Explicitly pass resources to child processes

On Unix a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.

Apart from making the code (potentially) compatible with Windows this also ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.

Global variables

Bear in mind that if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start() was called.

Example

On Windows (single CPU):

#!/usr/bin/env python
import os, sys, time
from multiprocessing import Pool

x = 23000 # use a large value: small integers are cached and share a representation
z = []    # integers are immutable, so also try a mutable object

def printx(y):
    global x
    if y == 3:
        x = -x
    z.append(y)
    print os.getpid(), x, id(x), z, id(z)
    print y
    if len(sys.argv) == 2 and sys.argv[1] == "sleep":
        time.sleep(.1) # should make the effect more apparent

if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(printx, (1,2,3,4))

With sleep:

$ python26 test_share.py sleep
2504 23000 11639492 [1] 10774408
1
2564 23000 11639492 [2] 10774408
2
2504 -23000 11639384 [1, 3] 10774408
3
4084 23000 11639492 [4] 10774408
4

Without sleep:

$ python26 test_share.py
1148 23000 11639492 [1] 10774408
1
1148 23000 11639492 [1, 2] 10774408
2
1148 -23000 11639324 [1, 2, 3] 10774408
3
1148 -23000 11639324 [1, 2, 3, 4] 10774408
4
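For comparison, on platforms that fork, a read-only global created before the pool is started is inherited by the children – the "better to inherit than pickle/unpickle" advice quoted above. A minimal sketch (Unix only; the frozenset stands in for the real data):

#!/usr/bin/env python3
# Create the read-only object before the pool, with the 'fork' start method,
# so children inherit it copy-on-write.
import multiprocessing as mp

big_lookup_object = None

def worker(key):
    return key in big_lookup_object   # read the inherited global, never modify it

if __name__ == '__main__':
    big_lookup_object = frozenset(['a', 'b', 'c'])
    ctx = mp.get_context('fork')
    with ctx.Pool(processes=4) as pool:
        print(pool.map(worker, ['a', 'x', 'c']))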
jfs
  • Huh? How is z getting shared across the processes?? – cbare Jul 14 '10 at 15:44
  • @cbare: Good question! z is in fact not shared, as the output with sleep shows. The output without sleep shows that a *single* process (PID = 1148) handles all the work; what we see in the last example is the value of z for this single process. – Eric O. Lebigot Apr 20 '12 at 08:24
  • This answer shows that `z` is not shared. This thus answers the question with: "no, under Windows at least, a parent variable is not shared between children". – Eric O. Lebigot Jun 01 '17 at 15:52
  • @EOL: technically you are correct, but in practice if the data is read-only (unlike the `z` case) it can be considered shared. – jfs Jun 15 '17 at 22:52
  • Just to clarify, the statement _Bear in mind that if code run in a child process tries to access a global variable..._ in the 2.7 docs refers to Python running under Windows. – user1071847 Oct 23 '17 at 18:42
34

S.Lott is correct. Python's multiprocessing shortcuts effectively give you a separate, duplicated chunk of memory.

On most *nix systems, using a lower-level call to os.fork() will, in fact, give you copy-on-write memory, which might be what you're thinking. AFAIK, in theory, in the most simplistic of programs possible, you could read from that data without having it duplicated.

However, things aren't quite that simple in the Python interpreter. Object data and meta-data are stored in the same memory segment, so even if the object never changes, something like a reference counter for that object being incremented will cause a memory write, and therefore a copy. Almost any Python program that is doing more than "print 'hello'" will cause reference count increments, so you will likely never realize the benefit of copy-on-write.
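The reference-count effect is easy to see in CPython; a small sketch:

#!/usr/bin/env python3
# CPython keeps the reference count inside the object itself, so even a "read-only"
# access such as binding another name writes to the object's memory page.
import sys

big = tuple(range(1000))
print(sys.getrefcount(big))   # e.g. 2: the name 'big' plus getrefcount's own argument
alias = big                   # merely taking another reference...
print(sys.getrefcount(big))   # ...bumped the count, i.e. wrote into the object's header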

Even if someone did manage to hack a shared-memory solution in Python, trying to coordinate garbage collection across processes would probably be pretty painful.

Jarret Hardie
  • Only the memory region of the ref count will be copied in that case, not necessarily the large read-only data, will it? – kawing-chiu Nov 17 '16 at 07:31
8

If you're running under Unix, they may share the same object, due to how fork works (i.e., the child processes have separate memory but it's copy-on-write, so it may be shared as long as nobody modifies it). I tried the following:

import multiprocessing

x = 23

def printx(y):
    print x, id(x)
    print y

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(printx, (1,2,3,4))

and got the following output:

$ ./mtest.py
23 22995656
1
23 22995656
2
23 22995656
3
23 22995656
4

Of course this doesn't prove that a copy hasn't been made, but you should be able to verify that in your situation by looking at the output of ps to see how much real memory each subprocess is using.
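If you'd rather check from inside the workers than eyeball ps, a minimal sketch using only the standard library (Unix only; on Linux ru_maxrss is in kilobytes, on macOS in bytes, and keep in mind RSS also counts pages shared copy-on-write):

#!/usr/bin/env python3
# Report each worker's peak resident set size.
import os
import resource
from multiprocessing import Pool

x = list(range(10 ** 6))   # stand-in for the big read-only object

def worker(i):
    _ = len(x)             # touch the inherited data
    return os.getpid(), resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        print(pool.map(worker, range(4)))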

Jacob Gabrielson
  • What about the garbage collector? What happens when it runs? Doesn't the memory layout change? – Cristian Ciupitu Mar 18 '09 at 21:42
  • That's a valid concern. Whether it would affect Parand would depend on how he's using all of this and how reliable this code has to be. If it weren't working for him I'd recommend using the mmap module for more control (assuming he wants to stick with this basic approach). – Jacob Gabrielson Mar 18 '09 at 21:54
  • I've posted an update to your example: http://stackoverflow.com/questions/659865/python-multiprocessing-sharing-a-large-read-only-object-between-processes/660468#660468 – jfs Mar 19 '09 at 01:51
  • @JacobGabrielson: The copy is made. The original question is about whether the copy is made. – abhinavkulkarni Sep 19 '13 at 01:26
3

Different processes have different address spaces, like running different instances of the interpreter. That's what IPC (interprocess communication) is for.

You can use either queues or pipes for this purpose. You can also use RPC over TCP if you want to distribute the processes over a network later.

http://docs.python.org/dev/library/multiprocessing.html#exchanging-objects-between-processes
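A minimal sketch of the pipe variant, along the lines of the "Exchanging objects between processes" section linked above (the payload dict is illustrative):

#!/usr/bin/env python3
# Exchange a Python object between two processes over a Pipe.
from multiprocessing import Process, Pipe

def child(conn):
    conn.send({'status': 'done', 'count': 42})
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=child, args=(child_conn,))
    p.start()
    print(parent_conn.recv())
    p.join()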

Vasil
  • I don't think IPC would be appropriate for this; this is read-only data that everybody needs access to. No sense passing it around between processes; at worst each can read its own copy. I'm attempting to save memory by not having a separate copy in each process. – Parand Mar 18 '09 at 20:21
  • You can have a master process delegating pieces of data to work on to the other slave processes. Either the slaves can ask for data or the master can push it. This way not every process will have a copy of the whole object. – Vasil Mar 18 '09 at 20:39
  • @Vasil: What if each process needs the whole data set, and is just running a different operation on it? – Will Jun 02 '13 at 22:55
  • I must add that transferring huge amounts of data can kill the parallelization performance. Even if it's not part of the OP's question, the same point is valid for work of a volatile nature, where atomicity is not needed. – RomuloPBenedetti Feb 20 '22 at 01:16
1

No, but you can load your data in a child process and allow it to share its data with the other children via a queue. See below.

import time
import multiprocessing

def load_data(queue_load, n_processes):

    some_variable = ...  # load data here into some_variable

    """
    Store multiple copies of the data into
    the data queue. There needs to be enough
    copies available for each process to access.
    """

    for i in range(n_processes):
        queue_load.put(some_variable)


def work_with_data(queue_data, queue_load):

    # Wait for load_data() to complete
    while queue_load.empty():
        time.sleep(1)

    some_variable = queue_load.get()

    """
    ! Tuples can also be used here
    if you have multiple data files
    you wish to keep separate.
    a, b = queue_load.get()
    """

    new_data = ...  # do some stuff with some_variable, resulting in new_data

    # store it in the queue
    queue_data.put(new_data)


def start_multiprocess():

    n_processes = 5

    processes = []
    stored_data = []

    # Create two Queues
    queue_load = multiprocessing.Queue()
    queue_data = multiprocessing.Queue()

    for i in range(n_processes):

        if i == 0:
            # Your big data file will be loaded here...
            p = multiprocessing.Process(target=load_data,
                                        args=(queue_load, n_processes))
            processes.append(p)
            p.start()

        # ... and then it will be used here with each process
        p = multiprocessing.Process(target=work_with_data,
                                    args=(queue_data, queue_load))
        processes.append(p)
        p.start()

    for i in range(n_processes):
        new_data = queue_data.get()
        stored_data.append(new_data)

    for p in processes:
        p.join()
    print(processes)


if __name__ == '__main__':
    start_multiprocess()
1

Not directly related to multiprocessing per se, but from your example, it would seem you could just use the shelve module or something like that. Does the "big_lookup_object" really have to be completely in memory?
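A minimal sketch of that idea, with made-up file name and keys: build the shelf once on disk, then have every worker open it read-only (the shelve docs say multiple simultaneous read accesses are safe):

#!/usr/bin/env python3
# Keep the lookup on disk via shelve instead of holding it all in memory.
import shelve
from multiprocessing import Pool

def build(path='big_lookup.shelve'):
    with shelve.open(path) as db:
        db['some_key'] = 'some_value'

def worker(key, path='big_lookup.shelve'):
    with shelve.open(path, flag='r') as db:
        return key in db

if __name__ == '__main__':
    build()
    with Pool(processes=4) as pool:
        print(pool.map(worker, ['some_key', 'missing_key']))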

  • Good point, I haven't directly compared performance of on-disk vs. in memory. I had assumed there would be a big difference, but I haven't actually tested. – Parand Mar 18 '09 at 23:57
-3

For Linux/Unix/macOS platforms, forkmap is a quick-and-dirty solution.

Maxim Imakaev