
How do I give child processes access to data in shared memory if the data is only available after the child processes have been spawned (using multiprocessing.Process)?

I am aware of multiprocessing.sharedctypes.RawArray, but I can't figure out how to give my child processes access to a RawArray that is created after the processes have already started.

The data is generated by the parent process, and the amount of data is not known in advance.

If not for the GIL I'd be using threading instead, which would make this task a little simpler. Using a non-CPython implementation is not an option.


Looking under the hood of multiprocessing.sharedctypes, it looks like shared ctypes objects are allocated using mmap()ed memory.

So this question really boils down to: Can a child process access anonymously mapped memory if mmap() was called by the parent after the child process was spawned?

That's somewhat in the vein of what's being asked in this question, except that in my case the caller of mmap() is the parent process and not the child process.
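To make the timing issue concrete, here's a minimal sketch (Python 3 syntax; Linux/macOS only, since it uses os.fork() directly rather than multiprocessing): an anonymous mapping created before the fork is inherited and shared with the child, while one created afterwards lives only in the parent and is unreachable from the already-running child.

```python
import mmap
import os

# An anonymous mapping created BEFORE forking is inherited by the child.
inherited = mmap.mmap(-1, 4)

pid = os.fork()
if pid == 0:
    # child: write into the mapping it inherited from the parent
    inherited[0:4] = b"ping"
    os._exit(0)
os.waitpid(pid, 0)
print(inherited[0:4])  # the parent sees the child's write

# A mapping created AFTER the fork exists only in the parent's address
# space; the already-running child has no handle with which to reach it.
orphaned = mmap.mmap(-1, 4)
```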


(Solved)

I created my own version of RawArray that uses shm_open() under the hood. The resulting shared ctypes array can be shared with any process as long as the identifier (tag) matches.

See this answer for details and an example.

  • Can't you *start* your processes with the message container (sequence, `RawArray`, whatever) as argument? Although initially empty, it will be passed as reference (not sure here) and the processes should be able to read (and write) it... Or am I mistaken? – eudoxos Sep 14 '11 at 16:19
  • The number of elements is not known in advance so I can't create the container yet (unless I create one that's obscenely huge to cater for all possibilities). – Shawn Chin Sep 14 '11 at 16:28
  • if you create `[]`, you can resize it as you wish later, while it will still be the same object... am I missing something? – eudoxos Sep 14 '11 at 16:30
  • 1
    That won't work across processes. – Shawn Chin Sep 14 '11 at 16:32
  • Why wouldn't you start with basic IPC like files or pipes/socketpairs? – Maxim Egorushkin Sep 15 '11 at 11:21
  • @Maxim I'm trying to give all procs access to a large in-memory data structure. Sending the data to procs using IPC means each proc will end up with its own copy and blow my memory capacity. – Shawn Chin Sep 15 '11 at 12:30
  • You would still use the shared memory, but communicate to the child process that the data has become available through the pipe. This is a bit easier than juggling mutexes shared between the processes. – Maxim Egorushkin Sep 15 '11 at 12:59
  • @Maxim The question here is how to give the child processes access to shared memory, given that the size of the memory is only known long after the processes have already started. – Shawn Chin Sep 15 '11 at 13:04

3 Answers


Disclaimer: I am the author of the question.

I eventually used the posix_ipc module to create my own version of RawArray. I used mainly posix_ipc.SharedMemory which calls shm_open() under the hood.

My implementation (ShmemRawArray) exposes the same functionality as RawArray but requires two additional parameters: a tag to uniquely identify the shared memory region, and a create flag to determine whether we should create a new shared memory segment or attach to an existing one.

Here's a copy if anyone's interested: https://gist.github.com/1222327

ShmemRawArray(typecode_or_type, size_or_initializer, tag, create=True)

Usage notes:

  • The first two args (typecode_or_type and size_or_initializer) should work the same as with RawArray.
  • The shared array is accessible by any process, as long as tag matches.
  • The shared memory segment is unlinked when the origin object (returned by ShmemRawArray(..., create=True)) is deleted.
  • Creating a shared array using a tag that already exists will raise an ExistentialError.
  • Accessing a shared array using a tag that doesn't exist (or one that has been unlinked) will also raise an ExistentialError.

An SSCCE (Short, Self Contained, Compilable Example) showing it in action.

#!/usr/bin/env python2.7
import ctypes
import multiprocessing
from random import random, randint
from shmemctypes import ShmemRawArray

class Point(ctypes.Structure):
    _fields_ = [ ("x", ctypes.c_double), ("y", ctypes.c_double) ]

def worker(q):
    # get access to ctypes array shared by parent
    count, tag = q.get()
    shared_data = ShmemRawArray(Point, count, tag, False)

    proc_name = multiprocessing.current_process().name
    print proc_name, ["%.3f %.3f" % (d.x, d.y) for d in shared_data]

if __name__ == '__main__':
    procs = []
    np = multiprocessing.cpu_count()
    queue = multiprocessing.Queue()

    # spawn child processes
    for i in xrange(np):
        p = multiprocessing.Process(target=worker, args=(queue,))
        procs.append(p)
        p.start()

    # create a unique tag for shmem segment
    tag = "stack-overflow-%d" % multiprocessing.current_process().pid

    # random number of points with random data
    count = randint(3,10) 
    combined_data = [Point(x=random(), y=random()) for i in xrange(count)]

    # create ctypes array in shared memory using ShmemRawArray
    # - we won't be able to use multiprocessing.sharedctypes.RawArray here
    #   because children already spawned
    shared_data = ShmemRawArray(Point, combined_data, tag)

    # give children info needed to access ctypes array
    for p in procs:
        queue.put((count, tag))

    print "Parent", ["%.3f %.3f" % (d.x, d.y) for d in shared_data]
    for p in procs:
        p.join()

Running this results in the following output:

[me@home]$ ./shmem_test.py
Parent ['0.633 0.296', '0.559 0.008', '0.814 0.752', '0.842 0.110']
Process-1 ['0.633 0.296', '0.559 0.008', '0.814 0.752', '0.842 0.110']
Process-2 ['0.633 0.296', '0.559 0.008', '0.814 0.752', '0.842 0.110']
Process-3 ['0.633 0.296', '0.559 0.008', '0.814 0.752', '0.842 0.110']
Process-4 ['0.633 0.296', '0.559 0.008', '0.814 0.752', '0.842 0.110']

Your problem sounds like a perfect fit for the posix_ipc or sysv_ipc modules, which expose either the POSIX or SysV APIs for shared memory, semaphores, and message queues. The feature matrix there includes excellent advice for picking amongst the modules the author provides.

The problem with anonymous mmap(2) areas is that you cannot easily share them with other processes -- if they were file-backed, it'd be easy, but if you don't actually need the file for anything else, it feels silly. You could use the CLONE_VM flag to the clone(2) system call if this were in C, but I wouldn't want to try using it with a language interpreter that probably makes assumptions about memory safety. (It'd be a little dangerous even in C, as maintenance programmers five years from now might also be shocked by the CLONE_VM behavior.)

But the SysV and newer POSIX shared memory mappings allow even unrelated processes to attach and detach from shared memory by identifier, so all you need to do is share the identifier from the processes that create the mappings with the processes that consume the mappings, and then when you manipulate data within the mappings, they are available to all processes simultaneously without any additional parsing overhead. The shm_open(3) function returns an int that is used as a file descriptor in later calls to ftruncate(2) and then mmap(2), so other processes can use the shared memory segment without a file being created in the filesystem -- and this memory will persist even if all processes using it have exited. (A little strange for Unix, perhaps, but it is flexible.)
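For what it's worth, this shm_open() attach-by-identifier pattern is exactly what Python's later multiprocessing.shared_memory module (added in 3.8) implements, so readers on a modern interpreter get it out of the box. A rough sketch (the segment name here is invented for the example; both handles below could just as well live in unrelated processes):

```python
from multiprocessing import shared_memory

# create a named segment (shm_open + ftruncate + mmap under the hood)
creator = shared_memory.SharedMemory(create=True, size=16, name="so7419159")
creator.buf[:5] = b"hello"

# any process that knows the name can attach to the same memory
attached = shared_memory.SharedMemory(name="so7419159")
data = bytes(attached.buf[:5])
print(data)

attached.close()
creator.close()
creator.unlink()  # remove the segment from the system
```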

  • Aha! That sounds doable. I'll give it a go and get back to you. Thanks! (I've been looking for a python module that exposes `shm_open` and friends, and all I found was [shm](http://nikitathespider.com/python/shm/) which looked a little stale). – Shawn Chin Sep 16 '11 at 07:17
  • Your post led me to [a solution](http://stackoverflow.com/questions/7419159/giving-access-to-shared-memory-after-child-processes-have-already-started/7447103#7447103). Many thanks. – Shawn Chin Sep 16 '11 at 15:39
  • @sarnold - instead of clone() I would strongly prefer the use of minherit for mmaped areas – Good Person Mar 22 '13 at 02:42

I think you are looking for the mmap module.

Concerning the serialization of the data, this question's answers cover it. If you hope to avoid the copy, however, I do not have a solution.

EDIT

In fact, you can use the non-stdlib _multiprocessing module in CPython 3.2 to get the address of the mmap object's buffer and use it with from_address of a ctypes object; this is in fact what RawArray does. Of course, you should not try to resize the mmap object, as the address of the mapping may change in that case.

import mmap
import _multiprocessing
from ctypes import Structure, c_int

map = mmap.mmap(-1, 4)

class A(Structure):
    _fields_ = [("x", c_int)]

addr, size = _multiprocessing.address_of_buffer(map)
b = A.from_address(addr)
b.x = 256

>>> map[0:4]
'\x00\x01\x00\x00'

To expose the memory after the child is created, you have to back your memory with a real file, that is, call:

map = mmap.mmap(open("hello.txt", "r+b").fileno(),4)
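To illustrate why the file-backed variant is shareable where the anonymous one is not: two independent mmap() calls against the same file observe each other's writes. A small sketch (Python 3 syntax, using a temp file rather than hello.txt):

```python
import mmap
import os
import tempfile

# create a small real file to back the mapping
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4)

m1 = mmap.mmap(fd, 4)  # one mapping (say, the parent's)
m2 = mmap.mmap(fd, 4)  # an independent mapping of the same file
m1[0:4] = b"\x00\x01\x00\x00"

seen = m2[0:4]         # the write is visible through the other mapping
print(seen)

m1.close()
m2.close()
os.close(fd)
os.unlink(path)
```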
  • Unless I'm mistaken, that would require me to serialise my data and then load it in the child process. That's what I'm trying to avoid. – Shawn Chin Sep 14 '11 at 16:08
  • As I understand the problem you try to avoid broadcasting the messages not really avoid serialization – Xavier Combelle Sep 14 '11 at 18:23
  • I'm concerned about performance and memory usage. Having to deserialise the same data on all procs does not sound appealing, hence my comment about serialisation. – Shawn Chin Sep 14 '11 at 18:27
  • Thanks for the update Xavier. I'm afraid I don't understand how implementing my own mmap module is going to help in my case. – Shawn Chin Sep 15 '11 at 11:03
  • re your updates. That's pretty much what's being done under the hood of `multiprocessing.sharedctypes.RawArray`, and that all works as advertised. I'm trying to expose shared memory to child processes _after_ they have already been spawned. – Shawn Chin Sep 15 '11 at 14:09