
I have a large, read-only bytes object that I need to operate on from several different Python 3 processes, with each one "returning" (adding to a result queue) a list of results based on its work.

Since this object is very large and read-only, I'd like to avoid copying it into the address space of each worker process. The research I've done suggests that shared memory is the right way to go about this, but I couldn't find a good resource/example of how exactly to do this with the multiprocessing module.
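For concreteness, here's a rough sketch of the shape of the program (the file name, worker count, and per-byte check are placeholders for my real code). Passing `big_bytes` as an argument like this pickles a full copy into every worker, which is exactly what I'd like to avoid:

import multiprocessing

def worker(big_bytes, start, end, result_queue):
    # Placeholder work: each process scans its slice of the read-only data
    # and pushes a list of results onto the shared queue.
    result_queue.put([i for i in range(start, end) if big_bytes[i] == 0xFF])

if __name__ == '__main__':
    big_bytes = open('huge_input.bin', 'rb').read()  # several GB, read-only
    result_queue = multiprocessing.Queue()
    n = len(big_bytes)
    procs = [multiprocessing.Process(target=worker,
                                     args=(big_bytes, i * n // 4, (i + 1) * n // 4, result_queue))
             for i in range(4)]
    for p in procs:
        p.start()
    results = [result_queue.get() for _ in procs]
    for p in procs:
        p.join()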

Thanks in advance.

yossarian
  • What OS are you using? – zwer Jul 26 '17 at 14:43
  • Linux (Ubuntu LTS). An ideal solution would work across Windows as well, but that can be sacrificed if necessary. – yossarian Jul 26 '17 at 14:45
  • Then just load your data and access it from the global namespace of your main process - on POSIX/fork-enabled systems `multiprocessing` just forks the current process, so you get the copy-on-write benefits (a sketch of this appears after these comments). Just make sure you don't do anything to modify that data, because at that point it will be copied into your sub-process's address space. – zwer Jul 26 '17 at 14:53
  • Thanks for the tip. I saw from some other SO questions that I can take advantage of CoW, *until* the Python runtime itself updates any metadata associated with the object (i.e., even if I don't modify the object itself). Is that a practical concern? – yossarian Jul 26 '17 at 15:02
  • That depends on the data... While there are a few scenarios that I know of, chances are that standard CPython won't be inclined to mess with a statically accessed string/bytes structure initialized early on - I'd just avoid hard-slicing if you need large chunks of the data later and use ranged iterators instead. – zwer Jul 26 '17 at 15:52
  • Good to know. I'll give that a shot, thanks! – yossarian Jul 26 '17 at 16:04
  • @zwer I've often seen changes in refcount cause an object to get (at least partially) paged in (since it is stored on the object) -- might need a more clever solution – anthony sottile Jul 26 '17 at 16:27
  • @AnthonySottile - true, refcount will mess up the object (thanks Python designers) although for simple objects it will only copy one page of memory and leave the rest intact. Also, if the object is accessed exactly once the refcount won't be copied at all (because it has the same number of references in the main process). `multiprocessing.Array` can help with that by keeping the refcount and the data separate, but it will still get copied with a hard slice. – zwer Jul 26 '17 at 16:45
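A minimal sketch of the fork/copy-on-write approach described in the comments above (POSIX only; the file name, worker count, and per-byte check are illustrative):

import multiprocessing

# Loaded at module level in the parent, before any workers are created, so
# that forked children inherit the pages via copy-on-write instead of
# receiving a pickled copy.
BIG_BYTES = open('huge_input.bin', 'rb').read()

def worker(bounds):
    start, end = bounds
    # Read-only access to the inherited global; the pages stay shared as
    # long as nothing writes to them.
    return [i for i in range(start, end) if BIG_BYTES[i] == 0xFF]

if __name__ == '__main__':
    n = len(BIG_BYTES)
    chunks = [(i * n // 4, (i + 1) * n // 4) for i in range(4)]
    with multiprocessing.Pool(4) as pool:  # the default start method on Linux is fork
        results = pool.map(worker, chunks)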

1 Answer


You can use a `multiprocessing.Array`, which is like a `ctypes.Array` but allocated in shared memory, when given a `ctypes` type:

# No lock needed, as no writes will be done.
# `long_byte_string` here is your large, read-only bytes object.
array = multiprocessing.Array(ctypes.c_char, long_byte_string, lock=False)

For example:

>>> import multiprocessing
>>> import ctypes
>>> array = multiprocessing.Array(ctypes.c_char, b'\x01\x02\xff\xfe', lock=False)
>>> array[0]
b'\x01'
>>> array[2:]
b'\xff\xfe'
>>> array[:]
b'\x01\x02\xff\xfe'
>>> b'\xff' in array
True
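To connect this back to the question, here is a sketch of how the shared array might be handed to worker processes alongside a result queue (the file name, worker count, and per-byte check are placeholders). The array can be passed as a `Process` argument, so the children should attach to the same shared-memory block rather than receive a copy of the data:

import ctypes
import multiprocessing

def worker(shared, start, end, result_queue):
    # `shared` indexes directly into the shared-memory block created by the
    # parent; indexing a c_char array yields length-1 bytes objects.
    result_queue.put([i for i in range(start, end) if shared[i] == b'\xff'])

if __name__ == '__main__':
    long_byte_string = open('huge_input.bin', 'rb').read()
    shared = multiprocessing.Array(ctypes.c_char, long_byte_string, lock=False)
    result_queue = multiprocessing.Queue()
    n = len(shared)
    procs = [multiprocessing.Process(target=worker,
                                     args=(shared, i * n // 2, (i + 1) * n // 2, result_queue))
             for i in range(2)]
    for p in procs:
        p.start()
    results = [result_queue.get() for _ in procs]
    for p in procs:
        p.join()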
Artyer
  • Thanks for the response. Do you happen to know if `array[:]` accesses an internal (shared) representation, or instantiates the returned `bytes` within the child process? – yossarian Jul 26 '17 at 15:08
  • @woodruffw The internal representation is a `multiprocessing.sharedctypes.c_char_Array_4`. Even slicing a regular bytes object creates a new object (i.e. `bytestring[1:] is not bytestring`). I assume it has to create a new bytes object to access the whole bytestring, but if you just take parts, it should not use as much memory. But this avoids pipes and things, as it is read from shared memory. – Artyer Jul 26 '17 at 15:16
  • Thanks again. Unfortunately I need the entire object all at once, so I can't do much with the shared `Array` instance unless the internal representation is something I can use without creating new copies. – yossarian Jul 26 '17 at 15:27
  • @woodruffw I don't think it's possible to share Python objects across processes, as processes have independent memory spaces. You can have (different) small objects that just refer to the same memory (like this one - think of it as a proxy), but you can't have two objects with the same memory location in different processes. – Artyer Jul 26 '17 at 15:32
  • Yeah, that's what I'm realizing. Thanks anyways. – yossarian Jul 26 '17 at 15:47
  • It might be possible to share read-only data across processes using `memmap`. See for instance [`joblib`](https://pythonhosted.org/joblib/), which memmaps large numpy arrays to share them across processes without duplicating them. – Thomas Moreau Jul 27 '17 at 16:12
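For completeness, a minimal standard-library sketch of the mmap idea from the last comment, assuming the data can live in (or be written out to) a file. Every worker maps the same file read-only, so the OS shares the physical pages between processes; the file name, worker count, and per-byte check are placeholders:

import mmap
import multiprocessing
import os

def worker(path, start, end, result_queue):
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Indexing an mmap in Python 3 yields ints, just like bytes.
            result_queue.put([i for i in range(start, end) if mm[i] == 0xFF])

if __name__ == '__main__':
    path = 'huge_input.bin'
    size = os.path.getsize(path)
    result_queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker,
                                     args=(path, i * size // 4, (i + 1) * size // 4, result_queue))
             for i in range(4)]
    for p in procs:
        p.start()
    results = [result_queue.get() for _ in procs]
    for p in procs:
        p.join()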