101

I have a 60GB SciPy array (matrix) that I must share between 5+ multiprocessing Process objects. I've seen numpy-sharedmem and read this discussion on the SciPy list. There seem to be two approaches: numpy-sharedmem, or using a multiprocessing.RawArray() and mapping NumPy dtypes to ctypes. numpy-sharedmem seems to be the way to go, but I've yet to see a good reference example. I don't need any kind of locks, since the array (actually a matrix) will be read-only. Due to its size, I'd like to avoid a copy. It sounds like the correct method is to create the only copy of the array as a sharedmem array and then pass it to the Process objects? A couple of specific questions:

  1. What's the best way to actually pass the sharedmem handles to the sub-Process()es? Do I need a queue just to pass one array around? Would a pipe be better? Can I just pass it as an argument to the Process() subclass's `__init__` (where I'm assuming it gets pickled)?

  2. In the discussion I linked above, there's mention of numpy-sharedmem not being 64-bit safe? I'm definitely using some structures that aren't 32-bit addressable.

  3. Are there tradeoffs to the RawArray() approach? Slower? Buggier?

  4. Do I need any ctype-to-dtype mapping for the numpy-sharedmem method?

  5. Does anyone have an example of some open-source code doing this? I'm a very hands-on learner and it's hard to get this working without a good example to look at.

If there's any additional info I can provide to help clarify this for others, please comment and I'll add. Thanks!

This needs to run on Ubuntu Linux and maybe Mac OS, but portability isn't a huge concern.

DavidW
Will
  • 1
    If the different processes are going to write to that array, expect `multiprocessing` to make a copy of the whole thing for each process. – tiago Jul 22 '13 at 11:33
  • 3
    @tiago: "I don't need any kind of locks, since the array (actually a matrix) will be read-only" – Dr. Jan-Philip Gehrcke Jul 22 '13 at 12:04
  • 1
    @tiago: also, multiprocessing is not making a copy as long as not explicitly told to (via arguments to the `target_function`). The operating system is going to copy parts of the parent's memory to the child's memory space upon modification only. – Dr. Jan-Philip Gehrcke Jul 22 '13 at 12:11
  • 3
    here's [a RawArray-based example that should work both on *nix and Windows, and it also supports writing to the array](http://stackoverflow.com/a/7908612/4279). – jfs Jul 24 '13 at 10:19
  • I asked a [few](https://stackoverflow.com/questions/37705974/why-are-multiprocessing-sharedctypes-assignments-so-slow) [questions](https://stackoverflow.com/questions/32566404/assertion-error-when-using-multiprocessing-in-python-3-4) about this before. My solution can be found here: https://github.com/david-hoffman/peaks/blob/3a2c1c39119388a8996cdd510524a48ee5d7907d/stackanalysis.py#L85-L108 (sorry the code is a disaster). – David Hoffman Jul 31 '20 at 05:58

6 Answers

44

If you are on Linux (or any POSIX-compliant system), you can define this array as a global variable. On Linux, multiprocessing uses fork() when it starts a new child process. The forked child process automatically shares the parent's memory as long as it does not change it (the copy-on-write mechanism).

Since you say "I don't need any kind of locks, since the array (actually a matrix) will be read-only", taking advantage of this behavior would be a very simple yet extremely efficient approach: all child processes will access the same data in physical memory when reading this large NumPy array.

Don't hand your array to the Process() constructor; that instructs multiprocessing to pickle the data to the child, which would be extremely inefficient or impossible in your case. On Linux, right after fork() the child is an exact copy of the parent using the same physical memory, so all you need to do is make sure that the Python variable 'containing' the matrix is accessible from within the target function that you hand over to Process(). You can typically achieve this with a 'global' variable.

Example code:

from multiprocessing import Process
from numpy import random


# Created at module level, before the fork: every child inherits this array
# via copy-on-write (this relies on the fork start method used on Linux), so
# the data is never duplicated as long as it is only read.
global_array = random.random(10**4)


def child():
    # The child reads the inherited array directly; nothing is pickled.
    print(sum(global_array))


def main():
    processes = [Process(target=child) for _ in range(10)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()


if __name__ == "__main__":
    main()

On Windows, which does not support fork(), multiprocessing uses the Win32 API call CreateProcess. It creates an entirely new process from any given executable. That's why on Windows one is required to pickle data to the child if one needs data that has been created during runtime of the parent.
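
A minimal sketch of what that means in practice (the names here are purely illustrative): under the spawn start method that Windows uses, anything the child needs from the parent has to be passed as an argument and is serialized on the way.

from multiprocessing import Process
import numpy as np


def child(data):
    # Under the spawn start method, `data` arrives as a deserialized copy.
    print(data.sum())


if __name__ == "__main__":
    arr = np.random.random(10**4)
    # The array is pickled and sent to the child process; workable for small
    # arrays, prohibitive for a 60 GB matrix.
    p = Process(target=child, args=(arr,))
    p.start()
    p.join()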

Dr. Jan-Philip Gehrcke
  • Thanks, but doesn't reference counting force a copy? I know how copy-on-write works in a C program, but how can I guarantee there are no "writes"? I feel like I need some kind of shared memory here. – Will Jul 22 '13 at 18:17
  • 4
    Copy-on-write will copy the page containing the reference counter (so each forked python will have its own reference counter) but it will not copy the whole data array. – robince Jul 22 '13 at 20:50
  • 1
    I would add that I've had more success with module level variables than with global variables... ie add the variable to a module in global scope before the fork – robince Jul 22 '13 at 20:51
  • 7
    A word of caution to people stumbling across this question/answer: If you happen to be using OpenBLAS-linked Numpy for its multithreaded operation, make sure to disable its multithreading (export OPENBLAS_NUM_THREADS=1) when using `multiprocessing` or child processes might end up hanging (typically using 1/n of _one_ processor rather than n processors) upon performing linear algebra operations on a shared global array/matrix. The [known multithreaded conflict with OpenBLAS](https://github.com/xianyi/OpenBLAS/wiki/faq#wiki-multi-threaded) seems to extend to Python `multiprocessing`. – Dologan Jul 31 '13 at 18:27
  • 1
    Can anyone explain why python wouldn't just use OS `fork` to pass parameters given to `Process`, instead of serializing them? That is, couldn't the `fork` be applied to the parent process just *before* `child` is called, so that the parameter value is still available from the OS? Would seem to be more efficient than serializing it? – max Mar 22 '15 at 20:13
  • 1
    @max providing a common interface on top of very different platforms (Windows and POSIX are *very* different in terms of process creation) requires making compromises. Here, I guess, the compromise is to use the *same* method of parameter transfer on both platforms by default, for better maintainability and for ensuring equal behavior. Nevertheless, if you are aware of it, you can take advantage of POSIX fork() yourself, as I proposed in this answer. This is not difficult at all, so there is no particular "need" for multiprocessing to support it. – Dr. Jan-Philip Gehrcke Mar 23 '15 at 10:05
  • 1
    @Jan-PhilipGehrcke python only supports the `fork` method under POSIX, so compatibility isn't an issue. – max Mar 23 '15 at 10:47
  • 1
    @max I guess you do not really know what you are writing about, and now I am entirely confused and cannot really help. I try anyway: `fork()` is a system call, which is something that the operating system has to provide. Python, or any other application, can *use* it. Windows does not provide `fork()`. So, *"python only supports the fork method under POSIX"* is half wrong and half obvious, and *"compatibility isn't an issue"* tells me nothing about what you are struggling with now. It *is* an issue, considering that Windows does not support `fork()`. – Dr. Jan-Philip Gehrcke Mar 23 '15 at 14:26
  • 1
    @Jan-PhilipGehrcke I meant that I'm asking about `'fork'`, which is one of several start methods supported by python's `multiprocessing.Process`, and which is documented as only supported under POSIX; you can't use it on Windows at all. So Python is not constrained by any cross-platform issues when using `'fork'` start method. My question was why in this situation, Python wouldn't call the `target` function and then `os.fork()` instead of `fork`ing before the call and therefore serializing the argument to `target`. – max Mar 23 '15 at 18:06
  • 2
    We are all aware of that `fork()` is not available on Windows, it has been stated in my answer and multiple times in the comments. I know that this was your initial question, and I answered it four comments above *this*: "the compromise is to use the same method of parameter transfer on both platforms by default, for better maintainability and for ensuring equal behavior.". Both ways have their advantages and disadvantages, which is why in Python 3 there is greater flexibility for the user to choose the method. This discussion is not productive w/o talking details, which we should not do here. – Dr. Jan-Philip Gehrcke Mar 23 '15 at 18:49
  • 2
    What about the `global` keyword, is it required? – CMCDragonkai Apr 24 '17 at 14:54
31

@Velimir Mlaker gave a great answer. I thought I'd add a few comments and a tiny example.

(I couldn't find much documentation on sharedmem - these are the results of my own experiments.)

  1. Do you need to pass the handles when the subprocess is starting, or after it has started? If it's just the former, you can just use the target and args arguments for Process. This is potentially better than using a global variable.
  2. From the discussion page you linked, it appears that support for 64-bit Linux was added to sharedmem a while back, so it could be a non-issue.
  3. I don't know about this one.
  4. No. See the example below.

Example

#!/usr/bin/env python
from multiprocessing import Process
import sharedmem
import numpy

def do_work(data, start):
    # Each worker zeroes one element of the shared array.
    data[start] = 0

def split_work(num):
    n = 20
    width = n // num  # integer stride between the indices assigned to workers
    shared = sharedmem.empty(n)
    shared[:] = numpy.random.rand(1, n)[0]
    print "values are %s" % shared

    processes = [Process(target=do_work, args=(shared, i*width)) for i in xrange(num)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print "values are %s" % shared
    print "type is %s" % type(shared[0])

if __name__ == '__main__':
    split_work(4)

Output

values are [ 0.81397784  0.59667692  0.10761908  0.6736734   0.46349645  0.98340718
  0.44056863  0.10701816  0.67167752  0.29158274  0.22242552  0.14273156
  0.34912309  0.43812636  0.58484507  0.81697513  0.57758441  0.4284959
  0.7292129   0.06063283]
values are [ 0.          0.59667692  0.10761908  0.6736734   0.46349645  0.
  0.44056863  0.10701816  0.67167752  0.29158274  0.          0.14273156
  0.34912309  0.43812636  0.58484507  0.          0.57758441  0.4284959
  0.7292129   0.06063283]
type is <type 'numpy.float64'>

This related question might be useful.

James Lim
25

You may be interested in a tiny piece of code I wrote: github.com/vmlaker/benchmark-sharedmem

The only file of interest is main.py. It's a benchmark of numpy-sharedmem: the code simply passes arrays (either NumPy or sharedmem) to spawned processes, via Pipe. The workers just call sum() on the data. I was only interested in comparing the data communication times between the two implementations.

I also wrote another, more complex code: github.com/vmlaker/sherlock.

Here I use the numpy-sharedmem module for real-time image processing with OpenCV; the images are NumPy arrays, as per OpenCV's newer cv2 API. The images, or rather references to them, are shared between processes via the dictionary object created from multiprocessing.Manager (as opposed to using Queue or Pipe). I'm getting great performance improvements compared with using plain NumPy arrays.

Pipe vs. Queue:

In my experience, IPC with Pipe is faster than Queue. And that makes sense, since Queue adds locking to make it safe for multiple producers/consumers. Pipe doesn't. But if you only have two processes talking back-and-forth, it's safe to use Pipe, or, as the docs read:

... there is no risk of corruption from processes using different ends of the pipe at the same time.
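
For reference, a minimal sketch of that Pipe pattern (illustrative only, not the actual code from main.py): the parent sends the array down one end of the pipe and the worker replies with sum() of the data.

from multiprocessing import Process, Pipe
import numpy as np


def worker(conn):
    # Receive the array from the parent and reply with its sum.
    data = conn.recv()
    conn.send(data.sum())
    conn.close()


if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send(np.random.random(10**6))  # the array is pickled through the pipe
    print(parent_end.recv())
    p.join()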

sharedmem safety:

The main issue with the sharedmem module is the possibility of a memory leak upon ungraceful program exit. This is described in a lengthy discussion here. Although on Apr 10, 2011 Sturla mentions a fix for the memory leak, I have still experienced leaks since then, using both repos: Sturla Molden's own on GitHub (github.com/sturlamolden/sharedmem-numpy) and Chris Lee-Messer's on Bitbucket (bitbucket.org/cleemesser/numpy-sharedmem).

Velimir Mlaker
  • Thanks, very very informative. The memory leak in `sharedmem` sounds like a big deal, though. Any leads on solving that? – Will Jul 29 '13 at 05:18
  • 1
    Beyond just noticing the leaks, I haven't looked for it in the code. I added to my answer, under "sharedmem safety" above, keepers of the two open source repos of the ``sharedmem`` module, for reference. – Velimir Mlaker Jul 29 '13 at 17:54
18

If your array is that big you can use numpy.memmap. For example, if you have an array stored on disk, say 'test.array', you can use simultaneous processes to access its data even in "writing" mode, but your case is simpler since you only need "reading" mode.

Creating the array:

a = np.memmap('test.array', dtype='float32', mode='w+', shape=(100000,1000))

You can then fill this array in the same way you do with an ordinary array. For example:

a[:10,:100]=1.
a[10:,100:]=2.

The data is flushed to disk when you delete the variable `a` (or when you call `a.flush()`).

Later on you can use multiple processes that will access the data in test.array:

# read-only mode
b = np.memmap('test.array', dtype='float32', mode='r', shape=(100000,1000))

# read-write mode
c = np.memmap('test.array', dtype='float32', mode='r+', shape=(100000,1000))
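
As a rough sketch (the function name and split are illustrative, and it assumes the test.array file created above), several workers can each map the file read-only and reduce over their own block of rows; the operating system's page cache is shared, so no per-process copy of the data is made:

from multiprocessing import Process
import numpy as np


def partial_sum(filename, shape, row_start, row_stop):
    # Each worker opens its own read-only memmap of the same file.
    m = np.memmap(filename, dtype='float32', mode='r', shape=shape)
    print(m[row_start:row_stop].sum())


if __name__ == "__main__":
    shape = (100000, 1000)
    step = shape[0] // 4
    workers = [Process(target=partial_sum,
                       args=('test.array', shape, i * step, (i + 1) * step))
               for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()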

Saullo G. P. Castro
  • So in this case, all processes will be able to access the same `np.memmap` object without replication, and without having to somehow pass the object? – Ataxias Jan 09 '21 at 07:41
3

You may also find it useful to take a look at the documentation for Pyro: if you can partition your task appropriately, you could use it to execute different sections on different machines, as well as on different cores in the same machine.

Steve Barnes
0

Why not use multithreading? Resources of the main process can be shared natively by its threads, so multithreading is an obvious way to share objects owned by the main process.

If you are worried about Python's GIL, you can resort to Numba's nogil option.
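
As a minimal sketch of that idea (assuming Numba is installed; the function and sizes below are only illustrative): a function compiled with nogil=True releases the GIL, so a thread pool can work on slices of the shared array in parallel.

from concurrent.futures import ThreadPoolExecutor

import numpy as np
from numba import njit


@njit(nogil=True)
def slice_sum(arr, start, stop):
    # Compiled code that runs without holding the GIL.
    total = 0.0
    for i in range(start, stop):
        total += arr[i]
    return total


data = np.random.random(10**7)  # stand-in for the large read-only array
step = data.size // 4
chunks = [(i * step, (i + 1) * step) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(lambda c: slice_sum(data, c[0], c[1]), chunks))
print(sum(partials))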

Nico