python3 multiprocess shared numpy array(read-only)

Question

I'm not sure if this title is appropriate for my situation: the reason why I want to share numpy array is that it might be one of the potential solutions to my case, but if you have other solutions that would also be nice.

My task: I need to implement an iterative algorithm with multiprocessing, while each of these processes need to have a copy of data(this data is large, and read-only, and won't change during the iterative algorithm).

I've written some pseudo code to demonstrate my idea:

import multiprocessing


def worker_func(data, args):
    # do sth...
    return res

def compute(data, process_num, niter):
    data
    result = []
    args = init()

    for iter in range(niter):
        args_chunk = split_args(args, process_num)
        pool = multiprocessing.Pool()
        for i in range(process_num):
            result.append(pool.apply_async(worker_func,(data, args_chunk[i])))
        pool.close()
        pool.join()
        # aggregate result and update args
        for res in result:
            args = update_args(res.get())

if __name__ == "__main__":
    compute(data, 4, 100)

The problem is in each iteration, I have to pass the data to subprocess, which is very time-consuming.

I've come up with two potential solutions:

share data among processes (it's ndarray), that's the title of this question.
Keep subprocess alive, like a daemon process or something...and wait for call. By doing that, I only need to pass the data at the very beginning.

So, is there any way to share a read-only numpy array among process? Or if you have a good implementation of solution 2, it also works.

Thanks in advance.

score 5 · Accepted Answer · answered Feb 07 '19 at 23:21

If you absolutely must use Python multiprocessing, then you can use Python multiprocessing along with Arrow's Plasma object store to store the object in shared memory and access it from each of the workers. See this example, which does the same thing using a Pandas dataframe instead of a numpy array.

If you don't absolutely need to use Python multiprocessing, you can do this much more easily with Ray. One advantage of Ray is that it will work out of the box not just with arrays but also with Python objects that contain arrays.

Under the hood, Ray serializes Python objects using Apache Arrow, which is a zero-copy data layout, and stores the result in Arrow's Plasma object store. This allows worker tasks to have read-only access to the objects without creating their own copies. You can read more about how this works.

Here is a modified version of your example that runs.

import numpy as np
import ray

ray.init()

@ray.remote
def worker_func(data, i):
    # Do work. This function will have read-only access to
    # the data array.
    return 0

data = np.zeros(10**7)
# Store the large array in shared memory once so that it can be accessed
# by the worker tasks without creating copies.
data_id = ray.put(data)

# Run worker_func 10 times in parallel. This will not create any copies
# of the array. The tasks will run in separate processes.
result_ids = []
for i in range(10):
    result_ids.append(worker_func.remote(data_id, i))

# Get the results.
results = ray.get(result_ids)

Note that if we omitted the line data_id = ray.put(data) and instead called worker_func.remote(data, i), then the data array would be stored in shared memory once per function call, which would be inefficient. By first calling ray.put, we can store the object in the object store a single time.

Nice solution! Personally I would prefer `ray`, very neat solution with only few lines of code. It seems that ray will schedule the multiprocessing part for me. — Ziqi Liu, Feb 08 '19 at 04:42
Cool! How would this work if we want to pass a list of arguments, lets say we have `x = [1,2,3,4,5,6,7,8,9,10,11,12]`. Cause now you just pass `i`, but what if we need to pass the items from `x`? Perhaps now we could do `x[i]` however we have 12 items not 10 for example — CodeNoob, Jan 15 '21 at 16:16

Pavan Chandaka · Answer 2 · 2019-02-07T19:54:05.740

0

Conceptually for your problem, using mmap is a standard way. This way, the information can be retrieved from mapped memory by multiple processes

Basic understanding of mmap:

https://en.wikipedia.org/wiki/Mmap

Python has "mmap" module(import mmap)

The documentation of python standard and some examples are in below link

https://docs.python.org/2/library/mmap.html

edited Feb 07 '19 at 19:54

answered Feb 07 '19 at 19:48

Pavan Chandaka

11,671
5
26
34

python3 multiprocess shared numpy array(read-only)

2 Answers2

Linked