I am assuming you are able to load the whole dataset into RAM in a numpy array, and that you are working on Linux or a Mac. (If you are on Windows or you can't fit the array into RAM, then you should probably copy the array to a file on disk and use numpy.memmap to access it. Your computer will cache the data from disk into RAM as well as it can, and those caches will be shared between processes, so it's not a terrible solution.)
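For the memmap route, the pattern looks something like this (a minimal sketch; the filename, dtype, and shape are placeholders):

import numpy as np

# write the dataset to a file once (placeholder name and shape)
np.arange(12, dtype=np.float64).tofile('big_data.bin')

# each process can then map the same file; mode='r' gives read-only access,
# and the OS page cache backing the mapping is shared between processes
big_data = np.memmap('big_data.bin', dtype=np.float64, mode='r', shape=(3, 4))
print(big_data[0, 0])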
Under the assumptions above, if you need read-only access to the dataset in other processes created via multiprocessing, you can simply create the dataset and then launch the other processes. Because the children are created with fork (the default on Linux; on macOS, Python 3.8+ defaults to spawn, so you may need multiprocessing.set_start_method('fork')), they inherit the parent's address space and can read data from the original namespace. They can even alter that data, but the changes won't be visible to any other process: the operating system copies each memory page they modify into their own address space (copy-on-write).
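You can convince yourself of this with a small sketch (it assumes the fork start method, so the child inherits big_data rather than re-creating it):

import multiprocessing
import numpy as np

big_data = np.zeros(5)

def child():
    # the child can read the parent's array directly...
    print("child sees:", big_data)
    # ...and can even write to it, but only its own copy-on-write
    # pages change; the parent never sees this assignment
    big_data[0] = 99

if __name__ == '__main__':
    p = multiprocessing.Process(target=child)
    p.start()
    p.join()
    print("parent still sees:", big_data)  # still all zeros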
If your other processes need to alter the original dataset and make those changes visible to the parent process or other processes, you could use something like this:
import multiprocessing
import numpy as np
# create your big dataset
big_data = np.zeros((3, 3))
# allocate a shared-memory buffer the same size as big_data's underlying data
# (the element type of the buffer doesn't matter, so 'c' (char) is easiest)
# With lock=True (the default) you'd get a synchronized wrapper rather than
# the raw array, and numpy can't wrap that directly (you'd have to go
# through buf.get_obj()).
# Note: you will need to set up your own method to synchronize access to big_data.
buf = multiprocessing.Array('c', big_data.nbytes, lock=False)

# Array() allocates a *new* block of shared memory, so big_data and buf do
# not overlap. Wrap the shared block in a numpy view, copy the data across,
# and rebind big_data so all further access goes through shared memory:
shared = np.frombuffer(buf, dtype=big_data.dtype).reshape(big_data.shape)
shared[:] = big_data
big_data = shared
# now you can update big_data from any process:
def add_one_direct():
    big_data[:] = big_data + 1

def add_one(a):
    # With the fork start method, Process() never pickles this argument:
    # the child inherits it directly, and because the array's buffer lives
    # in shared memory, the change below is visible to the parent.
    a[:] = a + 1

print "starting value:"
print big_data
p = multiprocessing.Process(target=add_one_direct)
p.start()
p.join()
print "after add_one_direct():"
print big_data
p = multiprocessing.Process(target=add_one, args=(big_data,))
p.start()
p.join()
print "after add_one():"
print big_data
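If everything is wired up correctly, the output shows big_data going from all zeros to all ones to all twos: both the write through the inherited module namespace and the write through the passed argument land in the same shared buffer.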