
I have been reading about how to share a large object in a parent process with child processes that need read-only access. Perhaps I am missing some important posts, but I can't seem to find a clean answer for operations that depend on the whole object (the answers I find always involve chunking the data and distributing it to workers, or piping it through filters).

Let's consider the following problem. I have one million 300-dimensional vectors of floats, loaded into numpy arrays of dtype float32. Each vector is associated with a string key.

Let's store this key-value (string-to-array) relation in a dict. Now I have a large in-memory object.

Let's consider my parent process to be a server process. It receives queries of the form "key1,key2" and is expected to calculate and return the Euclidean distance between the vectors for key1 and key2. Child processes are spawned to handle multiple queries concurrently.
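For concreteness, here is a minimal sketch of the setup and a query handler. The keys and data are hypothetical stand-ins (a small random dict instead of the real million-entry object):

    import numpy as np

    # Hypothetical data: string keys mapping to 300-dimensional float32 vectors.
    # A small count is used here; the real object would hold a million entries.
    vectors = {"key%d" % i: np.random.rand(300).astype(np.float32)
               for i in range(1000)}

    def handle_query(query):
        """Answer a "key1,key2" query with the Euclidean distance, or None."""
        key1, key2 = query.split(",")
        try:
            return float(np.linalg.norm(vectors[key1] - vectors[key2]))
        except KeyError:
            return None  # the key-not-found case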

One cannot chunk the dictionary, because user queries can involve any keys. The children also need to handle cases such as key-not-found.

How do I give child processes read-only access to the whole dictionary with multiprocessing?

Updates:

  1. Numbers in this question are hypothetical.
  2. The dict can be replaced by any hashtable-like object that maintains the key-value relation.
  3. It would be nice if the object could be read and deserialized only once.
Patrick the Cat
  • One thousand 200-length vectors in single precision is only about 800 kB. Just send it through a `multiprocessing.Queue`? Or are these figures just an example and is the actual data a lot bigger? –  Nov 22 '15 at 21:46
  • @morningsun Let's say one million 300-d vectors. Numbers in this question are hypothetical. – Patrick the Cat Nov 22 '15 at 21:47
  • Is there any reason why the data need to be stored in a dict? HDF5 and `np.memmap` support parallel reads. – ali_m Nov 22 '15 at 22:00
  • @ali_m I do need a hashtable-like object for key-value relations. For this reason `np.memmap` is not entirely useful. It would be nice if you could answer with how to use HDF5 and query it from a child process without a large memory overhead. – Patrick the Cat Nov 22 '15 at 22:09
  • You could memmap a structured array, and h5py's interface is also dict-like. In fact, I suppose you could use a standard dict that just holds references to data stored on disk, e.g. in a memmapped array/HDF5 array etc. – ali_m Nov 22 '15 at 22:20
  • @ali_m If that's possible, then it is one way to solve this. But it also means I have to do two disk IOs for each query, and the IO overhead may be large. Is it possible to read all key-vector pairs into memory once and then share them without copying? – Patrick the Cat Nov 22 '15 at 22:42
  • Have you seen this method? http://stackoverflow.com/a/17786444 It's probably the easiest way, IF you're on Linux/OSX and the data set is fixed before spawning the child processes (see the sketch after these comments). –  Nov 22 '15 at 22:45
  • It's probably not as bad as you think - `np.memmap` and HDF5 will do their own caching into RAM, so repeated reads would not necessarily require any IO. – ali_m Nov 22 '15 at 22:49
  • I think more specifics are needed, like 1) what is your expected throughput? (how many of these pair-wise distance calculations per second do you need to do?) 2) instead of talking about "examples", the specifics of your *actual* data do matter a lot... what are the dimensions? how many? 3) do you have an SSD? 4) do you have a GPU? 5) is your data sparse? All of these questions will lead to different answers. – user1269942 Dec 05 '18 at 23:34
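The method referenced in the comment above boils down to the following sketch: the parent reads and deserializes the object exactly once, and children created with the fork start method inherit it through copy-on-write memory, so nothing is pickled per child. This assumes Linux/OSX (fork), and the names and data are hypothetical:

    import numpy as np
    from multiprocessing import Pool

    vectors = None  # populated once in the parent, before the pool forks

    def load_vectors():
        # Stand-in for reading and deserializing the real object exactly once.
        global vectors
        vectors = {"key%d" % i: np.random.rand(300).astype(np.float32)
                   for i in range(1000)}

    def distance(query):
        # Runs in a child; `vectors` is inherited via fork, not sent to it.
        key1, key2 = query.split(",")
        try:
            return float(np.linalg.norm(vectors[key1] - vectors[key2]))
        except KeyError:
            return None

    if __name__ == "__main__":
        load_vectors()             # deserialize once, in the parent
        with Pool(4) as pool:      # forked children see the dict read-only
            print(pool.map(distance, ["key1,key2", "key3,nosuchkey"]))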

1 Answer


You can use a `dict()` from `Manager`, as in the following code:

from multiprocessing import Process, Manager


def f(d, l):
    # Mutations made through the proxies are visible to the parent.
    d[1] = '1'
    d['2'] = 2
    d[0.25] = None
    l.reverse()

if __name__ == '__main__':
    with Manager() as manager:
        d = manager.dict()           # proxy to a dict held in the manager process
        l = manager.list(range(10))  # proxy to a list held in the manager process

        p = Process(target=f, args=(d, l))
        p.start()
        p.join()

        print(d)
        print(l)

It prints:

{0.25: None, 1: '1', '2': 2}
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

The `dict()` from `Manager` is shareable between multiple processes.
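Applied to the question's scenario (hypothetical keys and small stand-in arrays), a child can look vectors up through the proxy as sketched below. Note the trade-off: the dict lives in a separate manager process, and every `d[key]` access pickles the value and ships it over a connection, so each read pays IPC overhead:

    import numpy as np
    from multiprocessing import Process, Manager

    def worker(d, query, results):
        key1, key2 = query.split(",")
        try:
            # Each d[key] lookup fetches a pickled copy from the manager process.
            results.append(float(np.linalg.norm(d[key1] - d[key2])))
        except KeyError:
            results.append(None)

    if __name__ == "__main__":
        with Manager() as manager:
            d = manager.dict()
            d["a"] = np.zeros(300, dtype=np.float32)
            d["b"] = np.ones(300, dtype=np.float32)
            results = manager.list()

            p = Process(target=worker, args=(d, "a,b", results))
            p.start()
            p.join()

            print(list(results))  # [17.320508...]  (i.e. sqrt(300))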

soulmachine