I have been reading about how to share a large object held by a parent process with child processes that need read-only access. Perhaps I am missing some important posts, but I cannot find a clean answer for operations that depend on the whole object: the answers always involve chunking it and distributing the pieces to workers, or piping it through filters.
Let's consider the following problem. I have one million 300-dimensional vectors of floats, loaded into numpy arrays of `float32` dtype. Each vector is associated with a string key. Let's store this key-value (string-to-array) relation in a `dict`. Now I have a large in-memory object.
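For concreteness, the object looks roughly like this (a minimal sketch; `vector_dict` and the key format are hypothetical, and the count is shrunk so the snippet runs quickly):

```python
import numpy as np

DIM = 300
NUM_VECTORS = 1_000  # one million in the real scenario; shrunk here

# string key -> float32 vector, all held in one in-memory dict
vector_dict = {
    f"key{i}": np.random.rand(DIM).astype(np.float32)
    for i in range(NUM_VECTORS)
}
```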
Let's consider my parent process to be a server process. It receives queries of the form `"key1,key2"` and is expected to calculate and return the Euclidean distance between the vectors for key1 and key2. Child processes are spawned to handle multiple queries concurrently.
One cannot chunk the dictionary, because user queries can involve any pair of keys. The children also need to handle cases like key-not-found; a sketch of such a handler follows.
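A single query handler might look like this (a minimal sketch; `handle_query` is a hypothetical name, and the dict is passed in explicitly for now):

```python
import numpy as np

def handle_query(query: str, vector_dict: dict) -> str:
    """Parse a "key1,key2" query and return the Euclidean distance,
    or an error message if either key is missing."""
    key1, key2 = query.split(",")
    try:
        v1 = vector_dict[key1]
        v2 = vector_dict[key2]
    except KeyError as missing:
        return f"key not found: {missing.args[0]}"
    return str(np.linalg.norm(v1 - v2))
```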
How do I handle child processes that depend, in read-only mode, on the whole dictionary in `multiprocessing`?
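For concreteness, the naive version I would like to avoid looks roughly like this (a minimal sketch, assuming `vector_dict` and `handle_query` from the sketches above live in the same module): the dict is shipped along with every task, so `multiprocessing` pickles it in the parent and deserializes it in a child again and again.

```python
from multiprocessing import Pool

if __name__ == "__main__":
    queries = ["key0,key1", "key2,no_such_key"]
    with Pool(processes=4) as pool:
        # vector_dict rides along with every task, so it is pickled
        # and deserialized once per task, which is exactly the
        # overhead I would like to avoid
        results = pool.starmap(handle_query,
                               [(q, vector_dict) for q in queries])
    print(results)
```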
Updates:

- The numbers in this question are hypothetical.
- The `dict` can be replaced by any object that is hashtable-like and maintains the key-value relation.
- It would be nice if the object could be read and deserialized only once.
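What I would like to achieve, roughly: load once in the parent, then let every child read the same object. The closest I have come is relying on `fork` copy-on-write semantics via a module-level global (a minimal sketch, assuming a platform where the `fork` start method is available, e.g. Linux; whether reference counting defeats the copy-on-write here is part of what I am asking):

```python
import multiprocessing as mp
import numpy as np

vector_dict = None  # filled exactly once in the parent, before forking

def handle_query(query):
    # children see vector_dict through fork's copy-on-write pages
    key1, key2 = query.split(",")
    try:
        return str(np.linalg.norm(vector_dict[key1] - vector_dict[key2]))
    except KeyError as missing:
        return f"key not found: {missing.args[0]}"

if __name__ == "__main__":
    # load/deserialize the big object exactly once, in the parent
    vector_dict = {f"key{i}": np.random.rand(300).astype(np.float32)
                   for i in range(1_000)}
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(handle_query, ["key0,key1", "key2,no_such_key"]))
```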