I have read this post, Python multiprocessing: sharing a large read-only object between processes?, but I am still not sure how to proceed.
Here is my problem:
I am analysing an array of millions of strings using multiprocessing, and each string needs to be checked against a big dict of about 2 million (possibly more) keys. The values are objects of a custom Python class called Bloomfilter (so not simple ints, floats, or arrays), and their sizes vary from a few bytes to 1.5 GB. The analysis for each string is basically to check whether it is in a certain number of bloomfilters in the dictionary; the string itself determines which bloomfilters are relevant. The dictionary is a transformation of a 30 GB sqlite3 db. The motivation is to load the whole sqlite3 db into memory to speed up processing, but I haven't found a way to share the dict effectively. I have about 100 GB of memory in my system.
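To make the workload concrete, here is a minimal sketch of what the analysis looks like. The names and the empty containers are stand-ins for my real data, and I'm assuming membership testing on a Bloomfilter works via `in`:

```python
from multiprocessing import Pool

# Illustrative stand-ins: the real dict has ~2 million Bloomfilter values
# (built from the 30 GB sqlite3 db) and the real input is millions of strings.
big_dict = {}    # key -> Bloomfilter object
strings = []     # strings to analyse

def relevant_keys(s):
    # Application-specific: decide which bloomfilters matter for this string.
    return []

def analyse(s):
    # CPU-bound: membership test against each relevant bloomfilter.
    return [k for k in relevant_keys(s) if s in big_dict[k]]

if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(analyse, strings)
```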
Here is what I have tried:
The analysis for each string is CPU-bound, so I chose multiprocessing over multithreading. The key question is how to share the big dict among the processes without copying it. multiprocessing.Value and multiprocessing.Array cannot handle complex objects like a dict. I have tried multiprocessing.Manager(), but because the dict is so big, I get an IOError: bad message length error. I have also tried an in-memory database like Redis on localhost, but the bitarray that is fetched to reconstruct a Bloomfilter is too big to fit either, which makes me think that passing big messages among processes is just too expensive (is it?).
My Question:
What is the right way to share such a dictionary among different processes (or threads, if there is a way to circumvent the GIL)? If I need to use a database, which one should I use? I need very fast reads, and the database should be able to store very big values. (Though I don't think a database would work, because passing around very big values won't work, right? Please correct me if I am wrong.)