
I have read this post, Python multiprocessing: sharing a large read-only object between processes?, but I'm still not sure how to proceed.

Here is my problem:

I am analysing an array of millions of strings using multiprocessing, and each string needs to be checked against a big dict with about 2 million (possibly more) keys. Its values are objects of a custom Python class called Bloomfilter (so they're not just simple ints, floats, or arrays), and their sizes vary from a few bytes to 1.5 GB. The analysis for each string is basically to check whether the string is in a certain number of bloomfilters in the dictionary; which bloomfilters are relevant depends on the string itself. The dictionary is a transformation of a 30 GB sqlite3 database. The motivation is to load the whole sqlite3 database into memory to speed up processing, but I haven't found a way to share the dict effectively. I have about 100 GB of memory in my system.
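
To make the workload concrete, here is a minimal sketch of the per-string check (relevant_keys and the use of the `in` operator on a Bloomfilter are simplified placeholders, not my real code):

    def relevant_keys(s):
        # placeholder: in reality the string itself determines which
        # bloomfilter keys have to be consulted
        return [s[:2]]

    def check_string(s, big_dict):
        # big_dict maps key -> Bloomfilter instance (a few bytes up to ~1.5 GB);
        # the check is pure CPU work plus read-only lookups
        return [k for k in relevant_keys(s)
                if k in big_dict and s in big_dict[k]]

    # toy usage, with sets standing in for bloomfilters:
    # check_string("example", {"ex": {"example", "extra"}}) -> ["ex"]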

Here is what I have tried:

The analysis for each string is CPU-bound, so I chose multiprocessing over multithreading. The key is how to share the big dict among the processes without copying it. multiprocessing.Value and multiprocessing.Array cannot deal with complex objects like a dict. I have tried multiprocessing.Manager(), but the dict is so big that I get an IOError: bad message length error. I have also tried an in-memory database like Redis on localhost, but the bitarray, which is used to construct a Bloomfilter after being fetched, is too big to fit either, which makes me think passing big messages among processes is just too expensive (is it?).
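
For reference, the Manager attempt looked roughly like this (a tiny bytearray stands in for a real Bloomfilter value):

    from multiprocessing import Manager, Process

    def worker(shared_dict, key):
        # every access pickles the value inside the manager process and ships
        # it back over a pipe, so multi-GB values are copied on every lookup
        value = shared_dict[key]
        print(len(value))

    if __name__ == '__main__':
        big_dict = {'some_key': bytearray(10 ** 6)}  # stand-in for a Bloomfilter
        manager = Manager()
        shared_dict = manager.dict(big_dict)  # each value is pickled into the manager
        proc = Process(target=worker, args=(shared_dict, 'some_key'))
        proc.start()
        proc.join()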

My Question:

What is the right way to share such a dictionary among different processes (or threads, if there is a way to circumvent the GIL)? If I need to use a database, which one should I use? I need very fast reads, and the database should be able to store very big values. (Though I don't think a database would work, because passing around very big values won't work, right? Please correct me if I am wrong.)

  • Are you using Unix? You could just fork() as many times as needed, which would give you a copy of the data structure in each process. If you're using Windows, it's probably more complex, as I don't think fork is implemented quite like that. – Max Sep 17 '15 at 18:36
  • Yes, I am on Linux. Could you give a bit more detail about `fork()`, or please point me to some reference? – zyxue Sep 17 '15 at 18:37
  • Ok, keep in mind that fork is a very low-level tool, but it's in the os module: https://docs.python.org/2/library/os.html#os.fork (same idea in Python 3). It literally clones the process, so they all have the same data; if you do not do any write operations on the data structure afterwards, the processes should share the relevant memory pages. You may then have to write your own methods to tally up your calculated data at the end. – Max Sep 17 '15 at 18:40
  • Looks interesting. I am really out of ideas, and will take a look. I have heard about the copy-on-write thing. – zyxue Sep 17 '15 at 18:41
  • It's really low level; is there any other option? – zyxue Sep 17 '15 at 18:54
  • On Unix, multiprocessing uses fork to create subprocesses. So if you only need read access, just make the dictionary a global and then carry on as normal. – Dunes Sep 18 '15 at 07:13
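
A minimal sketch of what Dunes' suggestion above might look like on Linux (a toy dict stands in for the real 30 GB one, and the per-string analysis is omitted):

    from multiprocessing import Process

    # module-level dict, filled in the parent before the workers are forked;
    # on Linux the children inherit it via copy-on-write rather than a pickled copy
    big_dict = {}

    def worker(keys):
        # read-only lookups only: writing to big_dict would force the touched
        # pages to be copied into the child
        for k in keys:
            _ = big_dict.get(k)

    if __name__ == '__main__':
        big_dict.update((str(i), i) for i in range(1000))  # stand-in for the real dict
        procs = [Process(target=worker, args=(list(big_dict),)) for _ in range(4)]
        for proc in procs:
            proc.start()
        for proc in procs:
            proc.join()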

1 Answer


It turns out that both @Max and @Dunes are correct, but I don't need to call os.fork() directly or use a global variable. Some pseudo-code is shown below; as long as big_dict isn't modified in the worker, there appears to be only one copy in memory. However, I am not sure whether this copy-on-write behaviour is universal across Unix-like OSes. The OS I am running my code on is CentOS release 5.10 (Final).

from multiprocessing import Process, Lock

def worker(pid, big_dict, lock):
    # big_dict MUST NOT be modified in the worker: as long as its pages are
    # only read, copy-on-write keeps them shared with the parent process
    pass  # do some heavy work here

def main():
    big_dict = init_a_very_big_dict()  # placeholder: builds the dict from the sqlite3 db

    NUM_CPUS = 24
    lock = Lock()
    procs = []
    for pid in range(NUM_CPUS):
        proc = Process(target=worker, args=(pid, big_dict, lock))
        proc.daemon = True
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()

if __name__ == '__main__':
    main()
  • As a note: this probably doesn't work on Windows, which doesn't have process forking. – Max Sep 18 '15 at 20:03
  • @Max I have exactly the same situation, and it used to run on our Debian server, but now it does not work anymore... Even using the Manager (which does not give an error in my case), the memory keeps growing and growing... It looks like only part of the data is being copied, as the size grows slowly but steadily while the CPUs are working at up to 100%. Maybe some buffer + missing garbage collection? – Radio Controlled Jan 18 '17 at 10:15