
For each of the 4 processes of my parallelized big-computation job, I would like to test whether some number is a member of a big 4 GB set S.

When using this method, the problem is that `t = Process(target=somefunc, args=(S,))` passes the 4 GB of data to each process, which is too much for my computer (4 * 4 = 16 GB)!

How can I use S as a global variable in this multiprocessing job, instead of having to pass (and duplicate) S to each process?

from multiprocessing import Process
from random import randint

def somefunc(S):
    a = randint(0, 100)           # simplified example
    print(a in S)
    return 

def main():
    S = set([1, 2, 7, 19, 13])   # here it's a 4 GB set in my real program

    for i in range(4):
        t = Process(target=somefunc, args=(S,))
        t.start()
    t.join()

if __name__ == '__main__':
    main()

Note: I've already thought about using a database + client/server (or even just SQLite), but I really want the speed of set/dict lookup, which is orders of magnitude faster than a database call.

Basj
    Related: https://stackoverflow.com/q/14124588/3901060 – FamousJameous Jan 03 '18 at 22:15
  • Can you use threads, and can your system hold 4GB? (these are two questions, actually) Because one thing that comes to mind is using threads, which do share memory (as opposed to processes, where you're forced to feed the 4GB dataset to each one) – Savir Jan 03 '18 at 22:16
    I'd rather split the set into chunks and do the lookup for a single number in different processes (for each chunk) at the same time. – tamasgal Jan 03 '18 at 22:17
  • Use a [Bloom filter](https://pypi.python.org/pypi/bloom-filter/1.3)? Bloom filters can have false positives, so use this data structure to produce a candidate list, then pass this (presumably much shorter) list through the traditional set in a single process to remove the false positives. – kindall Jan 03 '18 at 22:20
  • @kindall Still with a bloom filter, you need to share it (and duplicate in RAM) to the different processes, right? Maybe I missed something, but how does such a filter solve the share-variable-to-different-process problem? Thanks for pointing a bloom filter by the way. – Basj Jan 03 '18 at 22:21
  • @BorrajaX: no, with threads it will still use only 1 core of my 4-core CPU, which is a shame. – Basj Jan 03 '18 at 22:23
  • It's true that you'd still need to share a Bloom filter among processes, but it should be much smaller. – kindall Jan 03 '18 at 22:47
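As a rough illustration of kindall's Bloom-filter suggestion above, here is a minimal sketch. It assumes the `bloom-filter` package from PyPI (the `BloomFilter(max_elements=..., error_rate=...)` constructor and `add`/`in` interface are taken from that package's documentation) and assumes the filter object can be handed to the workers (inherited on fork, pickled on spawn). Only the small filter is shared; candidates that pass it are re-checked against the full set in the parent:

from bloom_filter import BloomFilter   # pip install bloom-filter  (assumed package)
from multiprocessing import Process, Queue
from random import randint

def worker(bf, out_q, n_lookups=100):
    # Each worker receives only the (small) Bloom filter, not the 4 GB set.
    hits = [a for a in (randint(0, 100) for _ in range(n_lookups)) if a in bf]
    out_q.put(hits)                     # may contain false positives

if __name__ == '__main__':
    S = set([1, 2, 7, 19, 13])          # stand-in for the real 4 GB set
    bf = BloomFilter(max_elements=len(S), error_rate=0.01)
    for x in S:
        bf.add(x)

    out_q = Queue()
    procs = [Process(target=worker, args=(bf, out_q)) for _ in range(4)]
    for p in procs:
        p.start()
    candidates = [a for _ in procs for a in out_q.get()]   # drain before join
    for p in procs:
        p.join()

    for a in candidates:
        print(a, a in S)                # exact check removes false positives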

1 Answer


What about using joblib.Parallel?

from joblib import Parallel, delayed
from random import randint

S = set(range(50000000))  # ~3.5 GB

def somefunc():
    a = randint(0, 100)           # simplified example
    print(a in S)
    return

def main():
    out = Parallel(n_jobs=4, verbose=1)(
        delayed(somefunc)() for i in range(50))

if __name__ == '__main__':
    main()

I may be out of my league here, but this doesn't duplicate the set. In testing this out, the memory used by Python 3.6 for this script never exceeded 4 GB.
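One hedged way to verify that kind of claim yourself is to sample each process's resident set size, for example with the third-party `psutil` package (its `memory_info().rss` field is its documented way to read RSS). Note that copy-on-write pages inherited from the parent still show up in a child's RSS on Linux, so the numbers need careful interpretation:

import os
import psutil   # third-party; pip install psutil

def report_memory(label):
    # Print the resident set size of the current process in GB.
    rss = psutil.Process(os.getpid()).memory_info().rss
    print('%s (pid %d): %.2f GB resident' % (label, os.getpid(), rss / 1024**3))

# Call report_memory("worker") inside somefunc() and report_memory("parent")
# in main() to compare the parent and child footprints.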

Alternatively, you could just use S as a global variable without passing it to somefunc:

from multiprocessing import Process
from random import randint

def somefunc():
    a = randint(0, 100)
    print(a in S)
    return

def main():
    for i in range(4):
        t = Process(target=somefunc)
        t.start()
    t.join()

if __name__ == '__main__':
    S = set(range(50000000))  # here it's a 4 GB set in the real program
    main()

As far as I can tell from testing, both of these methods produce the correct output and neither duplicates S.

Grr
  • It seems to work, however [this leads](https://pythonhosted.org/joblib/parallel.html) to a problem: `Under Windows, it is important to protect the main loop of code to avoid recursive spawning of subprocesses when using joblib.Parallel. In other words, you should be writing code like this: [...] No code should run outside of the “if __name__ == ‘__main__’” blocks, only imports and definitions.`. – Basj Jan 03 '18 at 22:44
  • @Basj Perhaps I have missed something. Aside from the definition of `S` what exactly is running outside of the main loop? – Grr Jan 03 '18 at 22:47
  • @Grr: The code is reasonable. However, on Windows, each process will need to independently create a new 4GB `S`, because each `Process` instance loads that main.py file. There's no `fork` so no chance to share one big copy-on-write `S` across the four subprocesses. (`S` won't exist in the sub-processes, in the second version where `S` is set under `if __name__ == '__main__':`.) – torek Jan 04 '18 at 01:58
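
Following up on torek's comment, here is a minimal sketch of making the copy-on-write sharing explicit on a POSIX system (Linux/macOS) by requesting the standard-library `fork` start method via `multiprocessing.get_context`. This is an illustrative addition, not part of the answer above, and it will not work on Windows, where `fork` is unavailable:

# POSIX-only sketch: fork-based workers inherit S from the parent via
# copy-on-write instead of re-creating or re-pickling it.
import multiprocessing as mp
from random import randint

S = set(range(50000000))          # built once in the parent, before forking

def somefunc():
    a = randint(0, 100)
    # The child reads the parent's pages via copy-on-write; refcount updates
    # may copy a few pages, but the set is never duplicated wholesale.
    print(a in S)

if __name__ == '__main__':
    ctx = mp.get_context('fork')  # raises ValueError on Windows
    procs = [ctx.Process(target=somefunc) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()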