
I need to run many (1,000s to 10,000s of) search queries on a large dataset (>10 GB) in Python. To speed things up, I want to run the individual queries in parallel. However, as far as I understand, passing the dataset to different processes copies it, which increases the memory requirements and soon becomes infeasible. So I want to ask the community: is it possible to process a large dataset in multiple processes (functions) running in parallel without increasing the memory usage?

Below is a sample script. Here, as n increases, the memory usage grows and soon becomes restrictive.

from multiprocessing import Pool
import sys

def goo(d):
    for k, v in d.items():
        print(k, len(v))

d = {'a':[3]*(10**5),
     'b':[6]*(10**8)}

n = int(sys.argv[1])

with Pool(processes=n) as pool:
    # each element of [[d]]*n is pickled and sent to a worker,
    # so the dict ends up copied n times
    pool.starmap(goo, [[d]]*n)

Edit: Just to clarify, this is part of a tool that will be shared with other people working on different platforms and environments. So I need something Python-specific, as I do not want the solution to depend on external packages.

Manish Goel
  • For a dataset that large you should consider setting up a framework that is designed to handle the scale. MySQL or Apache Arrow pulling parquet files would be a good place to start. – James Aug 21 '19 at 10:31
  • but you have *a large dataset* beforehand, it's not accumulated dynamically via the `n` arg, is that right? – RomanPerekhrest Aug 21 '19 at 10:31
  • @James This is part of a tool which will be shared with other people. So I would need a Python-based solution in order to avoid dependencies. – Manish Goel Aug 21 '19 at 11:22

3 Answers


If you already have the dataset in memory and you want to avoid sending copies to other processes, just use shared memory as the IPC mechanism, if it is available in your OS.

Googling "python shmget" (shmget being the shared-memory system call on Linux) turns up the sysv_ipc library, which might be useful for you.
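If external packages are off the table, Python 3.8+ also ships multiprocessing.shared_memory in the standard library. Below is a minimal sketch of that alternative, assuming the dataset can be laid out as a flat bytes buffer that the workers search directly; the worker function, the queries and the sizes are made up for illustration:

from multiprocessing import Pool, shared_memory   # Python 3.8+
import re

def worker(args):
    shm_name, pattern = args
    # Attach to the existing block by name; the buffer itself is not copied,
    # the regex scans the shared pages directly.
    shm = shared_memory.SharedMemory(name=shm_name)
    hits = sum(1 for _ in re.finditer(pattern, shm.buf))
    shm.close()
    return pattern, hits

if __name__ == '__main__':
    data = b'some very large dataset ... ' * (10 ** 6)    # stand-in for the real data
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[:len(data)] = data

    queries = [b'very', b'large', b'missing']              # hypothetical search terms
    with Pool(processes=4) as pool:
        print(pool.map(worker, [(shm.name, q) for q in queries]))

    shm.close()
    shm.unlink()    # release the block once all workers are done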

Anler

Your first option is to use a Manager, which allows you to share your dictionary across processes. See this example on how to do that.
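A minimal sketch of the Manager route, adapted to the dictionary from the question (the process count is hard-coded here just for illustration):

from multiprocessing import Pool, Manager

def goo(shared_d):
    # shared_d is a proxy: the dict lives once, in the manager process,
    # and lookups are forwarded to it instead of copying d into every worker
    # (each value you fetch is still transferred to the worker that asked).
    for k in shared_d.keys():
        print(k, len(shared_d[k]))

if __name__ == '__main__':
    n = 4   # stand-in for int(sys.argv[1])
    with Manager() as manager:
        shared_d = manager.dict({'a': [3] * (10 ** 5),
                                 'b': [6] * (10 ** 8)})
        with Pool(processes=n) as pool:
            pool.map(goo, [shared_d] * n)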

Your second option is to create your processes with queues instead of data. This is my prefered way to handle big data. The idea behind is that the data is only accessed by the main thread which feeds it into a task queue where the data is then picked from by the sub-processes. It is in a way what Pool.map() is doing, but with queues it consumes much less memory and is equaly fast. It takes a bit more code to achieve that, but you have much more control. Here is an example on how to do that.
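Here is a minimal sketch of that pattern, again using the dictionary from the question; the chunking (one dict entry per task) and all names are just for illustration:

from multiprocessing import Process, Queue

def worker(task_q, result_q):
    # Each worker pulls one (key, chunk) item at a time, so only a few
    # chunks are in flight instead of n full copies of the dataset.
    while True:
        item = task_q.get()
        if item is None:           # poison pill: no more work
            break
        key, chunk = item
        result_q.put((key, len(chunk)))

if __name__ == '__main__':
    d = {'a': [3] * (10 ** 5),
         'b': [6] * (10 ** 8)}
    n = 4                          # number of worker processes

    task_q, result_q = Queue(maxsize=2 * n), Queue()
    workers = [Process(target=worker, args=(task_q, result_q)) for _ in range(n)]
    for w in workers:
        w.start()

    for item in d.items():         # only the main process holds the full dict
        task_q.put(item)
    for _ in workers:
        task_q.put(None)           # one pill per worker

    for _ in d:
        print(result_q.get())
    for w in workers:
        w.join()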

RaJa

I solved my issue by using a global variable for the dictionary.

from multiprocessing import Pool
import sys

def goo(i):
    # d is not passed as an argument; each worker reads the module-level
    # dict it inherited from the parent process
    global d
    for k, v in d.items():
        print(k, len(v))

d = {'a':[3]*(10**5),
     'b':[6]*(10**8)}

n = int(sys.argv[1])

with Pool(processes=n) as pool:
    # only the small index i is pickled and sent to each worker
    pool.map(goo, range(n))

I had read in multiple places that each process gets a 'copy' of the parent's memory, but with the default fork start method on Linux that copy is made lazily (copy-on-write): as long as the dict is only read, its pages are not physically duplicated. The memory blow-up in the original script came from passing d as an argument, which pickles it and sends a real copy to every worker; passing only the index i avoids that. Note that this relies on fork, so on platforms that use the spawn start method (e.g. Windows) each worker re-imports the module and rebuilds d, duplicating the memory again.

Manish Goel