I need to run many (1,000s to 10,000s of) search queries on a large dataset (>10 GB) in Python. To speed things up, I want to run the individual queries in parallel. However, as far as I understand, passing the dataset to different processes copies it, which increases the memory requirements and soon becomes infeasible. So I would like to ask the community: is it possible to process a large dataset in multiple parallel processes (functions) without increasing the memory usage?
Below is a sample script. Here, as n increases, the memory usage grows and soon becomes prohibitive.
from multiprocessing import Pool
import sys

def goo(d):
    # Each worker receives its own copy of the dictionary.
    for k, v in d.items():
        print(k, len(v))

d = {'a': [3] * (10**5),
     'b': [6] * (10**8)}

if __name__ == '__main__':
    n = int(sys.argv[1])
    with Pool(processes=n) as pool:
        # The same dict is passed to every worker, so it is copied n times.
        pool.starmap(goo, [[d]] * n)
Edit: Just to clarify, this is part of a tool that will be shared with other people working on different platforms and environments. So I need something that uses only the Python standard library, as I do not want the solution to depend on external packages.
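
For context, a minimal sketch of one direction I have considered (standard library only): keep the dataset as a module-level global so that worker processes started with the fork start method inherit it via copy-on-write instead of receiving a pickled copy as an argument. This assumes fork (the default on Linux); with the spawn start method each worker re-imports the module and rebuilds the dictionary, so memory would still multiply.

from multiprocessing import Pool
import sys

# Module-level dataset: under the 'fork' start method, workers inherit
# this object via copy-on-write instead of getting a pickled copy.
d = {'a': [3] * (10**5),
     'b': [6] * (10**8)}

def goo(_):
    # Read the inherited global rather than a passed-in argument.
    for k, v in d.items():
        print(k, len(v))

if __name__ == '__main__':
    n = int(sys.argv[1])
    with Pool(processes=n) as pool:
        pool.map(goo, range(n))

Even then, I am not sure this fully solves the problem, since CPython's reference counting writes to the shared pages and can gradually un-share them, so I would appreciate guidance on whether there is a better approach.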