I am trying to exploit parallelism when parsing data with Python 2.7 using the multiprocessing library. The task at hand is reading lots of large data files and returning their content as a dictionary or list. The problem is that when my function returns the parsed data, my machine (running Ubuntu Linux) hangs, with the memory and load indicators maxed out.
The code does something like this:
import multiprocessing as mp

def worker(filex):
    """Read one large data file and return its content as a dictionary."""
    # keep only the lines with more than three whitespace-separated fields
    raw = filter(lambda x: len(x.split()) > 3,
                 open(filex).readlines())
    data = {}
    # ... putting all the data in the data dictionary
    return data

# multiprocessing options
nproc = mp.cpu_count()
pool = mp.Pool(processes=nproc)
traj = pool.map(worker, tuple(files_to_parse))
pool.close()
The large data structure is what creates the problem. Interestingly, the code works if I return something else, and it even works if I return that same data structure while it is empty. Using a list instead of a dictionary for data did not help.
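For what it is worth, here is a self-contained sketch that reproduces the behaviour with synthetic data; fake_worker, the payload sizes, and the task count are made up for illustration and are not my real parsing code:

import multiprocessing as mp

def fake_worker(n):
    # build a dictionary of n entries to stand in for one parsed file
    return dict((i, 'x' * 100) for i in xrange(n))

if __name__ == '__main__':
    pool = mp.Pool(processes=mp.cpu_count())
    # each task returns one large dictionary through pool.map
    results = pool.map(fake_worker, [10 ** 6] * 8)
    pool.close()
    pool.join()
    print(sum(len(d) for d in results))

With small n this finishes fine; scaling n up reproduces the memory blow-up I see with the real files.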
Any suggestions?