I'm doing something like this:
from multiprocessing import Process, Queue

def func(queue):
    # do stuff to build up sub_dict
    queue.put(sub_dict)

main_dict = {}
num_processes = 16
processes = []
queue = Queue()

for i in range(num_processes):
    proc = Process(target=func, args=(queue,))
    processes.append(proc)
    proc.start()

# drain the queue before joining; each get() blocks until one
# worker's sub_dict has been pickled, sent over, and unpickled
for proc in processes:
    main_dict.update(queue.get())

for proc in processes:
    proc.join()
The sub_dicts each have something like 62,500 keys, and each value is a several-page document of words split into a numpy array.
What I've found is that the whole script tends to get stuck a lot towards the end of the executions of func. func takes about 25 minutes to run in each process (and I have 16 cores), but then I need to wait another hour before everything is done.
On another post, commenters suggested that it's probably because of the overhead of the multiprocessing. That is, those huge sub_dicts need to be pickled and unpickled to rejoin the main process.
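To get a sense of how big that transfer actually is, I can measure the pickled size of one sub_dict inside func right before it goes onto the queue (just a quick check, not part of the real pipeline):

import pickle

# rough check of how many bytes one sub_dict costs to send through
# the queue (the queue pickles it behind the scenes)
payload = pickle.dumps(sub_dict, protocol=pickle.HIGHEST_PROTOCOL)
print(f"sub_dict pickles to {len(payload) / 1e6:.1f} MB")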
Apart from me coming up with my own data compression scheme, are there any handy ways to get around this problem?
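One variation I've been considering (just a sketch; stream_files and collect are names I made up, and the numpy split stands in for my real preprocessing) is to stream one (filename, content) pair per put() instead of a single giant dict, so the main process can unpickle and merge results while the workers are still running rather than in one big burst at the end:

from multiprocessing import Process, Queue

import numpy as np

SENTINEL = None  # marks "this worker is finished"

def stream_files(filenames, queue):
    # send one (filename, content) pair per put() so each pickled
    # message stays small, instead of one giant sub_dict at the end
    for name in filenames:
        with open(name) as f:
            content = np.array(f.read().split())  # stand-in for my real preprocessing
        queue.put((name, content))
    queue.put(SENTINEL)

def collect(chunks):
    queue = Queue()
    procs = [Process(target=stream_files, args=(chunk, queue)) for chunk in chunks]
    for p in procs:
        p.start()

    main_dict = {}
    finished = 0
    while finished < len(procs):
        item = queue.get()
        if item is SENTINEL:
            finished += 1
        else:
            name, content = item
            main_dict[name] = content

    for p in procs:
        p.join()
    return main_dict

But I don't know if that actually saves anything overall, since the same total amount of data still has to be pickled either way.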
More context
What I'm doing here is chunking a really large array of file names into 16 pieces and sending them to func. Then func opens those files, extracts the content, preprocesses it, and puts it in a sub_dict with {filename: content}. Then that sub_dict comes back to the main process to be added into main_dict. It's not the pickling of the original array chunks that's expensive; it's the pickling of the incoming sub_dicts.
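For concreteness, the worker side looks roughly like this (the numpy split is a stand-in for my actual extraction/preprocessing):

import numpy as np

def func(filenames, queue):
    # filenames is one of the 16 chunks of the big file-name array
    sub_dict = {}
    for name in filenames:
        with open(name) as f:
            text = f.read()
        # stand-in for the real preprocessing; the stored value is
        # the document's words split into a numpy array
        sub_dict[name] = np.array(text.split())
    queue.put(sub_dict)  # this whole dict gets pickled on the way back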
EDIT
Doesn't solve the actual question here, but I found out what my real issue was. I was running into swap because I underestimated the memory usage compared to the relatively small on-disk size of the dataset I was processing. Doubling the memory on my VM sorted the main issue.