
I'm doing something like this:

from multiprocessing import Process, Queue

def func(queue):
    # do stuff to build up sub_dict
    queue.put(sub_dict)

main_dict = {}
num_processes = 16
processes = []
queue = Queue()
for i in range(num_processes):
    proc = Process(target=func, args=(queue,))  # pass the queue to the worker
    processes.append(proc)
    proc.start()

# drain the queue before joining; a worker blocks in put() until its
# large result has been read out of the underlying pipe
for proc in processes:
    main_dict.update(queue.get())

for proc in processes:
    proc.join()

Each sub_dict is something like 62,500 keys long, and each value is a several-page document of words split into a numpy array.

What I've found is that the whole script often gets stuck towards the end of the func executions. func takes about 25 minutes to run in each process (and I have 16 cores), but then I need to wait another hour before everything is done.

On another post, commenters suggested that it's probably because of the overhead of multiprocessing: those huge sub_dicts need to be pickled and unpickled to rejoin the main process.
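
One rough way to check that serialization really is the bottleneck (rather than func itself) is to time pickling a representative sub_dict in isolation. This is just a sketch; the key count matches my case, but the array length and vocabulary size are made-up stand-ins:

import pickle
import time

import numpy as np

# stand-in sub_dict: ~62,500 keys, each value an array of word indices
# (2,000 indices per document is an arbitrary guess)
sub_dict = {
    f"file_{i}.txt": np.random.randint(0, 50_000, size=2_000, dtype=np.uint32)
    for i in range(62_500)
}

start = time.perf_counter()
payload = pickle.dumps(sub_dict, protocol=pickle.HIGHEST_PROTOCOL)
print(f"pickled {len(payload) / 1e6:.0f} MB in {time.perf_counter() - start:.1f} s")

start = time.perf_counter()
pickle.loads(payload)
print(f"unpickled in {time.perf_counter() - start:.1f} s")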

Apart from me coming up with my own data compression scheme, are there any handy ways to get around this problem?

More context

What I'm doing here is chunking a really large array of file names into 16 pieces and sending them to func. Then func opens those files, extracts the content, preprocesses it, and puts it in a sub_dict as {filename: content}. Then that sub_dict comes back to the main process to be added into main_dict. It's not the pickling of the original array chunks that's expensive; it's the pickling of the incoming sub_dicts.
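
For concreteness, the structure is roughly this (the file reading and preprocessing below are simplified placeholders, not my actual code):

from multiprocessing import Process, Queue

import numpy as np

def func(filenames, queue):
    # worker: read each file, preprocess it, send back {filename: content}
    sub_dict = {}
    for filename in filenames:
        with open(filename) as f:
            text = f.read()
        # placeholder preprocessing: in reality the words also get encoded
        # as integer indices
        sub_dict[filename] = np.array(text.split())
    queue.put(sub_dict)

if __name__ == "__main__":
    all_filenames = []  # the really large array of file names goes here
    num_processes = 16
    chunks = [all_filenames[i::num_processes] for i in range(num_processes)]

    queue = Queue()
    processes = [Process(target=func, args=(chunk, queue)) for chunk in chunks]
    for proc in processes:
        proc.start()

    main_dict = {}
    for _ in processes:
        main_dict.update(queue.get())
    for proc in processes:
        proc.join()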

EDIT

Doesn't solve the actual question here, but I found out what my real issue was: I was running into swap because I underestimated the memory usage relative to the fairly small on-disk size of the dataset I was processing. Doubling the memory on my VM sorted the main issue.

Alexander Soare
  • You could chunk the large dict and then, instead of sending the large dict to one process, only send chunks of a dict. See here for how to chunk it: https://stackoverflow.com/questions/22878743/how-to-split-dictionary-into-multiple-dictionaries-fast – drops Aug 13 '20 at 07:40
  • @drops I think I get you. What I'm doing here is chunking a really large array of file names into 16 pieces and sending them to `func`. Then func opens those files, extracts the content, preprocesses it, and puts it in a dict with `{filename: content}`. Then that dict needs to come back to the main process. Not sure what further chunking would do, or am I missing something? – Alexander Soare Aug 13 '20 at 07:44
  • "Apart from me coming up with my own data compression scheme, are there any handy ways to get around this problem?" No, not really. Sharing state and multiprocessing is always hairy. I would give it a shot with a `Pool` and `Pool.map` though – juanpa.arrivillaga Aug 13 '20 at 07:49
  • @juanpa.arrivillaga thanks for pitching in again! So does pool not do the pickle/depickle to pass objects around? While we're at it, what do you think the best resources is to learn about this sort of thing? – Alexander Soare Aug 13 '20 at 07:52
  • 1
    No, it does, that is *always* going to have to happen with multiprocessing. But, it manages a pool of workers for you, and does the chunking. As for further reading, I would start with the documentation, and pay close attention to [this section](https://docs.python.org/3.8/library/multiprocessing.html#sharing-state-between-processes) – juanpa.arrivillaga Aug 13 '20 at 08:08
  • So, the shared memory approach may be amenable to your use-case. But that really depends on specifics. – juanpa.arrivillaga Aug 13 '20 at 08:09
  • Great, very appreciated. That makes sense. Since these are words, I'm encoding them to indices in np.uint32, which cuts down the size a lot. I suppose it is very problem dependent. – Alexander Soare Aug 13 '20 at 08:13
  • If you are running into memory issues, you can try pickling the sub_dict to a file on the local hard drive, and pass the file location in the queue instead. This will save memory, since the main process can open and add each sub_dict whenever a worker completes – Charchit Agarwal Oct 16 '22 at 21:17
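
For reference, a minimal sketch of the `Pool` / `Pool.map` approach suggested above (load_and_preprocess is a placeholder for the per-file work, not code from the question). The results still get pickled on the way back, but the pool manages the workers and the chunking:

from multiprocessing import Pool

import numpy as np

def load_and_preprocess(filename):
    # placeholder per-file worker: return (filename, preprocessed content)
    with open(filename) as f:
        return filename, np.array(f.read().split())

if __name__ == "__main__":
    all_filenames = []  # the large list of file names goes here

    with Pool(processes=16) as pool:
        # chunksize batches tasks per worker to keep IPC overhead down
        pairs = pool.map(load_and_preprocess, all_filenames, chunksize=256)

    main_dict = dict(pairs)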

0 Answers