
Is it efficient to calculate many results in parallel with multiprocessing.Pool.map() in a situation where each input value is large (say 500 MB), but where the input values generally contain the same large object? I am afraid that the way multiprocessing works is by sending a pickled version of each input value to the worker processes in the pool. If no optimization is performed, this would mean sending a lot of data for each input value in map(). Is this the case? I quickly had a look at the multiprocessing code but did not find anything obvious.

More generally, what simple parallelization strategy would you recommend for doing a map() on, say, 10,000 values, each of them being a tuple (vector, very_large_matrix), where the vectors are always different but there are only, say, 5 different very large matrices?

PS: the big input matrices actually appear "progressively": 2,000 vectors are first sent along with the first matrix, then 2,000 vectors are sent with the second matrix, etc.
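
To make the concern concrete, here is a rough sketch of what I am worried about (the matrix is a small stand-in for my real 500 MB ones, and I am only guessing that map() really pickles each item):

import pickle
import numpy as np

# Small stand-in: the real matrices would be ~500 MB each.
matrix = np.zeros((2000, 2000))   # 2000 * 2000 * 8 bytes = 32 MB of float64
vector = np.zeros(2000)
item = (vector, matrix)

# If map() really pickles each input value, every single item would cost
# roughly the size of the whole matrix to send to a worker:
print(len(pickle.dumps(item, protocol=pickle.HIGHEST_PROTOCOL)))  # ~32 MB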

Eric O. Lebigot
  • I think that [Python multiprocessing: sharing a large read-only object between processes?](http://stackoverflow.com/q/659865/1132524) might have the answers you're looking for. – Rik Poggi Apr 20 '12 at 08:22
  • Thanks for your input. The difference between the question you refer to and this one is that the large matrices are essentially *input values* for map (the question linked to only uses a single, big object). Furthermore, it does not look like any of the solutions in the top answer is adapted to the case of this question. – Eric O. Lebigot Apr 20 '12 at 08:28

2 Answers


I think the obvious solution is to send a reference to the very_large_matrix instead of a clone of the object itself. If there are only five big matrices, create them in the main process. When the multiprocessing.Pool is instantiated, it creates a number of child processes that clone the parent process's address space. That means that if there are six processes in the pool, there will be (1 + 6) * 5 = 35 different matrices in memory simultaneously.

So in the main process create a lookup of all unique matrices:

matrix_lookup = {1: matrix(...), 2: matrix(...), ...}

Then pass the index of each matrix in the matrix_lookup along with the vectors to the worker processes:

pool = Pool(6)
pool.map(func, [(vector, 1), (vector, 2), (vector, 1), ...])
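
A minimal sketch of how this can fit together (the name process_one and the toy matrix sizes are made up for illustration, NumPy stands in for the real matrices, and it relies on the children inheriting matrix_lookup through fork, i.e. on platforms such as Linux where Pool forks):

import numpy as np
from multiprocessing import Pool

# Toy stand-ins for the five big matrices, created once in the parent.
matrix_lookup = {
    1: np.ones((1000, 1000)),
    2: np.eye(1000),
}

def process_one(args):
    # Only the small (vector, key) tuple is pickled and sent to the worker;
    # the matrix itself is reached through the module-level dict that the
    # forked child inherited from the parent.
    vector, matrix_key = args
    return matrix_lookup[matrix_key].dot(vector)

if __name__ == '__main__':
    work = [(np.random.rand(1000), 1) for _ in range(100)]
    work += [(np.random.rand(1000), 2) for _ in range(100)]
    with Pool(6) as pool:
        results = pool.map(process_one, work)
    print(len(results))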
Björn Lindqvist
  • This sounds like a good idea. One question, though: will the dreaded *copy-on-write* curse eventually force a copy of each large matrix in each process? I have read that Python stores an object's reference count alongside the object itself, so merely referring to an object increments its count and thereby copies the object into the forked process (under most operating systems; Windows is worse, if I understand correctly). – Eric O. Lebigot Apr 21 '12 at 03:24
  • Yes, the dreaded copy on write will do that. However, having 5*(P+1) matrices in memory (where P is the number of processes in the pool) is significantly less than one matrix for each item in the list passed to map(). – Björn Lindqvist Apr 22 '12 at 18:44
  • Thanks for the confirmation about copy-on-write. I'm not sure I understand why one matrix per item in the list passed to map() would use "significantly more" memory than having 5*(P+1) matrices in memory, though. In fact, the memory footprint of the list is essentially the size of the *5 matrices*, which is *less* than the size of 5*(P+1) matrices (the reason is that Python internally uses pointers to the 5 matrices). – Eric O. Lebigot Apr 23 '12 at 03:57

I hit a similar issue: parallelizing calculations on a big dataset. As you mentioned, multiprocessing.Pool.map() pickles the arguments. What I did was to implement my own fork() wrapper that only pickles the return values back to the parent process, hence avoiding pickling the arguments, and then built a parallel map() on top of that wrapper.
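
A rough sketch of what such a wrapper can look like (this is not the actual implementation; the fork_map name and the stride-based chunking are made up for illustration, and it is Unix-only since it relies on os.fork()):

import os
import pickle

def fork_map(func, items, chunks=4):
    # Fork one child per chunk. Each child inherits `items` (and any big
    # objects they reference) via copy-on-write instead of pickling, and
    # only its *results* are pickled back to the parent through a pipe.
    slices = [items[i::chunks] for i in range(chunks)]
    pipes, pids = [], []
    for chunk in slices:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                      # child process
            os.close(r)
            with os.fdopen(w, 'wb') as out:
                pickle.dump([func(x) for x in chunk], out)
            os._exit(0)                   # skip normal interpreter teardown
        os.close(w)                       # parent keeps only the read end
        pipes.append(r)
        pids.append(pid)
    chunk_results = []
    for r, pid in zip(pipes, pids):
        with os.fdopen(r, 'rb') as inp:
            chunk_results.append(pickle.load(inp))
        os.waitpid(pid, 0)
    # Put the results back in the original order (the chunks were strided).
    results = [None] * len(items)
    for i, chunk in enumerate(chunk_results):
        results[i::chunks] = chunk
    return results

if __name__ == '__main__':
    print(fork_map(lambda x: x * x, list(range(10))))

Because the children are forked, func and items are inherited through copy-on-write rather than pickled; only the (presumably small) results ever go through pickle.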

Maxim Egorushkin
  • Did all your processes end up duplicating your big dataset? If I understand correctly, you had *one* big dataset: the question is about having a few big sets (there are already solutions available for a single big dataset at http://stackoverflow.com/q/659865/1132524; you might even want to add yours). – Eric O. Lebigot Apr 21 '12 at 03:33
  • PS: Here is a reference to this "copy on write" memory duplication for read-only objects: http://stackoverflow.com/a/660026/42973. – Eric O. Lebigot Apr 21 '12 at 03:48
  • That's a good point about python reference counts causing copying pages in the `fork`ed child. – Maxim Egorushkin Apr 21 '12 at 05:23