python multiprocessing memory overflow

Question

In brief, I have a (1000 x 500000) matrix that I need to sort row by row. Ideally, parallel processing should work, but Python's multiprocessing module seems to make a copy of the entire matrix each time a process is spawned, which overflows the RAM. How do I tackle this issue?

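from multiprocessing import Pool

# y is the 1000 x 500000 matrix, a module-level list of rows (not shown here)
def sort_parallel(n):
    # sort row n in place by the second field of each item, largest first
    y[n].sort(key=lambda item: -item[1])

if __name__ == '__main__':
    pool = Pool(processes=2)
    # map calls sort_parallel once per row index; apply would pass all indices to a single call
    pool.map(sort_parallel, range(len(y)))
    pool.close()
    pool.join()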

Following similar questions, I have tried map, map_async, and apply_async with no progress. The fundamental problem seems to be that the lists are copied for each process, which floods the RAM. This could possibly be prevented by keeping the operation read-only, but since I am sorting in place, that doesn't help me. I also tried sorted() instead of sort(), with still no solution in sight.

You need to sort each row? But yes, multiprocessing will copy the data. It has to. — juanpa.arrivillaga, Aug 24 '18 at 22:44
Yes, I need to sort each row individually: in theory, 1000 parallel operations independent of each other. Please suggest the most efficient way to achieve this in Python. — Ajay V, Aug 24 '18 at 22:52
If you use a numpy array I'm able to sort it in about 20-30 seconds — juanpa.arrivillaga, Aug 24 '18 at 22:58
Thanks @juanpa.arrivillaga. I have batches of these 1000 x 500,000 matrices, hence I am probing whether I can introduce some parallelism to hopefully make it perform better. — Ajay V, Aug 24 '18 at 23:13

score 0 · Answer 1 · answered Aug 24 '18 at 23:40

I believe that in this case Python's multiprocessing module copies every variable the function refers to, here the whole matrix y, into each worker process.

There are several ways around that, but you need to pass each row (or the index of the row) to each worker process to avoid copying the whole array. In conjunction, you will need a mechanism for storing or modifying the whole array once each process has finished sorting. You can either use a file, or refer to this question for a shared object in multiprocessing.
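As a rough illustration of the "pass each row" idea, here is a minimal sketch. It assumes the matrix is a list of rows whose items are pairs, as the key in the question suggests. The names sort_row, sort_rows_in_parallel, and demo are made up for this example, and it has not been measured on data of the size in the question.

```python
from multiprocessing import Pool


def sort_row(row):
    # Sort one row by the second field of each item, largest first
    # (same key as in the question). Returns a new list.
    return sorted(row, key=lambda item: -item[1])


def sort_rows_in_parallel(matrix, processes=2, chunksize=1):
    # Rows are sent to the workers as task arguments, so the workers
    # never need to reference the whole matrix as a global.
    with Pool(processes=processes) as pool:
        # imap yields results in input order, so index i matches row i.
        for i, sorted_row in enumerate(pool.imap(sort_row, matrix, chunksize)):
            matrix[i] = sorted_row  # store the result back in the parent
    return matrix


if __name__ == "__main__":
    # Tiny stand-in for the real 1000 x 500000 data.
    demo = [[(0, 3.0), (1, 1.0), (2, 2.0)], [(0, 0.5), (1, 2.5)]]
    sort_rows_in_parallel(demo)
    print(demo)
```

The worker returns the sorted row instead of sorting in place, because changes made inside a worker process are not visible to the parent. Each row is still pickled on its way to the worker and back, so whether this beats a single-process sort depends on the data and is worth timing before relying on it.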
