
I'm calling a function with a large memory overhead on a list of N files. The overhead stems from several factors that cannot be resolved without modifying the function itself, but I have worked around the leak using the multiprocessing module: by creating a subprocess for each of the N files and then calling pool.close(), the memory used by the function is released with minimal overhead. The following example shows what I have so far:

import multiprocessing as mp

def my_function(n):
    # do_something and N are defined elsewhere; do_something is the
    # memory-hungry call being isolated in a subprocess.
    do_something(file=n)


if __name__ == '__main__':

    # Create a fresh single-worker pool for each of the N files, so the
    # worker's memory is returned to the OS after every file.
    for n in range(N):
        pool = mp.Pool(processes=1)
        results = pool.map(my_function, [n])
        pool.close()
        pool.join()

This does exactly what I want: with processes=1, the pool runs one file at a time for all N files. After each file n, pool.close() and pool.join() shut down the worker process, and its memory is released back to the OS. Previously I used a plain for loop with no multiprocessing, and the memory accumulated until my system crashed.

My questions are:

  1. Is this the correct way to implement this?
  2. Is there a better way to implement this?
  3. Is there a way to run more than one process at a time (processes>1), and still have the memory released after each n?

I'm just learning the multiprocessing module. I've found many multiprocessing examples here, but none specific to this problem. I'd appreciate any help.


1 Answer


Is this the correct way to implement this?

"Correct" is a value judgement in this case. One could consider this either a bodge or a clever hack.

Is there a better way to implement this?

Yes. Fix my_function so that it doesn't leak memory. If a Python function is leaking lots of memory, chances are you are doing something wrong.
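A common culprit is state that quietly grows across calls, for example a module-level container that keeps a reference to every result. The sketch below is a hypothetical illustration of that pattern (load and summarize are made-up helpers), not code from the question:

_cache = []  # module-level state survives across calls

def leaky(filename):
    data = load(filename)      # hypothetical loader
    _cache.append(data)        # every call grows the list: a "leak"
    return summarize(data)     # hypothetical processing

def fixed(filename):
    # No lingering references: the loaded data becomes collectable
    # as soon as this function returns.
    return summarize(load(filename))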

Is there a way to run more than one process at a time (processes>1), and still have the memory released after each n?

Yes. Use the maxtasksperchild argument when creating a Pool.
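As a minimal sketch of that suggestion, reusing my_function and N from the question (the choice of processes=4 is arbitrary): with maxtasksperchild=1 the pool replaces each worker process after it completes a single task, so each worker's memory is returned to the OS even while several files run in parallel. Passing chunksize=1 to pool.map() ensures each task covers exactly one file.

import multiprocessing as mp

if __name__ == '__main__':
    # Four workers run concurrently; each worker is recycled after
    # finishing one task, releasing its memory after every file.
    with mp.Pool(processes=4, maxtasksperchild=1) as pool:
        results = pool.map(my_function, range(N), chunksize=1)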

Roland Smith
  • This worked! I also had to set `chunksize=1` for `pool.map()`. Fixing the function is not feasible: the leak comes from a longstanding matplotlib issue in which retained references cannot be freed by garbage collection, so producing a large number of plots leaks memory. A similar report of the leak, and a solution, can be found here: https://stackoverflow.com/questions/7125710/matplotlib-errors-result-in-a-memory-leak-how-can-i-free-up-that-memory/7125856 Thank you! – maelstromscientist Apr 18 '20 at 04:54
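For reference, the standard fix for this class of matplotlib leak is to close each figure explicitly once it has been saved, so matplotlib drops its internal reference to it. A minimal sketch, assuming a non-interactive backend and a hypothetical plot_file helper:

import matplotlib
matplotlib.use('Agg')           # non-interactive backend, no GUI state
import matplotlib.pyplot as plt

def plot_file(filename):
    fig, ax = plt.subplots()
    ax.plot(range(10))          # stand-in for the real plotting
    fig.savefig(filename + '.png')
    plt.close(fig)              # release the figure's memory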