
I have an example Python function below that takes in a variable, performs a simple mathematical operation on it, and returns the result.

If I parallelise this function (to better reflect the operation I want to do in real life) and run the parallelised function 10 times, my IDE shows that memory usage increases despite the del results line.

import multiprocessing as mp
import numpy as np
from tqdm import tqdm

def function(x):
    return x * 2

test_array = np.arange(0, 1e4, 1)

for i in range(10):

    pool = mp.Pool(processes=4)
    results = list(tqdm(pool.imap(function, test_array), total=len(test_array)))
    results = [x for x in results if str(x) != 'nan']

    del results

I have a few questions I would be grateful to know the answers to:

  • Is there a way to prevent this memory increase?
  • Is this memory growth due to the parallelisation process?
user8188120
  • Can you attempt to print out the memory usage before and after the delete statement? This answer shows how to do this: https://stackoverflow.com/questions/938733/total-memory-used-by-python-process – Brand0R Oct 09 '19 at 15:11
  • No problem. For the exact example above: before: 163520512 after: 164524032 in bytes – user8188120 Oct 09 '19 at 15:43

2 Answers


I haven't tried this out, but I'm quite sure you don't need to define

pool = mp.Pool(processes=4)

within the loop. You're starting up 10 instances of the pool for no reason. Maybe try moving that out and seeing if your memory usage decreases?

If that doesn't help, consider restructuring your code to utilize yield instead to prevent your memory from filling up.
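
To illustrate both suggestions, here is a minimal sketch (my restructuring of the question's code, not something this answer itself provides): the pool is created once as a context manager, which terminates it on exit (the original loop never shuts its pools down), and imap is consumed lazily so the full results list never sits in memory.

import multiprocessing as mp
import numpy as np
from tqdm import tqdm

def function(x):
    return x * 2

test_array = np.arange(0, 1e4, 1)

if __name__ == '__main__':
    # Create the pool once; the with-block terminates it on exit.
    with mp.Pool(processes=4) as pool:
        for i in range(10):
            # Iterate over imap lazily instead of building a list,
            # so each result can be garbage-collected after use.
            for x in tqdm(pool.imap(function, test_array), total=len(test_array)):
                if str(x) != 'nan':
                    pass  # replace with the real per-result work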

Zhi Yong Lee
  • I moved it outside of the loop and that didn't prevent the memory filling up unfortunately, but it was worth a try! I don't think yield would work, purely because I'm not outputting iterative values; rather, I'll be generating uncorrelated multidimensional arrays out of each process, if that makes sense. – user8188120 Oct 09 '19 at 15:46

Each new process that pool.imap creates needs to receive some information about the function and the element it applies the function to. This information is copied, so each task causes memory to be used for those copies.

If you want to reduce it, you might want to look at the chunksize argument of pool.imap.
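
For example, a sketch of what that might look like, reusing the names from the question's code (the chunksize of 256 is an arbitrary illustrative value; the best setting depends on the workload):

# Larger chunks mean fewer inter-process transfers per element;
# 256 is only an illustrative value, not a recommendation.
results = list(tqdm(pool.imap(function, test_array, chunksize=256),
                    total=len(test_array)))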

Another way would be to rely on functions from numpy. You might already know this, but you could just do results = test_array * 2. I don't know what your real-life example looks like, but you might not need Python's pool at all.
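
As a sketch, the fully vectorised version, including a numpy equivalent of the question's NaN filter (this assumes the real operation can be expressed in array terms, and reuses np and test_array from the question):

# No pool and no per-element pickling: numpy applies the operation in C.
results = test_array * 2
# numpy equivalent of the str(x) != 'nan' filter in the question:
results = results[~np.isnan(results)]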

Also, if you intend to write genuinely fast code, don't use tqdm. It is nice, and if you need it, you need it, but it will slow your code down.

Hielke Walinga
  • Thanks for the reply! I'll have a look at the docs to see what I can do with chunksize. As for the function itself, it's actually a class a few hundred lines long rather than the simple example given here, unfortunately. The only reason I'd use pool is to speed up the function output, but if removing the pool session means the function-information overhead is only created once, then that could be a good way to save memory. – user8188120 Oct 09 '19 at 15:40