
I am trying to reduce the memory requirements of my Python 3 code. Right now each iteration of the for loop requires more memory than the previous one.

I wrote a small piece of code that has the same behaviour as my project:

import numpy as np
from multiprocessing import Pool
from itertools import repeat


def simulation(steps, y):  # the function that starts the parallel execution of f()
    pool = Pool(processes=8, maxtasksperchild=int(steps/8))
    results = pool.starmap(f, zip(range(steps), repeat(y)), chunksize=int(steps/8))
    pool.close()
    return results


def f(steps, y):  # steps is used as a counter. My code doesn't need it.
    a, b = np.random.random(2)
    return y*a, y*b

def main():
    steps = 2**20  # amount of times a random sample is taken
    y = np.ones(5)  # dummy variable to show that the next iteration of the code depends on the previous one
    total_results = np.zeros((0,2))
    for i in range(5):
        results = simulation(steps, y[i-1])
        y[i] = results[0][0]
        total_results = np.vstack((total_results, results))

    print(total_results, y)

if __name__ == "__main__":
    main()

For each iteration of the for loop the threads in simulation() each have a memory usage equal to the total memory used by my code.

Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?

Ideally I would want each process to copy only the memory it requires to execute f(), while letting me keep the results in memory.

Isea
  • yes it clones the entire context of the program – Joran Beasley Mar 17 '16 at 17:25
  • You should guard those main operations with `if __name__ == '__main__':` at least. – Ilja Everilä Mar 17 '16 at 17:26
  • @snakecharmerb results[0][0] is just a float. – Isea Mar 17 '16 at 17:28
  • @Ilja Yes, good point. I do this in my real code but I figured for this example it wasn't required. I'll insert it (but it doesn't change the behaviour). – Isea Mar 17 '16 at 17:29
  • What do you btw exactly mean by "the threads (processes) each have a memory usage equal to the total memory used by my code"? You're not mixing forking's [COW](https://en.wikipedia.org/wiki/Copy-on-write) behaviour with actual mem usage perhaps? Which somewhat means "only copy the memory it requires to execute f()". – Ilja Everilä Mar 17 '16 at 17:50
  • How are you measuring the memory consumption? – Padraic Cunningham Mar 17 '16 at 17:52
  • Been running the test script with larger and larger for loop iteration counts, but my memory consumption is pretty steady. Fluctuates quite a bit, but doesn't rise. – Ilja Everilä Mar 17 '16 at 17:57
  • @Ilja I mean that, for example, if my code has 3% memory usage when it calls simulation(), each thread (as shown on htop) will start with 3% real memory (not virtual). Then on the next iteration my code will use more than 3% memory per thread because it also seems to copy the results and results_total array when creating new threads. The code doesn't seem to implement COW behaviour (if I understand it correctly) because each thread seems to take up actual physical memory the moment it starts execution. – Isea Mar 17 '16 at 17:58
  • @PadraicCunningham I check htop's RES column. – Isea Mar 17 '16 at 17:59
  • @Isea after your edit I can observe the behaviour you describe. – Ilja Everilä Mar 17 '16 at 18:00
  • Still at it, one thing came to mind: example code is example code, but in this case the whole multiprocess mapping simulation step could be replaced with `np.random.random((steps, 2)) * y[i - 1]`. Have you considered vectorization of the actual problem more thoroughly? – Ilja Everilä Mar 17 '16 at 20:00
  • @Ilja Great to hear you're working on it! My real code involves constructing a matrix with the random values I generate and then calling np.linalg.eigh() on this matrix. Since the matrix needs to be generated for each set of random values I cannot use the solution you proposed. – Isea Mar 17 '16 at 20:04
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/106650/discussion-between-ilja-and-isea). – Ilja Everilä Mar 17 '16 at 20:06

2 Answers


Though the script does use quite a bit of memory even with the "smaller" example values, the answer to

Does Python clone my entire environment each time the parallel processes are run, including the variables not required by f()? How can I prevent this behaviour?

is that it does, in a sense, clone the environment when forking a new process, but if copy-on-write (COW) semantics are available, no actual physical memory needs to be copied until it is written to. For example, on this system

 % uname -a 
Linux mypc 4.2.0-27-generic #32-Ubuntu SMP Fri Jan 22 04:49:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

COW seems to be available and in use, but this may not be the case on other systems. On Windows the situation is fundamentally different: a new Python interpreter is spawned as a fresh process instead of being forked. Since you mention using htop, you're on some flavour of UNIX or a UNIX-like system, so you get COW semantics.
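As a quick check, here is a minimal sketch (assuming Python 3.4+, where multiprocessing.get_context() was added) that reports which start method is in use; 'fork' gives you COW, while 'spawn' starts a fresh interpreter:

import multiprocessing as mp

if __name__ == '__main__':
    # 'fork' on most UNIX-like systems (COW applies),
    # 'spawn' on Windows (fresh interpreter; arguments are pickled over IPC)
    print(mp.get_start_method())
    # Forcing 'spawn' on a UNIX-like system mimics the Windows behaviour:
    # ctx = mp.get_context('spawn')
    # pool = ctx.Pool(processes=8)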

For each iteration of the for loop the processes in simulation() each have a memory usage equal to the total memory used by my code.

The spawned processes will display almost identical RSS values, but this can be misleading: as long as no writes occur, they mostly occupy the same physical memory, mapped into all of the processes. With Pool.map the story is a bit more complicated, since it "chops the iterable into a number of chunks which it submits to the process pool as separate tasks". That submission happens over IPC, so the submitted data is copied. In your example the IPC and the 2**20 function calls also dominate the CPU usage; replacing the mapping with a single vectorized multiplication in simulation() took the script's runtime from around 150 s to 0.66 s on this machine.
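For illustration, here is a sketch of that vectorized replacement (my substitution, not code from the question); it draws all the samples in one call, so there is no pool, no IPC, and no per-call overhead:

def simulation_vectorized(steps, y):
    # One allocation and one multiplication replace 2**20 worker calls;
    # nothing is forked and nothing is pickled over IPC.
    return np.random.random((steps, 2)) * y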

We can observe COW with a (somewhat) simplified example that allocates a large array and passes it to a spawned process for read-only processing:

import numpy as np
from multiprocessing import Process, Condition, Event
from time import sleep
import psutil


def read_arr(arr, done, stop):
    # Read the array without writing to it, then signal the parent
    with done:
        S = np.sum(arr)
        print(S)
        done.notify()
    # Stay alive so the parent can measure memory while the mapping is held
    while not stop.is_set():
        sleep(1)


def main():
    # Create a large array
    print('Available before A (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    A = np.random.random(2**28)  # a ~2 GiB array of float64
    print('Available before Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    done = Condition()
    stop = Event()
    p = Process(target=read_arr, args=(A, done, stop))
    with done:
        p.start()
        done.wait()  # block until the child has finished reading the array
    print('Available with Process (MiB):', psutil.virtual_memory().available / 1024 ** 2)
    input("Press Enter...")
    stop.set()
    p.join()

if __name__ == '__main__':
    main()

Output on this machine:

 % python3 test.py
Available before A (MiB): 7779.25
Press Enter...
Available before Process (MiB): 5726.125
Press Enter...
134221579.355
Available with Process (MiB): 5720.79296875
Press Enter...

Now if we replace the function read_arr with a function that modifies the array:

def mutate_arr(arr, done, stop):
    with done:
        arr[::4096] = 1  # writing dirties pages, forcing actual copies
        S = np.sum(arr)
        print(S)
        done.notify()
    while not stop.is_set():
        sleep(1)

the results are quite different:

Available before A (MiB): 7626.12109375
Press Enter...
Available before Process (MiB): 5571.82421875
Press Enter...
134247509.654
Available with Process (MiB): 3518.453125
Press Enter...

The for-loop does indeed require more memory after each iteration, but that's expected: np.vstack stacks the results from the mapping onto total_results, so it must allocate a new array large enough to hold both the old results and the new ones, and only then free the now-unused old array.
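If the repeated reallocation itself becomes a problem, one option (a sketch, assuming the number of iterations and steps are fixed up front as in the example) is to preallocate total_results once and assign each batch into its slice:

steps, iterations = 2**20, 5
y = np.ones(iterations)
total_results = np.empty((iterations * steps, 2))  # allocated once
for i in range(iterations):
    results = simulation(steps, y[i - 1])
    y[i] = results[0][0]
    # fill the i-th block in place instead of re-stacking the whole array
    total_results[i * steps:(i + 1) * steps] = results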

Ilja Everilä

You should know the difference between a thread and a process in an operating system; see "What is the difference between a process and a thread?".

In the for loop there are processes, not threads. Threads share the address space of the process that created them; processes each have their own address space.

You can print the process ID with os.getpid().
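A minimal sketch (the worker function whoami is just illustrative) that makes this visible; each Pool worker reports a distinct PID:

import os
from multiprocessing import Pool


def whoami(_):
    return os.getpid()  # each worker process reports its own PID


if __name__ == '__main__':
    with Pool(processes=4) as pool:
        print(set(pool.map(whoami, range(8))))  # a set of distinct worker PIDs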

GoingMyWay
  • It's not so black and white, as I pointed out. Processes may share physical memory, though they have their own address spaces. – Ilja Everilä Mar 18 '16 at 07:53