
From this question and its answers, I think I understand why this Python code:

import time
import multiprocessing

big_list = [
    {j: 0 for j in range(200000)}
    for i in range(60)
]

def worker():
    for dic in big_list:
        for key in dic:
            pass
        print "."
        time.sleep(0.2)

w = multiprocessing.Process(target=worker)
w.start()

time.sleep(3600)

keeps using more and more memory during its execution: it's because the child process updates the reference counts of shared-memory objects in its loop, triggering the "copy-on-write" mechanism (I can watch the free memory diminishing via `cat /proc/meminfo | grep MemFree`).

What I don't understand, however, is why the same thing happens if the iteration takes place in the parent rather than in the child:

def worker():
    time.sleep(3600)

w = multiprocessing.Process(target=worker)
w.start()

for dic in big_list:
    for key in dic:
        pass
    print "."
    time.sleep(0.2)

The child doesn't even need to know about the existence of big_list.

In this small example I can solve the problem by putting del big_list in the child function, but sometimes references to variables are not accessible like this, so things get complicated.

Why does this mechanism kick in, and how can I avoid it properly?

fspot
  • Your results and question may be OS (Unix/Linux/OSX) dependent. They certainly aren't coded properly for Windows (no `if __name__ == '__main__':`, see [**Safe importing of main module**](https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods) in the docs). – martineau May 28 '17 at 22:09

1 Answer


After a fork(), both parent and child "see" the same address space. The first time either changes the memory at a common address, the copy-on-write (COW) mechanism has to clone the page containing that address. So, for purposes of creating COW pages, it doesn't matter whether the mutations occur in the child or in the parent.
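To make this concrete, here is a minimal CPython sketch (Python 3 syntax; `sys.getrefcount` is CPython-specific) showing that an object's reference count lives inside the object itself, so even creating new references to it writes to the page that holds it:

```python
import sys

# In CPython, every object stores its reference count in its own header.
# Creating (or temporarily holding) references therefore writes to the
# memory page containing the object -- enough to trigger COW after fork().
d = {i: 0 for i in range(5)}

before = sys.getrefcount(d)   # includes the temporary argument reference
aliases = [d] * 3             # three new references to the same dict
after = sys.getrefcount(d)

print(after - before)         # -> 3: d's header was mutated in place
```

The same header write happens when the iteration loop in the question binds each dict to a loop variable, which is why a "read-only" pass over big_list still dirties its pages.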

In your second code snippet, you left out the most important part: exactly where big_list was created. Since you said you can get away with del big_list in the child, big_list probably existed before you forked the worker process. If so, then - as above - it doesn't really matter to your symptom whether big_list is modified in the parent or the child.

To avoid this, create big_list after creating your child process. Then the address space it lives in won't be shared. Or, in Python 3.4 or later, use multiprocessing.set_start_method('spawn'). Then fork() won't be used to create child processes, and no address space is shared at all (which is always the case on Windows, which doesn't have fork()).
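In case it helps, a minimal Python 3 sketch of the 'spawn' approach (the worker function and queue here are just for illustration, not from the question's code):

```python
import multiprocessing as mp

def worker(q):
    # Under 'spawn' the child is a fresh interpreter: it re-imports this
    # module instead of inheriting the parent's address space, so nothing
    # built in the parent before start() is shared (and no COW happens).
    q.put("child done")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")          # Python 3.4+; avoids fork()
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    result = q.get()                       # blocks until the child runs
    p.join()
    print(result)
```

Note the `if __name__ == '__main__':` guard: it is mandatory with 'spawn' (as on Windows), because the child re-imports the main module.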

Tim Peters
  • Thank you for the explanation! I tried to use [billiard](https://github.com/celery/billiard/) to get `set_start_method('spawn')` in Python 2, but using billiard.Queue made the communication between processes extremely slow, thus useless for my use case. I ended up doing what you suggested: fork earlier (even if in my case it may be something like one hour before the real usage). – fspot May 30 '17 at 18:43