
I was tracking down an out-of-memory bug and was horrified to find that Python's multiprocessing appears to copy large arrays, even if I have no intention of using them.

Why is Python (on Linux) doing this? I thought copy-on-write would protect me from any extra copying: I imagined that only when I reference the object would some kind of trap be invoked, and only then would the copy be made.

Is the correct way to solve this problem for an arbitrary data type, like a 30-gigabyte custom dictionary, to use a Monitor? Is there some way to build Python so that it doesn't have this nonsense?

import numpy as np
import psutil
from multiprocessing import Process

mem = psutil.virtual_memory()
large_amount = int(0.75 * mem.available)

def florp():
    print("florp")

def bigdata():
    return np.ones(large_amount, dtype=np.int8)

if __name__ == '__main__':
    foo = bigdata()  # Allocated 0.75 of the RAM, no problems
    p = Process(target=florp)
    p.start()  # Out of memory because bigdata is copied?
    print("Wow")
    p.join()

Running:

[ebuild   R    ] dev-lang/python-3.4.1:3.4::gentoo  USE="gdbm ipv6 ncurses readline ssl threads xml -build -examples -hardened -sqlite -tk -wininst" 0 KiB
Mikhail
  • Python (or, rather, CPython) uses a reference counter embedded in the object. Whenever an object is passed to a function, its reference counter is incremented, causing a modification to the object and thus a page fault in the child process. I wouldn't say it explains your particular example above though. Still, consider using multithreading instead of multiprocessing. – Ulrich Eckhardt Jun 21 '15 at 10:49
  • @UlrichEckhardt But I didn't pass anything to `florp`! – Mikhail Jun 21 '15 at 10:49
  • You're right, I hit enter too early; see the last two sentences which I appended. BTW: I can't reproduce these issues here, using Python 3.4.2 on Linux/x86_64. – Ulrich Eckhardt Jun 21 '15 at 10:54
  • Have you tried to do the same in C: `malloc()` a large amount, call `fork()` and see what happens? Try a different [start method: 'forkserver' or 'spawn'](https://docs.python.org/3/library/multiprocessing.html). Related: [How to avoid \[Errno 12\] Cannot allocate memory errors caused by using subprocess module](http://stackoverflow.com/q/20111242/4279). Look at this [code example](https://gist.github.com/zed/7637011) and [this answer](http://stackoverflow.com/a/13329386/4279) – jfs Jun 22 '15 at 12:54

2 Answers

2

I'd expect this behavior: when you hand Python code to compile, anything that isn't guarded behind a function or object is executed immediately for evaluation.

In your case, bigdata = np.ones(large_amount, dtype=np.int8) has to be evaluated; unless your actual code behaves differently, the fact that florp() is never called has nothing to do with it.

To see an immediate example:

>>> f = 0/0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: integer division or modulo by zero
>>> def f():
...     return 0/0
...
>>>

To apply this to your code, put bigdata = np.ones(large_amount, dtype=np.int8) behind a function and call it as you need it. Otherwise, Python is trying to be helpful by having that variable available to you at runtime.

If bigdata doesn't change, you could write a function that gets or sets it on an object that you keep around for the duration of the process.
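A minimal sketch of that lazy, cached approach (the names here are illustrative, not from the question):

import numpy as np

_bigdata_cache = None  # module-level cache, filled on first use

def get_bigdata(large_amount):
    # Build the array only when it is first requested, then reuse it.
    global _bigdata_cache
    if _bigdata_cache is None:
        _bigdata_cache = np.ones(large_amount, dtype=np.int8)
    return _bigdata_cache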

Edit: coffee just started working. When you make a new process, Python will need to copy all objects into that new process for access. You can avoid this by using threads, or by a mechanism that lets you share memory between processes, such as shared memory maps or shared ctypes.
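For the shared-memory route, here is a minimal sketch using multiprocessing.Array (shared ctypes); the size and typecode are illustrative, assuming int8 data as in the question:

import numpy as np
from multiprocessing import Process, Array

def worker(shared):
    # Re-wrap the shared buffer as a NumPy array; this is a view, not a copy.
    data = np.frombuffer(shared.get_obj(), dtype=np.int8)
    print(data[:5])

if __name__ == '__main__':
    shared = Array('b', 10**6)  # 'b' = signed char, i.e. int8
    np.frombuffer(shared.get_obj(), dtype=np.int8)[:] = 1
    p = Process(target=worker, args=(shared,))
    p.start()
    p.join()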

  • I tried wrapping the array in a function call, but no dice. Or did you have something else in mind? See edit. – Mikhail Jun 21 '15 at 11:23
  • I'd have to see what you're actually doing then. I assume `florp()` does something more than just print in your actual code. Oh actually, I just thought of an important distinction about processes, give me a second to add it to my answer. –  Jun 21 '15 at 11:27
1

The problem was that, by default, Linux accounts for the worst-case memory usage of the forked child, which can indeed exceed memory capacity. This is true even though the Python code never touches those variables in the child. You need to allow memory overcommit system-wide to get the expected COW behavior:

sysctl vm.overcommit_memory=1

See https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
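If you want to confirm the current setting from Python before forking, a small sketch (Linux only, reads the procfs knob directly):

# 0 = heuristic (default), 1 = always overcommit, 2 = strict accounting
with open('/proc/sys/vm/overcommit_memory') as f:
    print('vm.overcommit_memory =', int(f.read()))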

Mikhail