
I used this script (see code at the end) to assess whether a global object is shared or copied when the parent process is forked.

Briefly, the script creates a global data object, and each child process performs an operation on data. The script also monitors free system memory to assess whether the object was copied in the child processes.

Here are the results:

  1. data = np.ones((N,N)). Operation in the child process: data.sum(). Result: data is shared (no copy)
  2. data = list(range(pow(10, 8))). Operation in the child process: sum(data). Result: data is copied.
  3. data = list(range(pow(10, 8))). Operation in the child process: for x in data: pass. Result: data is copied.

Result 1) is expected because of copy-on-write. I am a bit puzzled by results 2) and 3): why is data copied?
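
For reference, cases 2) and 3) correspond to swapping the following into the script below; everything else (free_memory, the process setup, the logging) stays the same, and the variable name in the worker body is only illustrative:

# Cases 2) and 3): plain Python list instead of the numpy array
data = list(range(pow(10, 8)))    # ~10**8 distinct int objects

def worker(i):
    total = sum(data)             # case 2): sum(data)
    # for x in data:              # case 3): bare iteration
    #     pass
    logger.warn('Free memory: {m:.1f} GB'.format(m = free_memory() / 2**20))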


Script

import multiprocessing as mp
import numpy as np
import logging
import os

logger = mp.log_to_stderr(logging.WARNING)

def free_memory():
    # Sum MemFree + Buffers + Cached (all reported in kB) from /proc/meminfo
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u = unit))
                total += amount
    return total

def worker(i):
    x = data.sum()    # Exercise access to data
    logger.warn('Free memory: {m:.1f} GB'.format(m = free_memory() / 2**20))

def main():
    procs = [mp.Process(target = worker, args = (i, )) for i in range(4)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()

logger.warn('Initial free: {m:.1f} GB'.format(m = free_memory() / 2**20))
N = 15000
data = np.ones((N, N))    # created at module level so forked children inherit it
logger.warn('After allocating data: {m:.1f} GB'.format(m = free_memory() / 2**20))

if __name__ == '__main__':
    main()

Detailed results

Run 1 output

[WARNING/MainProcess] Initial free: 25.1 GB
[WARNING/MainProcess] After allocating data: 23.3 GB
[WARNING/Process-2] Free memory: 23.3 GB
[WARNING/Process-4] Free memory: 23.3 GB
[WARNING/Process-1] Free memory: 23.3 GB
[WARNING/Process-3] Free memory: 23.3 GB

Run 2 output

[WARNING/MainProcess] Initial free: 25.1 GB
[WARNING/MainProcess] After allocating data: 21.9 GB
[WARNING/Process-2] Free memory: 12.6 GB
[WARNING/Process-4] Free memory: 12.7 GB
[WARNING/Process-1] Free memory: 16.3 GB
[WARNING/Process-3] Free memory: 17.1 GB

Run 3 output

[WARNING/MainProcess] Initial free: 25.1 GB
[WARNING/MainProcess] After allocating data: 21.9 GB
[WARNING/Process-2] Free memory: 12.6 GB
[WARNING/Process-4] Free memory: 13.1 GB
[WARNING/Process-1] Free memory: 14.6 GB
[WARNING/Process-3] Free memory: 19.3 GB

asked by usual me
1 Answer


They're all copy-on-write. What you're missing is that when you do, e.g.,

for x in data:
    pass

the reference count on every object contained in data is temporarily incremented by 1, one at a time, as x is bound to each object in turn. For int objects, the refcount in CPython lives in the object's header, so binding x writes to the memory page holding that object, and copy-on-write then duplicates the page (you did mutate the object, because its refcount changed).
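
The refcount churn is easy to observe directly with sys.getrefcount (a real CPython function; the exact counts below are illustrative and can differ slightly between interpreter versions):

import sys

data = list(range(1000, 1010))    # ints above the small-int cache, so each is a distinct object
obj = data[5]
print(sys.getrefcount(obj))       # referenced by the list, by 'obj', and by getrefcount's argument

for x in data:
    if x is obj:
        # while 'x' is bound to this object, its refcount is one higher --
        # in a forked child, that write alone dirties the page holding the object
        print(sys.getrefcount(obj))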

To make something more analogous to the numpy.ones case, try, e.g.,

data = [1] * 10**8

Then there's only a single unique object referenced many (10**8) times by the list, so there's very little to copy (the same object's refcount gets incremented and decremented many times).
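
Scaled down, the difference is easy to verify; the id() comparisons are the point here, not the sizes:

shared = [1] * 10                      # every slot holds a reference to the same int object
distinct = list(range(1000, 1010))     # ten separate int objects

print(len(set(map(id, shared))))       # 1  -- one object, so only one refcount field ever changes
print(len(set(map(id, distinct))))     # 10 -- iterating touches ten different objects' headers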

answered by Tim Peters
  • Ok, so I cannot iterate or do a simple look-up without triggering copies. Doesn't that make copy-on-write close to useless? – usual me Jun 23 '16 at 01:59
  • COW (copy-on-write) is an OS concept that Python inherits from `fork()` on platforms supporting that function. COW wasn't designed with Python's multiprocessing (mp) in mind, and Python's mp wasn't designed with COW in mind ;-) COW is a platform-specific thing that may or may not be helpful, depending on the application. Note that for many object types in CPython, the refcount is _not_ stored "with the bulk of the object's data". In any case, COW was originally invented to implement exec-after-fork, where the advantage is that the vast bulk of the parent process data is never referenced. – Tim Peters Jun 23 '16 at 02:07
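
To make the last comment concrete for the numpy case: the refcount sits in the small array object header, not in the big data buffer, so the children's refcount writes never touch the 1.8 GB of ones (treating id() as an address is a CPython implementation detail):

import numpy as np

a = np.ones((15000, 15000))
print(id(a))            # CPython: address of the array object header, where ob_refcnt lives
print(a.ctypes.data)    # address of the separately allocated data buffer
print(a.nbytes)         # 1800000000 bytes that a refcount change never writes to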