
My impression of Python's multiprocessing module is that when you create a new process with multiprocessing.Process(), it makes an entire copy of your current program in memory and continues working from there. With that in mind, I'm confused by the behaviour of the following script.

WARNING: This script will allocate a large amount of memory! Run it with caution!

import multiprocessing
import numpy as np
from time import sleep

#Declare a dictionary globally
bigDict = {}

def sharedMemory():
    #Using numpy, store 1GB of random data
    for i in xrange(1000):
        bigDict[i] = np.random.random((125000))
    bigDict[0] = "Known information"

    #In System Monitor, 1GB of memory is being used
    sleep(5)

    #Start 4 processes - each should get a copy of the 1GB dict
    for _ in xrange(4):
        p = multiprocessing.Process(target=workerProcess)
        p.start()

    print "Done"

def workerProcess():
    #Sleep - only 1GB of memory is being used, not the expected 4GB
    sleep(5)

    #Each process has access to the dictionary, even though the memory is shared
    print multiprocessing.current_process().pid, bigDict[0]

if __name__ == "__main__":
    sharedMemory()

The above program illustrates my confusion: it seems like the dict automatically becomes shared between the processes. I thought that to get that behaviour I had to use a multiprocessing manager. Could someone explain what is going on?

The Bearded Templar

1 Answer


On Linux, forking a process doesn't immediately result in twice the memory being occupied. Instead, the page table of the new process is set up to point to the same physical memory as that of the old process, and a page is only actually copied when one of the processes attempts to write to it (copy-on-write, COW). The result is that both processes appear to have separate memory, but additional physical memory is only allocated once one of the processes actually writes to a page.
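
This isn't part of the original answer, but here is a small sketch of how you could watch the copies happen. It is Linux-specific, assumes the fork start method (the default on Linux), and uses the kernel's Private_Dirty accounting from /proc/self/smaps, which counts memory a process has written to and no longer shares with anyone:

import multiprocessing
import numpy as np

bigDict = {}

def private_dirty_mb():
    # Sum the Private_Dirty lines of /proc/self/smaps: pages this process
    # has written to and therefore no longer shares with its parent (Linux only).
    total_kb = 0
    with open("/proc/self/smaps") as f:
        for line in f:
            if line.startswith("Private_Dirty:"):
                total_kb += int(line.split()[1])
    return total_kb // 1024

def worker():
    # Right after the fork the arrays are still shared with the parent,
    # so very little of the child's memory is private.
    print("child before writes: %d MB private" % private_dirty_mb())
    # Writing to every array dirties its pages and forces the kernel to copy them.
    for arr in bigDict.values():
        arr += 1.0
    print("child after writes: %d MB private" % private_dirty_mb())

if __name__ == "__main__":
    for i in range(1000):
        bigDict[i] = np.random.random(125000)   # roughly 1 GB in total
    print("parent: %d MB private" % private_dirty_mb())
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()

Plain RSS would already show about 1 GB in the child right after the fork, because RSS also counts pages still shared with the parent; the private-dirty figure is what actually grows once the child starts writing.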

Sven Marnach
  • Alright, that makes sense. However, at the beginning of workerProcess() I added the line `bigDict[5] = multiprocessing.current_process().pid`, printed it later, and found that the correct pid is stored in each process, yet the memory usage still doesn't increase. In a dict like the one I'm using, is each element handled separately? (That is, is only bigDict[5] being copied instead of the whole thing?) – The Bearded Templar Jan 21 '15 at 18:06
  • Dicts, like most containers in Python, just store a reference to their elements, not copies of the elements. – jpkotta Jan 21 '15 at 21:08
  • @TheBeardedTemplar: The granularity of memory as far as the operating system is concerned is *pages*, which are usually 4 KB on Linux. The OS doesn't know anything about the data structures the process chooses to store in that memory. A single change like the one you describe is expected to increase memory usage by about 4 KB. – Sven Marnach Jan 22 '15 at 12:05
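
Not from the original thread, but jpkotta's point about references is easy to check with nothing beyond the standard library and NumPy (the names d and arr below are just for illustration):

import sys
import numpy as np

d = {}
arr = np.random.random(125000)   # about 1 MB of array data

# Storing the array in the dict does not copy it; the dict keeps a reference.
d[0] = arr
print(d[0] is arr)        # True: the same object, no copy was made

# The dict's own footprint is just its hash table (well under a kilobyte;
# the exact size varies by Python version), while the data it references is ~1 MB.
print(sys.getsizeof(d))
print(arr.nbytes)         # 1000000 bytes

So when a child process adds a single new key, only the pages holding the dict's small hash table and the new value get copied, not the gigabyte of array data the existing entries point to.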