25

I have written a program that can be summarized as follows:

import multiprocessing

def loadHugeData():
    #load it
    return data

def processHugeData(data, res_queue):
    for item in data:
        #process item into result
        res_queue.put(result)
    res_queue.put("END")

def writeOutput(outFile, res_queue):
    with open(outFile, 'w') as f:
        res = res_queue.get()
        while res != 'END':
            f.write(res)
            res = res_queue.get()

res_queue = multiprocessing.Queue()

if __name__ == '__main__':
    data=loadHugeData()
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()
    processHugeData(data, res_queue)
    p.join()

The real code (especially writeOutput()) is a lot more complicated. writeOutput() only uses the values it takes as arguments (meaning it does not reference data).

Basically it loads a huge dataset into memory and processes it. Writing the output is delegated to a sub-process (it actually writes into multiple files, and this takes a lot of time). So each time one data item gets processed it is sent to the sub-process through res_queue, which in turn writes the result into files as needed.

The sub-process does not need to access, read or modify the data loaded by loadHugeData() in any way. It only needs to use what the main process sends it through res_queue. And this leads me to my problem and question.

It seems to me that the sub-process gets its own copy of the huge dataset (when checking memory usage with top). Is this true? And if so, how can I avoid it (essentially using double the memory)?

I am using Python 2.6 and the program is running on Linux.

martin
FableBlaze
  • Can you restructure your code to use iterators instead of loading all of loadHugeData into memory? It would seem that you could, if the flow really is load/process/enqueue/dequeue/write. – sotapme Feb 07 '13 at 11:55
  • The "hugeData" is unfortunately a tab-separated txt file basically containing a sparse array. And I need "random access" to this data based on the line number during processing. Therefore loading it into memory (with sparse array specific optimisations) makes processing a lot faster. – FableBlaze Feb 07 '13 at 12:04
  • It might be massively over-engineering to suggest using something like [beanstalkd](https://github.com/earl/beanstalkc/blob/master/TUTORIAL.mkd) to do the process integration, but it would be interesting to know if it helped/scaled/performed. As usual other people's problems are always more interesting. – sotapme Feb 07 '13 at 13:14

2 Answers

33

The multiprocessing module is effectively based on the fork system call which creates a copy of the current process. Since you are loading the huge data before you fork (or create the multiprocessing.Process), the child process inherits a copy of the data.

However, if the operating system you are running on implements COW (copy-on-write), there will only actually be one copy of the data in physical memory unless you modify the data in either the parent or child process (both parent and child will share the same physical memory pages, albeit in different virtual address spaces); and even then, additional memory will only be allocated for the changes (in pagesize increments).

You can avoid this situation by calling multiprocessing.Process before you load your huge data. Then the additional memory allocations will not be reflected in the child process when you load the data in the parent.
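
Applied to the code in the question, a minimal sketch of that reordering could look like the following (it reuses the question's own function names, and outFile is left undefined exactly as in the original summary):

import multiprocessing

if __name__ == '__main__':
    res_queue = multiprocessing.Queue()
    # start the writer while the parent process is still small
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()
    # the fork has already happened, so nothing loaded from here on
    # is duplicated into the child
    data = loadHugeData()
    processHugeData(data, res_queue)
    p.join()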

Edit: reflecting @Janne Karila's comment in the answer, as it is so relevant: "Note also that every Python object contains a reference count that is modified whenever the object is accessed. So, just reading a data structure can cause COW to copy."
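
A tiny illustration of that point (the list and variable names here are purely for demonstration): merely binding another name to an object updates the reference count stored inside the object itself, and that counts as a write for copy-on-write purposes:

import sys

data = ["some item"]
print(sys.getrefcount(data[0]))  # baseline count for the element
item = data[0]                   # merely taking another reference...
print(sys.getrefcount(data[0]))  # ...writes a higher count into the object's header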

isedev
  • Faster than me, well done. Linux is COW, so the moment the parent process writes to the data, the data will be duplicated. If the parent process only reads the data then there will only be one instance of the data **BUT** top (I'm almost sure) will show the data as belonging to both processes. meminfo should provide more accurate numbers on memory use. – Eli Algranti Feb 07 '13 at 11:39
  • Indeed. I think the most common OSes are COW these days (I was just trying to be as generic as possible). Great feature but often causes confusion when interpreting the output of process-based memory reporting tools (i.e. top, ps, etc...). `meminfo` on Linux will report correctly as will `pmap` on Solaris; no idea about Windows though :) – isedev Feb 07 '13 at 11:46
  • Note also that every Python object contains a reference count that is modified whenever the object is accessed. So, just reading a data structure can cause COW to copy. – Janne Karila Feb 07 '13 at 11:46
  • Not sure about _accessed_. You probably mean referenced, no? (i.e. a = SomeObject, b = a) – isedev Feb 07 '13 at 11:48
  • Ty for the answer. Calling `multiprocessing.Process` before loading the data seems to have solved the issue. I will look into `meminfo` as well. – FableBlaze Feb 07 '13 at 11:58
  • @isedev Even evaluating an expression involves temporary references. – Janne Karila Feb 07 '13 at 13:09
  • What about `Pool`? Can the same logic be applied there as well? – user3595632 Nov 09 '19 at 13:42
2

As recommended in the documentation (-> "Explicitly pass resources to child processes"), I would make the large data (explicitly) global, so that if copy-on-write (COW) is available and you fork new processes (on macOS the default start method is spawn nowadays), the data is available in the child processes:

def loadHugeData():
    global data
    #load it into the module-level name `data`
    return data

def processHugeData(res_queue):
    global data
    for item in data:
        #process item into result
        res_queue.put(result)
    res_queue.put("END")

But keep in mind that plain Python data structures can still end up being copied: every object carries a reference count that is updated even on read access, which dirties its memory pages under COW. To really share the data you would need lower-level data types such as numpy arrays, whose large buffers are not touched by reference counting.
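
A minimal sketch of that idea, assuming Python 3 on Linux with the fork start method (the array, the summarize() function and the sizes are illustrative, not from the original post): the numpy buffer lives outside the Python object headers that reference counting writes to, so the child can read it without forcing the large pages to be copied.

import multiprocessing
import numpy as np

def summarize(res_queue):
    # `shared` is inherited through fork; reading the numpy buffer does not
    # update per-element reference counts, so its pages stay shared under COW
    res_queue.put(float(shared.sum()))

if __name__ == '__main__':
    multiprocessing.set_start_method('fork')  # this sketch relies on fork, not spawn
    shared = np.zeros(10**7)                  # ~80 MB, allocated before forking
    res_queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=summarize, args=(res_queue,))
    p.start()
    print(res_queue.get())
    p.join()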

Michael Dorner