Leveraging "Copy-on-Write" to Copy Data to Multiprocessing.Pool() Worker Processes

Question

I have some multiprocessing Python code that looks roughly like this:

import time
from multiprocessing import Pool
import numpy as np

class MyClass(object):
    def __init__(self):
        self.myAttribute = np.zeros(100000000) # basically a big memory struct

    def my_multithreaded_analysis(self):
        arg_lists = [(self, i) for i in range(10)]
        pool = Pool(processes=10)
        result = pool.map(call_method, arg_lists)
        print(result)

    def analyze(self, i):
        time.sleep(10)
        return i ** 2

def call_method(args):
    my_instance, i = args
    return my_instance.analyze(i)


if __name__ == '__main__':
    my_instance = MyClass()
    my_instance.my_multithreaded_analysis()

After reading other Stack Overflow answers about how memory works with multiprocessing, such as "Python multiprocessing memory usage", I was under the impression that memory use would not grow in proportion to the number of worker processes, since forking is copy-on-write and I have not modified any of the attributes of my_instance. However, when I run top, most of the processes show high memory use (the output below is from OS X, but I can replicate this on Linux).

My question is basically, am I interpreting this correctly in that my instance of MyClass is actually duplicated across the pool? And if so, how can I prevent this; should I just not use a construction like this? My goal is to reduce memory usage for a computational analysis.

PID   COMMAND      %CPU  TIME     #TH    #WQ  #PORT MEM    PURG   CMPRS  PGRP PPID STATE
2494  Python       0.0   00:01.75 1      0    7     765M   0B     0B     2484 2484 sleeping
2493  Python       0.0   00:01.85 1      0    7     765M   0B     0B     2484 2484 sleeping
2492  Python       0.0   00:01.86 1      0    7     765M   0B     0B     2484 2484 sleeping
2491  Python       0.0   00:01.83 1      0    7     765M   0B     0B     2484 2484 sleeping
2490  Python       0.0   00:01.87 1      0    7     765M   0B     0B     2484 2484 sleeping
2489  Python       0.0   00:01.79 1      0    7     167M   0B     597M   2484 2484 sleeping
2488  Python       0.0   00:01.77 1      0    7     10M    0B     755M   2484 2484 sleeping
2487  Python       0.0   00:01.75 1      0    7     8724K  0B     756M   2484 2484 sleeping
2486  Python       0.0   00:01.78 1      0    7     9968K  0B     755M   2484 2484 sleeping
2485  Python       0.0   00:01.74 1      0    7     171M   0B     594M   2484 2484 sleeping
2484  Python       0.1   00:16.43 4      0    18    775M   0B     12K    2484 2235 sleeping

How did you generate this profiler result? – Stefan Mar 22 '21 at 17:06

score 49 · Accepted Answer · edited Feb 07 '21 at 17:40

Anything sent to pool.map (and related methods) isn't actually using shared copy-on-write resources. The values are "pickled" (Python's serialization mechanism), sent over pipes to the worker processes and unpickled there, which reconstructs the object in the child from scratch. Thus, each child in this case ends up with a copy-on-write version of the original data (which it never uses, because it was told to use the copy sent via IPC), and a personal recreation of the original data that was reconstructed in the child and is not shared.
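To make the cost concrete, here is a minimal sketch (Python 3) of roughly what happens to one task before it reaches a worker. The names `big` and `task` are made up for illustration, and a plain bytes object stands in for the instance with the large attribute:

```python
import pickle

# Hypothetical stand-in for the big attribute: 10 MB of zero bytes.
big = bytes(10 * 1024 * 1024)

# One task shaped like the question's arguments: (object holding the data, index).
task = (big, 3)

# Roughly what has to happen before the task is written to the worker's pipe.
payload = pickle.dumps(task)

# Unpickling on the worker side builds a brand-new object that shares nothing.
rebuilt, index = pickle.loads(payload)

print(len(payload) >= len(big))        # the serialized task carries all the data
print(rebuilt == big, rebuilt is big)  # equal contents, but a separate object
```

The serialized payload is at least as large as the data it carries, and that cost is paid again for every task that references the data.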

If you want to take advantage of forking's copy-on-write benefits, you can't send data (or objects referencing the data) over the pipe. You have to store them in a location that can be found from the child by accessing their own globals. So for example:

import os
import time
from multiprocessing import Pool
import numpy as np

class MyClass(object):
    def __init__(self):
        self.myAttribute = os.urandom(1024*1024*1024) # basically a big memory struct(~1GB size)

    def my_multithreaded_analysis(self):
        arg_lists = list(range(10))  # Don't pass self
        pool = Pool(processes=10)
        result = pool.map(call_method, arg_lists)
        print(result)

    def analyze(self, i):
        time.sleep(10)
        return i ** 2

def call_method(i):
    # Implicitly use global copy of my_instance, not one passed as an argument
    return my_instance.analyze(i)

# Constructed globally and unconditionally, so the instance exists
# prior to forking in commonly accessible location
my_instance = MyClass()


if __name__ == '__main__':
    my_instance.my_multithreaded_analysis()

By not passing self, you avoid making copies, and just use the single global object that was copy-on-write mapped into the child. If you needed more than one object, you might make a global list or dict mapping to instances of the object prior to creating the pool, then pass the index or key that can look up the object as part of the argument(s) to pool.map. The worker function then uses the index/key (which had to be pickled and sent to the child over IPC) to look up the value (copy-on-write mapped) in the global dict (also copy-on-write mapped), so you copy cheap information to look up expensive data in the child without copying it.
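The key-lookup idea could be sketched as follows. This is an illustration under assumptions, not a drop-in solution: `BIG_OBJECTS` and `payload_size` are hypothetical names, and the sketch explicitly requests the fork start method, which is only available on POSIX platforms:

```python
import multiprocessing as mp

# Hypothetical registry of expensive objects. It must be filled in before the
# pool is created so that forked workers inherit it via copy-on-write.
BIG_OBJECTS = {}


def payload_size(key):
    # Only the short key was pickled and sent over the pipe; the data itself
    # is found through the inherited global dict.
    return len(BIG_OBJECTS[key])


def main():
    BIG_OBJECTS["a"] = bytes(1024 * 1024)
    BIG_OBJECTS["b"] = bytes(2 * 1024 * 1024)
    # Assumption: a POSIX platform where the "fork" start method exists.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=2) as pool:
        return pool.map(payload_size, ["a", "b"])


if __name__ == "__main__":
    print(main())
```

Only the strings "a" and "b" cross the pipe here; the megabytes of data stay in pages inherited from the parent.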

If the objects are smallish, they'll end up copied even if you don't write to them. CPython is reference counted, and the reference count appears in the common object header and is updated constantly, just by referring to the object, even if it's a logically non-mutating reference. So small objects (and all the other objects allocated in the same page of memory) will be written, and therefore copied. For large objects (your hundred-million-element numpy array), most of it would remain shared as long as you didn't write to it, since the header only occupies one of many pages.
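A small CPython-specific sketch of the reference-count effect, followed by rough page arithmetic for the array in the question. The 4 KiB page size and 8-byte elements are assumptions about the platform, not something stated above:

```python
import sys

data = [1, 2, 3]  # a fresh, ordinary object

before = sys.getrefcount(data)
alias = data      # a logically read-only use: just one more reference
after = sys.getrefcount(data)

# The count lives in the object's header, so merely referring to the object
# wrote to the memory page that holds it.
print(before, after)

# Rough arithmetic for the question's array (assumed: float64, 4 KiB pages).
array_bytes = 100_000_000 * 8
pages = array_bytes // 4096
print(pages)
```

With well over a hundred thousand pages of data, dirtying the one page that holds the header leaves nearly everything shared.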

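Changed in Python 3.8: on macOS, the spawn start method is now the default (see the multiprocessing documentation). Spawn does not use copy-on-write: each worker starts a fresh interpreter and re-imports the main module, so a module-level my_instance = MyClass() is constructed again in every worker.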
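If you are unsure which start method you are getting, a sketch along these lines can check it and opt in to fork where the platform offers it. The helper name `fork_context_or_default` is made up for this example, and whether opting in to fork is safe on macOS depends on what your program does:

```python
import multiprocessing as mp


def fork_context_or_default():
    """Return a 'fork' context when the platform offers it, else the default."""
    if "fork" in mp.get_all_start_methods():
        # Explicit opt-in; on macOS this may be unsafe with system frameworks.
        return mp.get_context("fork")
    return mp.get_context()


ctx = fork_context_or_default()
print(mp.get_all_start_methods())  # start methods available on this platform
print(ctx.get_start_method())      # the method this context will use
```

Creating the pool with `ctx.Pool(...)` instead of `Pool(...)` then uses the chosen start method without changing the process-wide default.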

edited Feb 07 '21 at 17:40 by dre-hh
answered Jul 01 '16 at 01:44 by ShadowRanger

Also note: If the objects are smallish, they'll end up copied even if you don't write to them. CPython is reference counted, and the reference count appears in the common object header and is updated constantly, just by referring to the object, even if it's a logically non-mutating reference. So small objects (and all the other objects allocated in the same page of memory) will be written, and therefore copied. For large objects (your hundred million element `numpy` array), most of it would remain shared as long as you didn't write to it, since the header only occupies one of many pages. – ShadowRanger Jul 02 '16 at 00:26
I've incorporated your comment into the answer body. The implication of that statement is that for vanilla Python data structures (lists, dicts etc), a copy is triggered at point of reference in the child process therefore you might as well pass the structure explicitly as a method parameter and be done with it. Would you know if a way exists to prevent this behaviour? – iruvar Jan 25 '20 at 19:19
@iruvar: It's still cheaper to have it duplicated via COW than to pickle it, send it via a pipe, then unpickle it on the other side. And any stuff that isn't actually referenced (data created in the parent and not loaded in the workers) won't be duplicated. The only ways to "prevent" this behavior are to use non-CPython interpreters (though their GC process is likely to trigger similar behaviors), or use non-`fork` start methods (so you'll have to send stuff via pickling, but at least you've got far less that could potentially be copied). – ShadowRanger Jan 26 '20 at 22:35
The easiest, perhaps oversimplistic bottom-line: Use global variables for anything you do not want to be copied and pickled. – Radio Controlled Oct 15 '20 at 09:06
on python3 the big memory struct gets pickled anyway – dre-hh Feb 05 '21 at 16:26
@dre-hh: It definitely doesn't if you're actually `fork`ing; there's no place in the code for it to transfer it. I see you down-voted, but I guarantee you you're not seeing what you think you're seeing, or you're using code that's subtly different (e.g. mapping over a bound method of the big struct, not a stand-alone function; bound methods are pickled with the instance they're bound to, which would get that result). – ShadowRanger Feb 05 '21 at 17:36
Upvoted back... will do more debugging and publish a gist. I used exactly the same code and replaced `myattribute` with a fastText model, which is 7 GB in memory. The model logs that it is being loaded exactly n times, once per process, and there is a delay of n times the load time. Each Python process shows up as using 7 GB of memory, and if I go with more than 5 processes the OS X Activity Monitor shows memory pressure, everything starts swapping, and the program never finishes even with only a couple of items to process. I want to check whether I can reproduce this with just a large numpy array. – dre-hh Feb 05 '21 at 20:10
I can definitely reproduce the behaviour with an `os.urandom()` string of 5 GB. However, I have overridden the `__setstate__` method, which would be called upon unpickling, and it is not called (it is called when I pickle it manually). I suppose this is OS X behavior, which does not support copy-on-write. – dre-hh Feb 05 '21 at 21:38
@dre-hh: [macOS defaults to using the `'spawn'` method instead of `'fork'` starting in 3.8, because macOS system frameworks are not `fork`-safe](https://bugs.python.org/issue33725). The way `'spawn'` works is *very* different from the way `'fork'` works (it does a bunch of stuff to simulate forking, sort of, but COW isn't involved, at all). You can always try opting in to the `'fork'` start method (at the expense of possibly crashing your code if you get unlucky on the `fork` timing). – ShadowRanger Feb 06 '21 at 00:58
Unfortunately I can't upvote the answer back until it's changed, so I've made an edit suggestion. Using random bytes rather than zeros is also better on OS X, because there is memory compression and zeros are compressed away. – dre-hh Feb 07 '21 at 17:43

The Aelfinn · Answer 2 · 2018-08-29T20:49:51.207

Alternatively, to take advantage of forking's copy-on-write benefits, while preserving some semblance of encapsulation, you could leverage class-attributes and @classmethods over pure globals.

import time
from multiprocessing import Pool
import numpy as np

class MyClass(object):

    myAttribute = np.zeros(100000000) # basically a big memory struct
    # myAttribute is a class-attribute

    @classmethod
    def my_multithreaded_analysis(cls):
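        arg_list = list(range(10))
        pool = Pool(processes=10)
        # Refer to the classmethod through cls; a bare `analyze` is not defined here.
        # (Pickling a bound classmethod like this requires Python 3.)
        result = pool.map(cls.analyze, arg_list)
        print(result)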

    @classmethod
    def analyze(cls, i):
        time.sleep(10)
        # If you wanted, you could access cls.myAttribute w/o worry here.
        return i ** 2

""" We don't need this proxy step !
    def call_method(args):
        my_instance, i = args
        return my_instance.analyze(i)
"""

if __name__ == '__main__':
    my_instance = MyClass()
    # Note that now you can instantiate MyClass anywhere in your app,
    # While still taking advantage of copy-on-write forking
    my_instance.my_multithreaded_analysis()

Note 1: Yes, I admit that class-attributes and class-methods are glorified globals. But it buys a bit of encapsulation...

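Note 2: Rather than explicitly creating your arg_lists as in the question, you could implicitly pass the instance (self) to each task created by Pool by passing the bound instance method self.analyze to Pool.map(), and shoot yourself in the foot even more easily: a bound method is pickled together with the instance it is bound to, so any large instance attributes are copied to the workers again.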
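The difference between a stand-alone function and a bound instance method can be seen by pickling each by hand. This is a minimal Python 3 sketch; `Holder` and `standalone` are hypothetical stand-ins for the question's class and worker function:

```python
import pickle


class Holder:
    """Hypothetical stand-in for MyClass with a large instance attribute."""

    def __init__(self):
        self.blob = bytes(5 * 1024 * 1024)  # 5 MB of data

    def analyze(self, i):
        return i ** 2


def standalone(i):
    return i ** 2


holder = Holder()

# A module-level function pickles as a short reference to its name.
small = len(pickle.dumps(standalone))

# A bound method pickles together with the instance it is bound to.
large = len(pickle.dumps(holder.analyze))

print(small, large)
```

Mapping over `holder.analyze` would therefore ship the whole instance to the workers, while mapping over `standalone` ships only a name.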
