
My problem is simple: I need to map a function over a list of values and return the maximum result. In other words, a reduction after the map (a mapreduce, if you like the term).

Being a newbie to Python, I read that I should use multiprocessing instead of threads, and that would be OK, but I don't know whether it will cripple my program. The problem is that the function I need to map takes several parameters, and some of the data structures it needs are huge.

So, I am worried that passing big data to the processes will create new copies, and that could be unacceptable.

What would you recommend for solving this simple problem? Will I face multiple copies? Should I manually set up some shared memory, or will the VM automagically do it for me?

Thanks!

senseiwa
  • How huge is *huge* in your context? Python can handle an extremely high load. I wrote a program once that handled big data (think upwards of 10Gb). But that was just simple stock data over 25 years. Unless you're hinting at Hadoop-levels of data, which is not unlikely since you've used the term `mapreduce`, then you should be fine, I believe. – WGS Oct 21 '14 at 15:06
  • Most operating systems implement copy-on-write, meaning that the large data structures will only be copied to the child process if either the parent or child modifies it; otherwise, reads are from the parent's copy. – chepner Oct 21 '14 at 15:06
  • @chepner The one caveat with that is that each object will have its ref-count incremented in the child, even if you don't modify the object. That means the page containing the ref-count variable is going to get copied. That's still much better than having to copy the entire object, though. – dano Oct 21 '14 at 15:37
  • `multiprocessing` performance is pretty bad if it needs to copy large data structures between processes. Some of that can be mitigated if you're using an OS that supports copy-on-write forking, but if you need to return large objects back to the parent, you're out of luck. The `multiprocessing` module [does support shared variables](https://docs.python.org/2.7/library/multiprocessing.html#shared-ctypes-objects), but they need to be `ctypes` objects; it doesn't support arbitrary Python objects. – dano Oct 21 '14 at 15:40
  • @dano They shouldn't; the objects aren't shared at the Python level, but at the OS level. Each process sees the ref-count just as it was before the process forked. Only if the ref-count is subsequently increased in-process would a write be necessary. (That said, it's quite possible that something in the code will increase the ref count.) – chepner Oct 21 '14 at 15:40
  • @chepner Right, there's a whole bunch of things you can do with that object that will increase the ref-count that don't look like modifying the object (passing it to a function, assigning it to another variable etc.). Most of the useful things you'd want to do with it, basically. – dano Oct 21 '14 at 15:48
  • Yup, I was thinking that an increase that would happen in each process "wouldn't count", but that's clearly wrong :) – chepner Oct 21 '14 at 16:04
  • Well, in memory I need large numerical data sets, and these may amount to several GB of mesh, fields, and other menial data structures. However I think that copying a memory page isn't bad, copying the whole mesh would be :) – senseiwa Oct 22 '14 at 07:22

1 Answer


You have a few options for accomplishing this, each with its own advantages and disadvantages.

Argument Lists

Let's get the easy one out of the way first: passing your data structures will most definitely create a copy for each process. It sounds like this is not what you want.
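As a quick sketch of why (the names here are mine, not from the question): arguments to a worker process are pickled and reconstructed in the child, so the child gets an independent copy.

```python
import multiprocessing

def worker(data):
    # This mutates only this process's private copy of the list.
    data.append(99)
    print(len(data))  # 4

if __name__ == "__main__":
    big = [1, 2, 3]
    p = multiprocessing.Process(target=worker, args=(big,))
    p.start()
    p.join()
    print(len(big))  # still 3: the argument was copied, not shared
```

For a few small arguments this is fine; for multi-gigabyte structures it is exactly the copying you want to avoid.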

Managers and Proxies

I recommend you try this method out first. The multiprocessing library supports proxy objects. These act like aliases to shared objects and are easy to use if you're dealing with native types like lists and dictionaries. This method even allows you to safely modify the shared objects without having to worry about the lock details, because the manager will take care of them. Since you mentioned lists, this may be your best bet. You can also create proxies to your own custom objects.
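A minimal sketch of a manager-based map-then-max (function and variable names are mine, and the squaring function stands in for whatever you actually need to map):

```python
import multiprocessing

def squared_max(shared_list, result_queue):
    # Fetch a snapshot through the proxy; the manager process owns the real list.
    values = shared_list[:]
    result_queue.put(max(v * v for v in values))

if __name__ == "__main__":
    with multiprocessing.Manager() as manager:
        shared = manager.list([3, 1, 4, 1, 5])  # proxy to a list living in the manager
        results = multiprocessing.Queue()
        p = multiprocessing.Process(target=squared_max, args=(shared, results))
        p.start()
        p.join()
        print(results.get())  # 25
```

Note that proxy access goes through IPC with the manager process, so it is slower than direct memory access; it trades speed for convenience and safety.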

Global Data Structures

In some situations, an acceptable solution is to make your data structures global as an alternative to passing them as arguments. They will not be copied between processes if you only read from them. This can trip people up when they don't realize that creating a local reference to a global object counts as writing to it, because the object's reference count must be incremented. This is how the garbage collector knows when memory can be freed: if an object's reference count is zero, then no one is using it and it can be removed. Here's a code snippet that demonstrates this:

import sys
import multiprocessing

global_var = [1, 2, 3]

def no_effect1():
    print(global_var[0] + global_var[1])
    print("No Effect Count: {}".format(sys.getrefcount(global_var)))
    return

def no_effect2():
    new_list = [i for i in global_var]
    print("New List Count: {}".format(sys.getrefcount(global_var)))

def change1():
    local_handle = global_var
    print("Local Handle Count: {}".format(sys.getrefcount(global_var)))

def change2():
    new_dict = {'item':global_var}
    print("Contained Count: {}".format(sys.getrefcount(global_var)))

p_list = [multiprocessing.Process(target=no_effect1),
          multiprocessing.Process(target=no_effect2),
          multiprocessing.Process(target=change1),
          multiprocessing.Process(target=change2)]

for p in p_list:
    p.start()

for p in p_list:
    p.join()



This code produces this output:

3
No Effect Count: 2
New List Count: 2
Local Handle Count: 3
Contained Count: 3

In the no_effect1() function, we are able to read and use data from the global structure without increasing the ref count. no_effect2() constructs a new list from the global structures. In both cases, we are reading the globals, but not creating any local references to the same underlying memory. If you use your global data structures in this way, then you will not cause them to be copied between processes.

However, notice that in change1() and change2() the reference count was incremented because we bound a local name to the same data structure. A refcount increment is a write to the object's memory, so under copy-on-write the page containing it will be copied into the child process.

Shared Ctypes

If you can finesse your shared data into C arrays, you can use shared Ctypes. These are arrays (or single values) that are allocated from shared memory. You can then pass around a wrapper without the underlying data being copied.
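As a sketch (the chunking scheme and names are mine), two workers can each scan half of a shared array and report a partial maximum, with no copy of the data itself:

```python
import multiprocessing

def partial_max(shared_arr, start, end, out, idx):
    # shared_arr is backed by shared memory; slicing reads it without copying
    # the whole structure into this process.
    out[idx] = max(shared_arr[start:end])

if __name__ == "__main__":
    data = [2.0, 7.5, 3.3, 9.1, 0.4, 6.6]
    arr = multiprocessing.Array('d', data, lock=False)  # shared array of doubles
    out = multiprocessing.Array('d', 2, lock=False)     # one result slot per worker
    ps = [multiprocessing.Process(target=partial_max, args=(arr, 0, 3, out, 0)),
          multiprocessing.Process(target=partial_max, args=(arr, 3, 6, out, 1))]
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    print(max(out))  # 9.1
```

`lock=False` is safe here because each worker writes to a distinct slot; if workers could write to the same element, you'd want the default synchronized wrapper.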

MMAP

You can also create a shared memory-map to put data into, but it can get complicated, and I would only recommend doing it if the proxy and global options don't work for you. There is a blog post here that has a decent example.
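For completeness, here is a rough sketch of the file-backed flavor (the file layout and names are mine): the parent packs the numbers into a file, and the child maps it read-only, so the OS shares the pages instead of pickling the data across.

```python
import mmap
import multiprocessing
import os
import struct
import tempfile

def find_max(path, count, queue):
    # The child maps the same file; the OS shares the pages between processes.
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        values = struct.unpack("%dd" % count, mm[:count * 8])  # 8 bytes per double
        queue.put(max(values))
        mm.close()

if __name__ == "__main__":
    data = [1.5, 9.25, -3.0, 4.0]
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(struct.pack("%dd" % len(data), *data))
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=find_max, args=(path, len(data), q))
    p.start()
    p.join()
    print(q.get())  # 9.25
    os.remove(path)
```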

Other Thoughts

One nitpicky point: in your question, you referred to a "VM". Since you didn't specify that you're running on a virtual machine, I assume you're referring to the Python interpreter. Keep in mind that CPython, the standard implementation, is a bytecode interpreter and does not provide a managed virtual machine environment the way Java does. The line is certainly blurred and the correct use of the terminology is open for debate, but people generally don't refer to the Python interpreter as a VM. See the excellent answers to this SO question for a more nuanced explanation of the differences.

skrrgwasme
  • So, if I have a module with a variable I could simply import it, and use for instance mymodule.variable if I use it read-only, without any problems (other than a refcount increase)? That would be very nice. – senseiwa Oct 22 '14 at 07:32
  • PS. Thanks for clarifying VM/Interpreter, even though as you say, things are blurry :) – senseiwa Oct 22 '14 at 07:33
  • One word of caution about your global data structure-containing module: when worker processes are launched with the multiprocessing module, they will import the *.py file that contains the target function. If that file imports your data structure module, you may still end up with copies. See [this PMOTW entry](http://pymotw.com/2/multiprocessing/basics.html#importable-target-functions) for a bit more detail. You may want to run some tests to confirm that you won't get copies before you try to implement it in your real project. – skrrgwasme Oct 22 '14 at 16:28