
I'm new to parallel processing but have an application where it will be useful. I have ~10-100k object instances (of type ClassA), and I want to use the multiprocessing module to distribute the work of calling a particular class method on each of the objects. I've read most of the multiprocessing documentation and several posts about calling class methods, but I have an additional complication: the ClassA objects all hold a reference to the same instance of another type (ClassB), to/from which they may add/remove themselves or other objects. I know sharing state is bad for concurrent processes, so I'm wondering if this is even possible. To be honest, the Proxy/Manager multiprocessing machinery is a little over my head, so I don't fully understand its implications for shared objects, but if someone assured me that I could get this to work I'd spend more time understanding it. If not, this will be a lesson in designing for distributed processes.

Here is a simplified version of my problem:

class ClassA(object):
    def __init__(self, classB_state1, classB_state2, another_obj):
        # Pointers to shared ClassB instances
        self.state1 = classB_state1
        self.state2 = classB_state2
        self.state1.add(self)
        self.object = another_obj

    def run(self, classB_anothercommonpool):
        # do something to self.object
        if True:  # placeholder for some property of self.object
            classB_anothercommonpool.add(self.object)
            self.object = None

        self.switch_states()

    def switch_states(self):
        if self in self.state1:
            self.state1.remove(self)
            self.state2.add(self)

        elif self in self.state2:
            self.state2.remove(self)
            self.state1.add(self)

        else:
            print "State switch failed!"

class ClassB(set):
    # This is essentially a glorified set with a hash so I can have sets of sets.
    # If that's a bad design choice, I'd also be interested in knowing why
    def __init__(self, name):
        self.name = name
        super(ClassB, self).__init__()

    def __hash__(self):
        return id(self)

class ClassC(object):
    def __init__(self, property):
        self.property = property

# Define an importable function wrapping the ClassA method, for multiprocessing
def unwrap_ClassA_run(classA_instance):
    return classA_instance.run(anothercommonpool)

def initialize_states():
    global state1
    global state2
    global anothercommonpool

    state1            = ClassB("state1")
    state2            = ClassB("state2")
    anothercommonpool = ClassB("objpool")

Now, within the same .py file in which the classes are defined:

from multiprocessing import Pool

def test_multiprocessing():
    initialize_states()

    # There are actually 10-100k classA instances
    object1 = ClassC('iamred')  
    object2 = ClassC('iamblue')
    classA1 = ClassA(state1, state2, object1)
    classA2 = ClassA(state1, state2, object2)

    pool = Pool(processes=2)
    pool.map(unwrap_ClassA_run, [classA1, classA2])

If I import this module in an interpreter and run test_multiprocessing(), I get no errors at runtime, but the "State switch failed!" message is printed. If you then examine the classA1/classA2 objects, they have not modified their respective object1/object2 instances, nor have they switched membership between the two ClassB state sets (so the ClassA objects do not register that they are members of the state1 set). Thanks!

williaster
  • There are multiple issues unrelated to using multiple processes in your code e.g., [`global x = y`](http://ideone.com/RnspQK) is not valid Python. You should not modify objects after you've added them into a hash-based container (that is why `frozenset` is hashable unlike `set`), also modifying a global shared state requires synchronization (you could debug it using `multiprocessing.dummy` that uses threads (state is shared by default) while providing the same interface). Finally, without changing the data representation (and algorithm), it is unlikely that you will improve performance using mp. – jfs May 08 '13 at 20:08
  • *Don't subclass built-ins!* That's a bad idea in 99.9% of the times. Also, don't you think that if making `set` hashable was that simple the dev's would have implemented it already? – Bakuriu May 08 '13 at 20:31
  • @J.F.Sebastian Sorry, some things got messed up in the simplification, the global variables are actually declared in another function (I updated). I will read up more on details of hash-based containers, etc. I'm taking some more intro CS classes about data containers right now, hopefully this will help fill in a few of the concepts I'm missing now. Assuming I can fix the container types, can I ask another naive question regarding my basic assumption for this post: why mp would not help run time here? – williaster May 08 '13 at 21:38
  • @Bakuriu Yes that's why I noted it potentially being bad in the code, thanks for the input. As J.F. Sebastian mentioned, this makes the `frozenset` make more sense. – williaster May 08 '13 at 21:46
  • @williaster: it is not a naive question, it is a very valid question: 1. there are two major way to share state between processes: copy it between processes (objects that you pass as arguments) or put it into a shared memory ([`sharedctypes`](http://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.sharedctypes)). In your case, it is a lot of copying and a very little processing. 2. You might need to serialize access to the global shared state (e.g., to avoid reading inconsistent state); [it limits parallelization](http://en.wikipedia.org/wiki/Amdahl's_law). – jfs May 09 '13 at 01:19
  • here's an explanation on [why without changing the data representation, your case involves a lot of copying if you use multiple processes](http://stackoverflow.com/a/1269055/4279). Here's [an example of how to share a `numpy` array between processes](http://stackoverflow.com/a/7908612/4279). Here's [an example of how to use `multiprocessing.Manager.list`](http://stackoverflow.com/a/15858898/4279). – jfs May 10 '13 at 04:46
  • Re: modifying an object after you've added it to a hash-based container. I understand that this could be really bad if you are modifying the object such that the hash value becomes different, but if you are defining `__hash__()` so that it is based on an object attribute (say `.name`) that is unique and does not change, I don't understand why it is a big deal to change the object in other ways. Still playing with fire if others use it / modify the name, but is this technically safe if you aren't modifying the value that feeds into the hash function? Thanks. – williaster May 30 '13 at 17:12
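A quick sketch supporting the point in the last comment: if `__hash__` (and the matching `__eq__`) depend only on an attribute that never changes, mutating the object's other state does not break container membership (the Named class is just for illustration):

```python
class Named(object):
    def __init__(self, name):
        self.name = name
        self.data = []

    def __hash__(self):
        # hash depends only on the immutable name attribute
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, Named) and self.name == other.name

a = Named('a')
container = {a}
a.data.append(42)       # mutating non-hash state is fine
print(a in container)   # True -- membership lookup still works
```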

0 Answers