
I have a dictionary of objects, and I would like to populate this dictionary using the multiprocessing package. This bit of code runs "Run" many times in parallel.

from multiprocessing import Process

Data = dict()
for i in range(n):  # n: the number of simulations to run
    Data[i] = dataobj(i)  # dataobj is a class I have defined elsewhere
    proc = Process(target=Run, args=(i, Data[i]))
    proc.start()

Where "Run" does some simulations, and saves the output in the dataobj object

def Run(i, out):
    [...some code to run simulations....]
    out.extract(file)

My code creates a dictionary of objects, and then modifies the objects in that dictionary in parallel. Is this possible, or do I need to acquire a lock every time I modify an object in the shared dictionary?

Jeff
  • Locks add very little overhead, so you could use one to be sure. Regarding operations on dictionaries: http://stackoverflow.com/a/6955678/1444854; objects don't have any thread safety. – Vincent Ketelaars May 08 '14 at 17:06

1 Answer


Basically, since you're using multiprocessing, each of your processes gets its own copy of the original dictionary of objects, and therefore each one populates a different dictionary. What the multiprocessing package handles for you is the messaging of Python objects between processes, to make things less painful.
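As a quick illustration (a toy sketch with hypothetical names, not your actual dataobj code), changes made inside a child process never show up in the parent's dictionary:

from multiprocessing import Process

def populate(i, d):
    d[i] = i * i  # mutates only this child's private copy of the dict

if __name__ == '__main__':
    shared = dict()
    procs = [Process(target=populate, args=(i, shared)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(shared)  # prints {} -- the parent's dict was never modified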

A good design for your problem is to have the main process handle populating the dictionary, and have its child processes handle the work, using queues to exchange data between the child processes and the main process.

As a general design idea, here's something that could be done:

from multiprocessing import Process, Queue

# one (input, output) queue pair per worker process
queues = [(Queue(), Queue()), (Queue(), Queue())]

def simulate(qin, qout):
    while not qin.empty():
        item = qin.get()
        # work with the item
        qout.put(item)
    # when the input queue is empty, the process is done
    qout.put(None)  # sentinel telling the main process this worker has finished

# first send the data to be processed to the child processes
while data.there_is_more_to_process():
    # here you have to adapt to your context how you want to split the load between your processes
    queues[0][0].put(data.pop_some_data())
    queues[1][0].put(data.pop_some_data())

# only start the workers once their input queues are filled
Process(target=simulate, args=queues[0]).start()
Process(target=simulate, args=queues[1]).start()

processed_data_list = []

# then, for each process' queue pair, populate your output data list (or dict or whatever)
for qin, qout in queues:
    while True:
        result = qout.get()  # blocks until the worker sends something
        if result is None:   # the worker has sent everything it produced
            break
        processed_data_list.append(result)
# here again, you have to adapt to your context how you handle the data sent
# back from the child processes.

Though, take this only as a design idea: the code still has a few rough edges that will get naturally solved when working with real data and processing functions.

zmo
  • Thanks! I did something a bit different (and clunkier), but simpler for me to understand. I used Queue as you suggested. Say I run my process n times: I made a big dictionary with n Queue objects as the values and range(1,n) as the keys. Each parallel process writes to its own queue with queue.put(), and then I use queue.get() at the end to read from each queue and put the result into my object (see the sketch after these comments). – Jeff May 08 '14 at 18:59
  • Yup, the only important thing is to feed your processes using one set of queues and to get data back from them using another set of queues. You may try having a single queue for getting data from all the child processes back to the main process, but definitely keep the queues going from the main process to the children separate. – zmo May 08 '14 at 19:03
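For reference, here's a minimal sketch of the one-queue-per-process approach Jeff describes in the comment above (the squaring stands in for the real simulation, and all names are illustrative):

from multiprocessing import Process, Queue

def run(i, q):
    result = i * i  # stand-in for the real simulation work
    q.put(result)   # each process writes only to its own queue

if __name__ == '__main__':
    n = 4
    queues = {i: Queue() for i in range(n)}  # one queue per process
    procs = [Process(target=run, args=(i, queues[i])) for i in range(n)]
    for p in procs:
        p.start()
    results = {i: queues[i].get() for i in range(n)}  # read each queue at the end
    for p in procs:
        p.join()
    print(results)  # e.g. {0: 0, 1: 1, 2: 4, 3: 9}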