
Consider the following toy example. I want to parallelize the computation of a square function while also modifying the shared object A.

import multiprocessing

A = [1, 2]

def square(i):

    A[i] = 2 + A[i]

    return i * i

square(0)
square(1)

print(A)

A = [1, 2]

multiprocessing.Pool().map(square, [0, 1])

print(A)

The output is the following

[3, 4]
[1, 2]

But I expect it to be

[3, 4]
[3, 4]

As shown above, the serial calls to square changed A from [1, 2] to [3, 4], but Pool().map did not modify A. How can I modify the shared object using Pool().map? Thanks in advance!

J. Lin
  • Depending on your platform (or, actually, your start method, but usually you leave that to the default, which depends on your platform), `A` is either not a shared object in the first place, or it is a shared object but you need locks around it. The right way to handle this is usually either [shared memory](https://docs.python.org/3/library/multiprocessing.html#sharing-state-between-processes), a [`Manager`](https://docs.python.org/3/library/multiprocessing.html#sharing-state-between-processes), or rewriting your code so it passes immutable objects back instead. – abarnert May 31 '18 at 23:08

1 Answer


If your start method is spawn or forkserver, then A is not a shared object in the first place. And if you're on Windows, spawn is the default and the only choice.

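If your start method is fork, then A may look like a shared object, because each worker starts with a copy inherited from the parent. But a worker's writes land in its own copy and never reach the parent, and memory that really is shared isn't safe to mutate without locks anyway.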
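To see which case applies on a given machine, you can ask multiprocessing directly. This is a minimal sketch; the helper name current_start_method is only for illustration and is not part of the library.

```python
import multiprocessing


def current_start_method():
    """Return the start method in use: 'fork', 'spawn', or 'forkserver'."""
    # Note: calling this also fixes the default start method for the process.
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    print("available:", multiprocessing.get_all_start_methods())
    print("in use:", current_start_method())
```

Whatever it reports, the plain list A from the question is not updated in the parent by the workers.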

As explained in Sharing state between processes, you should try as hard as possible to not need shared objects—it's kind of the whole point of multiprocessing that the processes are isolated from each other—but if you really do need them, you have to do something a bit more complicated.

The first option is using shared memory. In this case, you're using your list as a fixed-size array of small ints, which you can simulate with an Array('i', [1, 2]) and use much as in the example in the docs. (That example uses Process; with a Pool, hand the array to the workers through the pool's initializer rather than as a task argument.) For more complicated cases, you often need to add a Lock or other synchronization mechanism to protect the shared memory. This is pretty efficient and simple, but it only works when your shared data is something that can be mapped to low-level types like this.
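A minimal sketch of the shared-memory option applied to the toy problem, assuming the array is handed to each worker through the Pool initializer. The names init_worker, shared_A, and run are illustrative and do not come from the question.

```python
import multiprocessing


def init_worker(shared):
    # Runs once in each worker process; keep a reference to the shared
    # array in a module-level global so tasks can reach it.
    global shared_A
    shared_A = shared


def square(i):
    # The Array carries its own lock; hold it for the read-modify-write.
    with shared_A.get_lock():
        shared_A[i] += 2
    return i * i


def run():
    shared = multiprocessing.Array('i', [1, 2])
    with multiprocessing.Pool(initializer=init_worker,
                              initargs=(shared,)) as pool:
        squares = pool.map(square, [0, 1])
    # Slicing copies the shared values into an ordinary list.
    return squares, shared[:]


if __name__ == "__main__":
    print(run())  # ([0, 1], [3, 4])
```

Passing the Array as an argument to map would fail, since synchronized objects can only be handed to child processes when those processes are created, which is what the initializer does here.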

The second option is using a Manager().list([1, 2]), which you can use much as in the very next example in the docs. This is a lot less efficient, because the real list lives in a separate manager process and every access or mutation is a message sent to that process, but it has the advantage of being dead simple to use.
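A minimal sketch of the Manager option for the same toy problem. It assumes the list proxy is passed to each task with functools.partial, which is one of several ways to do it; the name run is illustrative.

```python
import multiprocessing
from functools import partial


def square(shared_A, i):
    # Each access goes through the proxy to the manager process.
    # The read and the write are two separate calls, so this is only
    # safe here because every task touches its own index.
    shared_A[i] = 2 + shared_A[i]
    return i * i


def run():
    with multiprocessing.Manager() as manager:
        shared_A = manager.list([1, 2])
        with multiprocessing.Pool() as pool:
            squares = pool.map(partial(square, shared_A), [0, 1])
        # Copy the values out before the manager shuts down.
        return squares, list(shared_A)


if __name__ == "__main__":
    print(run())  # ([0, 1], [3, 4])
```

If several tasks could update the same index, a manager.Lock() held around the read-modify-write would be needed as well.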


But again, it's usually better to not do either of these things, and instead rewrite your code to not need shared data in the first place. Usually this means returning more data from the pool tasks, and then having the main process assemble the returned values in some way. Of course this is tricky if, e.g., other tasks inherently need to see the mutated values. (In such cases, you'd often have to build 80% of what Manager is doing, at which point you might as well just use Manager…). But in your toy example, that isn't the case. (And, in fact, when you think that's unavoidably necessary, it often means you haven't thought through how nondeterminism is going to affect your algorithm, and it wouldn't have worked anyway…)

Here's an example of how you could do this with your toy problem:

import multiprocessing

def square(i, aval):
    # actual return value, i, and value to set A[i] to
    return i*i, i, 2+aval

# the guard is required when the start method is spawn (e.g. on Windows)
if __name__ == '__main__':
    A = [1, 2]
    # pass each A[i] into the function
    with multiprocessing.Pool() as pool:
        results = pool.starmap(square, zip([0, 1], A))
    # get the new A[i] out of the function and store it
    for result, i, aval in results:
        A[i] = aval
    print(A)
abarnert
  • Thanks for such a detailed answer! Following your answer and thoughts, I found a similar question on Stack Overflow that has been well solved, which I will post below. My case is even simpler than that. A is a list of independent but complicated objects (linear programming models). Process i modifies the i-th object independently. So A is seen by all processes, but each entry of A is manipulated independently. In the end, I want to get the modified A. I am wondering if there are more concise ways to do it. – J. Lin Jun 01 '18 at 00:25
  • https://stackoverflow.com/questions/1675766/how-to-combine-pool-map-with-array-shared-memory-in-python-multiprocessing – J. Lin Jun 01 '18 at 00:25
  • @J.Lin Then the simplest way to do that is to not share the list at all. Just return the value. Then, in the main process, instead of ignoring the return values from `map`, use them to modify `A[i]`. Or, maybe even more simply, just do something like `A[:] = pool.map(…)`, or even `A = list(pool.map(…))`. – abarnert Jun 01 '18 at 00:30
  • The return value from map is the value of my square function. In the above toy example, it's [0,1]. But I want [3,4]. – J. Lin Jun 01 '18 at 00:38
  • @J.Lin Sure, but in your toy example, you’re not using the return value for anything, so you can change it to return something else instead. And if your real code actually _does_ have a return value that you use, then you can change it to return a pair of two values—the actual return value, and the `A[i]` value—and then change the main code that was doing, say, `for ret in pool.map(…):` to do `for ret, aval in pool.map(…):`. (You may actually need to return `i` as well, or you may be able to use `enumerate(pool.map(…))`, or you may not need it at all if you can just build a new A.) – abarnert Jun 01 '18 at 00:42
  • @J.Lin I've edited the answer to show how to do this with your toy example. If it's not obvious how to adapt that to your real example, edit your question to provide a slightly more complex [mcve] that shows why it's not obvious, and I can edit my answer. – abarnert Jun 01 '18 at 00:52
  • Really smart answer. Thank you! – J. Lin Jun 01 '18 at 20:23
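For the case described in the comments, where each task modifies one independent object, the `A[:] = pool.map(…)` suggestion can be sketched as below. The dict entries and the update function are hypothetical stand-ins for the real models, and the sketch assumes the objects can be pickled, since they travel to the workers and back.

```python
import multiprocessing


def update(model):
    # Hypothetical stand-in for the per-object work: build and return
    # the modified object instead of mutating shared state.
    updated = dict(model)
    updated["value"] += 2
    return updated


def run():
    A = [{"name": "m0", "value": 1}, {"name": "m1", "value": 2}]
    with multiprocessing.Pool() as pool:
        # map preserves input order, so result i replaces A[i].
        A[:] = pool.map(update, A)
    return A


if __name__ == "__main__":
    print(run())
```

If the function also has a separate return value, it can return a pair and the main process can unpack it, as in the answer's starmap example.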