
I have a function f(x) that I want to evaluate over a list of values xrange in parallel. The function does something like this:

import numpy as np

def f(x, wrange, dict1, dict2):
    out_list = []
    v1 = dict1[x]                     # matrix for this x
    for w in wrange:
        v2 = dict2[x - w]             # vector for this offset
        out_list += [np.dot(v1, v2)]
    return out_list

It takes a matrix from dictionary dict1 and a vector from dictionary dict2, then multiplies them together. Now my normal approach for doing this in parallel would be something like this:

import functools
import multiprocessing

par_func = functools.partial(f, wrange=wrange, dict1=dict1, dict2=dict2)

p = multiprocessing.Pool(4)
ssdat = p.map(par_func, wrange)
p.close()
p.join()

Now when dict1 and dict2 are big dictionaries, this causes the code to fail with the error

File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

and I think this is because the pool is making copies of dict1 and dict2 for every evaluation of my function. Is there an efficient way to set these dictionaries up as shared-memory objects instead? Is map the best function to do this?

Jiles
  • You may try with `threading` instead. If your `np.dot()` is computationally expensive you may not notice the difference between multiprocessing and multithreading, as NumPy should release the GIL while calculating the matrix product. – Peque Mar 04 '19 at 16:10
  • This is *almost* a duplicate of [Python multiprocessing - Why is using functools.partial slower than default arguments?](https://stackoverflow.com/q/35062087/364696), but the focus is different (this asks how, the other question asks why); I'll leave it up to others to make that decision. – ShadowRanger Mar 04 '19 at 16:39
  • @ShadowRanger thanks, this gave some good background to my question. – Jiles Mar 05 '19 at 09:53

2 Answers


If you're on a fork-based system (read: not Windows), one solution to this problem is to put the dicts in question in globals, write a function that doesn't take them as arguments but simply accesses them from its own globals, and use that. functools.partial is, unfortunately, unsuited to this use case, but your use case makes it easy to replace the partial with globals and a def-ed function:

import multiprocessing

# Assumes wrange/dict1/dict2 defined or imported somewhere at global scope,
# prior to creating the Pool
def par_func(x):
    return f(x, wrange, dict1, dict2)

# Using with statement implicitly terminates the pool, saving close/join calls
# and guaranteeing an exception while mapping doesn't leave the pool alive indefinitely
with multiprocessing.Pool(4) as p:
    ssdat = p.map(par_func, wrange)

Changes to dict1/dict2 won't be reflected between processes after the Pool is created, but you seem to be using them in a read-only fashion anyway, so that's not a problem.

If you're on Windows, or need to mutate the dicts, you can always create a multiprocessing.Manager and make dict proxies with the manager's dict method (these are shared dicts, updated on key assignment), but it's uglier and slower, so I'd discourage it if at all possible.
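For reference, a rough sketch of what that Manager-based version could look like, assuming f, wrange, dict1 and dict2 are defined as in the question (the proxies pickle cheaply, but every lookup is a round trip to the manager process, which is where the slowdown comes from):

import functools
import multiprocessing

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        # manager.dict() copies the data into the manager process once;
        # the workers only receive lightweight proxies
        shared1 = manager.dict(dict1)
        shared2 = manager.dict(dict2)
        par_func = functools.partial(f, wrange=wrange, dict1=shared1, dict2=shared2)
        with multiprocessing.Pool(4) as p:
            ssdat = p.map(par_func, wrange)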

ShadowRanger

If you want to share memory between processes using multiprocessing, you'll need to share the objects explicitly, e.g. with multiprocessing.Array. That's not ideal, since you want to look values up in dicts and finding the correct data in a flat array might be time consuming. There are likely ways around this if it does become a problem for you.
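To give a flavour of the idea, here is a rough sketch (not your exact data layout; vec_len, keys, index, shared_vecs and buf are made-up names) that packs the vectors from dict2 into one shared buffer plus a small key-to-row index:

import multiprocessing
import numpy as np

vec_len = 10                                # assumed length of each vector in dict2
keys = list(dict2.keys())
index = {k: i for i, k in enumerate(keys)}  # key -> row number

# lock=False is fine because the workers only read the data
shared_vecs = multiprocessing.Array('d', len(keys) * vec_len, lock=False)
buf = np.frombuffer(shared_vecs, dtype=np.float64).reshape(len(keys), vec_len)
for k, i in index.items():
    buf[i, :] = dict2[k]                    # copy once, before the Pool is created

# Workers created after this point (on a fork-based system) inherit shared_vecs
# and index without copying, and can rebuild a view with np.frombuffer the same way.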

As @Peque mentioned, the other option is threading. With threading, memory is automatically shared across all threads, but you can run into performance issues due to the global interpreter lock (GIL). The GIL is Python's way of keeping the interpreter thread-safe and avoiding race conditions; since NumPy releases it during the dot product, though, the threads may still run largely in parallel.
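A minimal sketch of the threading route, assuming the same f, wrange, dict1 and dict2 as in the question (multiprocessing.pool.ThreadPool keeps the same map API as Pool, but the workers are threads, so nothing is pickled or copied):

import functools
from multiprocessing.pool import ThreadPool

par_func = functools.partial(f, wrange=wrange, dict1=dict1, dict2=dict2)

# Threads share dict1/dict2 directly; np.dot releases the GIL, so the
# matrix products can still overlap across threads.
with ThreadPool(4) as p:
    ssdat = p.map(par_func, wrange)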

John