I'm creating a 58 GB graph in a Python program, which I then want to process with multiple (Linux) processes. For that I'm using a ProcessPoolExecutor with a fork context, passing each process just an integer identifying its chunk.
import concurrent.futures
import multiprocessing as mp

with concurrent.futures.ProcessPoolExecutor(mp_context=mp.get_context('fork')) as executor:
    futures = []
    for start in range(0, len(vertices_list), BATCH_SIZE):
        future = executor.submit(process_batch, start)
        futures.append(future)
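The results are then gathered from the futures in the usual way; roughly like this (shown only for completeness, the collection code is not the problem):

from concurrent.futures import as_completed

all_results = []
for future in as_completed(futures):
    # each batch returns a small list of (vertex, metric) pairs
    all_results.extend(future.result())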
The process_batch function computes a metric for each vertex in its batch and returns a small list of results. Both vertices_list and graph are global variables. The graph variable is bound to a C++ Python extension module that implements the required data structure and calculation.
def process_batch(start):
    results = []
    for v in vertices_list[start:start + BATCH_SIZE]:
        results.append((v, graph.some_metric(v)))
    return results
This works fine with small samples, but when I run it on the 58 GB structure, the machine grinds to a halt, with memory getting swapped out so the forked processes can bring in more memory in ¼ MB chunks. (Using strace(1) I saw a few such mmap(2) calls.) I saw two comments (here and here) indicating that with Python, fork(2) effectively means copy on access, because merely accessing an object changes its reference count. This seems consistent with the behavior I'm observing: the forked processes are probably trying to create their own private copies of the data structure.
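As an illustration of the copy-on-access effect (a minimal sketch, not my real code): merely reading an object in the forked child binds it to a new name, which increments its reference count, rewrites the object header, and so dirties the copy-on-write page holding it:

import os
import sys

data = [object() for _ in range(3)]   # stand-in for the big structure

pid = os.fork()
if pid == 0:
    # Child: just *reading* data[0] increments its refcount, which
    # modifies the object header and forces the kernel to copy the
    # copy-on-write page containing it.
    x = data[0]
    print("child refcount:", sys.getrefcount(x))
    os._exit(0)
else:
    os.waitpid(pid, 0)
    print("parent refcount:", sys.getrefcount(data[0]))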
So my question is: does Python offer a way to share an arbitrary data structure among forked Python processes? I know about multiprocessing.shared_memory and the related facilities (multiprocessing's Array and Value types), but those deal with flat buffers and ctypes-based values, not with an arbitrary object graph.
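For example, what I know how to do with shared_memory is share a flat buffer, roughly like this (a sketch; it assumes the data fits in a NumPy array, which a graph backed by C++-side structures does not):

import numpy as np
from multiprocessing import shared_memory

# Parent: copy a flat array into a named shared-memory block.
vertices = np.arange(1_000_000, dtype=np.int64)
shm = shared_memory.SharedMemory(create=True, size=vertices.nbytes)
shared = np.ndarray(vertices.shape, dtype=vertices.dtype, buffer=shm.buf)
shared[:] = vertices

# Any other process that knows shm.name can attach without copying:
other = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(vertices.shape, dtype=np.int64, buffer=other.buf)

# ... use view, then clean up:
other.close()
shm.close()
shm.unlink()

But that only covers contiguous buffers of primitive values; the graph itself lives in the extension module's memory and can't be expressed this way.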