After browsing through many discussions on the same or similar topics, I still can't solve my problem, so I am posting it below.
The following is an MWE of what I would like to parallelize: solving a set of independent linear equations (n*I + mat) x = y, parametrized by n = 0, 1, 2, with fixed arrays mat and y. Note that the arrays are declared global in the hope that they can be accessed by the different processes/pools (see below). But I don't think that works, and this is the core of the question: how can big numpy arrays be shared between processes/pools to avoid communication overhead?
import numpy as np
import time
import os

N = 2000
num = 3

global mat
mat = np.random.rand(N, N)
global y
y = np.random.rand(N, 1)

# Function to be parallelized num times
def fun(n):
    print(f"{n}-th job is run on child {os.getpid()} of parent {os.getppid()}")
    newmat = n * np.eye(N) + mat
    return np.linalg.solve(newmat, y)

# Approach 1: no parallelism
def main():
    start_time = time.time()
    res = []
    for i in range(num):
        res.append(fun(i))
    print(f"Approach 1: Time elapsed = {time.time()-start_time} sec")
    return res

main()
I tried the following three approaches to parallelize it: Pool; Process; and Process combined with Array and numpy.frombuffer. See below.
from multiprocessing import Process, set_start_method, Queue, Pool, cpu_count, Array, RawArray
set_start_method('fork')

# Approach 2: with Pool
def main2():
    start_time = time.time()
    pool = Pool(cpu_count())
    res = pool.map(fun, range(num))
    print(f"Approach 2: Time elapsed = {time.time()-start_time} sec")
    pool.close()
    return res

main2()
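(For reference, a pattern I have seen suggested for this situation is a Pool initializer that hands the big arrays to each worker once, instead of relying on module-level globals. The sketch below is my own and unbenchmarked; the names init_worker, fun_init and main_init are mine, not from any library.)

```python
import numpy as np
from multiprocessing import Pool

N = 2000
num = 3

# Worker-side globals, filled in once per worker by the initializer.
_mat = None
_y = None

def init_worker(mat_, y_):
    # Runs once in each worker process, so the big arrays are transferred
    # (or inherited, with the 'fork' start method) per worker, not per task.
    global _mat, _y
    _mat = mat_
    _y = y_

def fun_init(n):
    newmat = n * np.eye(N) + _mat
    return np.linalg.solve(newmat, _y)

def main_init():
    mat = np.random.rand(N, N)
    y = np.random.rand(N, 1)
    with Pool(processes=num, initializer=init_worker, initargs=(mat, y)) as pool:
        return pool.map(fun_init, range(num))
```

With 'fork' the initargs are inherited cheaply at worker creation; under 'spawn' they are pickled once per worker, which is still better than once per task.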
# Approach 3: with Process
def fun2(i, output):
    output.put(fun(i))

def main3():
    start_time = time.time()
    output = Queue()
    processes = [Process(target=fun2, args=(i, output)) for i in range(num)]
    # Run processes
    for p in processes:
        p.start()
    # Exit the completed processes
    for p in processes:
        p.join()
    res = [output.get() for _ in processes]
    print(f"Approach 3: Time elapsed = {time.time()-start_time} sec")
    return res

main3()
# Approach 4: with Process plus Array and numpy.frombuffer
def fun3(n, output, mat, y):
    print(f"{n}-th job is run on child {os.getpid()} of parent {os.getppid()}")
    mat2 = np.frombuffer(mat.get_obj())
    newmat = n * np.eye(N) + mat2.reshape((N, N))
    output.put(np.linalg.solve(newmat, np.frombuffer(y.get_obj())))

def main4():
    mat2 = Array('d', mat.flatten())
    y2 = Array('d', y.flatten())
    start_time = time.time()
    output = Queue()
    processes = [Process(target=fun3, args=(i, output, mat2, y2)) for i in range(num)]
    # Run processes
    for p in processes:
        p.start()
    # Exit the completed processes
    for p in processes:
        p.join()
    res = [output.get() for _ in processes]
    print(f"Approach 4: Time elapsed = {time.time()-start_time} sec")
    return res

main4()
None of these approaches gives a speedup over the serial version, and I got the following output:
0-th job is run on child 8818 of parent 3421
1-th job is run on child 8818 of parent 3421
2-th job is run on child 8818 of parent 3421
Approach 1: Time elapsed = 0.2891273498535156 sec
0-th job is run on child 8819 of parent 8818
1-th job is run on child 8820 of parent 8818
2-th job is run on child 8821 of parent 8818
Approach 2: Time elapsed = 3.6278929710388184 sec
0-th job is run on child 8832 of parent 8818
1-th job is run on child 8833 of parent 8818
2-th job is run on child 8834 of parent 8818
Approach 3: Time elapsed = 4.243804931640625 sec
0-th job is run on child 8843 of parent 8818
1-th job is run on child 8844 of parent 8818
2-th job is run on child 8845 of parent 8818
Approach 4: Time elapsed = 4.745251893997192 sec
This summarizes all the approaches I have seen so far. I am aware that there is a SharedMemory class in multiprocessing, but it is not available in Python 3.7.2, which I am using. If that could solve the problem, I would be very happy to see how it works.
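From reading the Python 3.8+ docs, I believe a SharedMemory version would look roughly like the sketch below; I cannot test it on 3.7.2, and the names fun_shm and main_shm are mine. The child attaches to the shared block by name and builds a zero-copy numpy view on it, so only the block's name is sent to each process:

```python
import numpy as np
from multiprocessing import Process, Queue
from multiprocessing import shared_memory  # Python 3.8+

N = 2000
num = 3

def fun_shm(n, shm_name, output):
    # Attach to the existing shared block by name and view it as (N, N);
    # no copy of the matrix is made here.
    shm = shared_memory.SharedMemory(name=shm_name)
    mat = np.ndarray((N, N), dtype=np.float64, buffer=shm.buf)
    newmat = n * np.eye(N) + mat
    # Fixed right-hand side, kept simple for the sketch
    output.put(np.linalg.solve(newmat, np.ones((N, 1))))
    shm.close()

def main_shm():
    mat = np.random.rand(N, N)
    # Allocate a shared block and copy mat into it once.
    shm = shared_memory.SharedMemory(create=True, size=mat.nbytes)
    view = np.ndarray(mat.shape, dtype=mat.dtype, buffer=shm.buf)
    view[:] = mat
    output = Queue()
    processes = [Process(target=fun_shm, args=(i, shm.name, output))
                 for i in range(num)]
    for p in processes:
        p.start()
    # Drain the queue before joining, so children never block on put()
    res = [output.get() for _ in processes]
    for p in processes:
        p.join()
    shm.close()
    shm.unlink()
    return res
```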
Thanks to anyone who reads through the whole post; any help is appreciated. In case it matters, I am using a Mac with an Apple M1 chip, running macOS Monterey.
Update 1: per @AKX's point, I removed the print("...-th job is run...") line and set N = 10000, and the results are
Approach 1: Time elapsed = 23.812573194503784 sec
Approach 2: Time elapsed = 126.91087889671326 sec
For Approach 3, it had been running for around 5 minutes when I cut it off. So the communication overhead is pretty large for large N.