I am attempting to implement multiprocessing in Python (Windows Server 2012) and am having trouble achieving the degree of performance improvement that I expect. In particular, for a set of tasks which are almost entirely independent, I would expect a linear improvement with additional cores.
I understand that--especially on Windows--there is overhead involved in opening new processes [1], and that many quirks of the underlying code can get in the way of a clean trend. But in theory the trend should ultimately still be close to linear for a fully parallelized task [2]; or perhaps logistic if I were dealing with a partially serial task [3].
However, when I run multiprocessing.Pool on a prime-checking test function (code below), I get a nearly perfect square-root relationship up to N_cores=36
(the number of physical cores on my server) before the expected performance hit when I get into the additional logical cores.
Here is a plot of my performance test results :
( "Normalized Performance" is [ a run time with 1 CPU-core ] divided by [ a run time with N CPU-cores ] ).
Is it normal to have this dramatic diminishing of returns with multiprocessing? Or am I missing something with my implementation?
import numpy as np
from multiprocessing import Pool, cpu_count, Manager
import math as m
from functools import partial
from time import time
def check_prime(num):
#Assert positive integer value
if num!=m.floor(num) or num<1:
print("Input must be a positive integer")
return None
#Check divisibility for all possible factors
prime = True
for i in range(2,num):
if num%i==0: prime=False
return prime
def cp_worker(num, L):
prime = check_prime(num)
L.append((num, prime))
def mp_primes(omag, mp=cpu_count()):
with Manager() as manager:
np.random.seed(0)
numlist = np.random.randint(10**omag, 10**(omag+1), 100)
L = manager.list()
cp_worker_ptl = partial(cp_worker, L=L)
try:
pool = Pool(processes=mp)
list(pool.imap(cp_worker_ptl, numlist))
except Exception as e:
print(e)
finally:
pool.close() # no more tasks
pool.join()
return L
if __name__ == '__main__':
rt = []
for i in range(cpu_count()):
t0 = time()
mp_result = mp_primes(6, mp=i+1)
t1 = time()
rt.append(t1-t0)
print("Using %i core(s), run time is %.2fs" % (i+1, rt[-1]))
Note: I am aware that for this task it would likely be more efficient to implement multithreading, but the actual script for which this one is a simplified analog is incompatible with Python multithreading due to GIL.