Why does my multiprocessing job in Python take longer than a single process?

Question

I have to run my code that takes about 3 hours to complete 100 times, which amounts to a total computation time of 300 hours. This code is expensive and includes taking Laplacians, contour points, and outputting plots. I also have access to a computing cluster, which grants me access to 100 cores at once. So I wondered if I could run my 100 jobs on 100 individual cores at once --- no communication between cores --- and reduce the total computation time to somewhere around 3 hours still.

My online reading took me to multiprocessing.Pool, where I ended up using Pool.apply_async(). Quickly, I realized that the average completion time per task was way over 3 hours, the completion time of one task prior to parallelizing. Further research revealed it might be an overhead issue one has to deal with when parallelizing, but I don't think that can account for an 8-fold increase in computation time.

I was able to reproduce the issue on a simple, and shareable piece of code:

import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool
import time

def f(x):
'''Junk to just waste computing power!'''
   arr = np.ones(shape=(10000,10000))*x
   val1 = arr*arr
   val2 = np.sqrt(val1)
   val3 = np.roll(arr,-1) - np.roll(arr,1)
   plt.subplot(121)
   plt.imshow(arr)
   plt.subplot(122)
   plt.imshow(val3)
   plt.savefig('f.png')
   plt.close()
   print('Done')

N = 1
pool = Pool(processes=N)
t0 = time.time()
for i in range(N):
   pool.apply_async(f,[i])
pool.close()
pool.join()

print('Time taken = ', time.time()-t0)

The machine I'm on has 8 cores, and if I've understood my online reading correctly, setting N=1 should force Python to run the job exclusively on one core. Looking at the completion times, I get:

N=1 || Time taken = 7.964005947113037

N=7 || Time taken = 40.3266499042511.

Put simply, I don't understand why the times aren't the same. As for my original code, the time difference between single-task and parallelized-task is about 8-fold.

[Question 1] Is this all a case of overhead (whatever it entails!)?

[Question 2] Is it at all possible to achieve a case where I can get the same computation time as that of a single-task for an N-task job running independently on N cores?

For instance, in SLURM you could do sbatch --array=0-N myCode.slurm, where myCode.slurm would be a single-task Python code, and this would do exactly what I want. But, I need to achieve this same outcome from within Python itself.

Thank you in advance for your help and input!

Question: did you try with N=2, N=3, etc.? Background processes might be sharing the processors and all the context switching on each processor might be slowing things down. — Chaos_Is_Harmony, Aug 31 '21 at 00:54
My attempts at CPU-based multiprocessing jobs in python have always failed in a similar way due to the Global Interpreter Lock. — njp, Aug 31 '21 at 01:06
@njp, the GIL should only be a concern for running multiple threads in a single process. It's not shared between separate processes as this example has. — BowlOfRed, Aug 31 '21 at 01:23

BowlOfRed · Accepted Answer · 2021-08-31T01:26:27.390

1

I suspect that (perhaps due to memory overhead) your program is simply not CPU bound when running multiple copies, therefore process parallelism is not speeding it up as much as you might think.

An easy way to test this is just to have something else do the parallelism and see if anything is improved.

With N = 8 it took 31.2 seconds on my machine. With N = 1 my machine took 7.2 seconds. I then just fired off 8 copies of the N = 1 version.

$ for i in $(seq 8) ; do python paralleltest &  done
...
Time taken =  32.07925891876221
Done
Time taken =  33.45247411727905
Done
Done
Done
Done
Done
Done
Time taken =  34.14084982872009
Time taken =  34.21410894393921
Time taken =  34.44455814361572
Time taken =  34.49029612541199
Time taken =  34.502259969711304
Time taken =  34.516881227493286

I also tried this with the entire multiprocessing stuff removed and just a single call to f(0) and the results were quite similar. So the python multiprocessing is doing what it can, but this job has constraints beyond just CPU.

When I replaced the code with something less memory intensive (math.factorial(1400000)), then the parallelism looked a lot better. 14.0 seconds for 1 copy, 16.44 seconds for 8 copies.

SLURM has the ability to make use of multiple hosts as well (depending on your configuration). The multiprocessing pool does not. This method will be limited to the resources on a single host.

edited Aug 31 '21 at 01:26

answered Aug 31 '21 at 01:12

BowlOfRed

339
2
11

not just memory overhead, but unless you have lots of ram, this function as written can easily slide into the page file on normal computers... N=7 pushed over 25 Gigs on my system. I could probably have freed up a bit more, but that's more than most people have in the first place anyway. Important to know how much ram is in each node in your cluster. There may also be further limits to ram per process depending on architecture. – Aaron Aug 31 '21 at 02:19
@Aaron and BowlOfRed: Thank you! I wasn't thinking about memory usage, and in that respect, the simple example provided here is not similar to my underlying Python code. While the example here uses about 1.5GB of ram (and on my 8GB ram machine I can see the memory being a bottleneck), my underlying Python code uses about 100MB of ram per task. I run 48 of them on a cluster node with 192GB of ram, so I don't think memory is a limiting factor. Do I understand *memory overhead* incorrectly, of does this information mean something else is the culprit for this slow down? – Ptheguy Aug 31 '21 at 04:12
Memory was the most obvious thing going on to me in this case, but I didn't really analyze it completely. The specifics of your code would matter. All I could say at this point is that the method you are using for the multiprocessing seems completely valid. Maybe if you could make a test case that is more representative of your actual code and reproduces the problem, that could be analyzed. – BowlOfRed Aug 31 '21 at 05:22

score 1 · Answer 2 · answered Aug 31 '21 at 02:11

This occurs at least partly because you're already taking advantage of most of your cpu in just one process. Numpy is generally built to take advantage of multiple threads without the need for multiprocessing. The benefit is greatest when a significant amount of time is spent inside numpy functions or else the comparatively slower python will dominate the time taken in-between numpy calls. When you add tasks in parallel with the CPU already fully loaded, each one runs slower. See this question on adjusting the number of threads numpy uses, and possibly play with setting it to only 1 thread to convince yourself this is the case.

Why does my multiprocessing job in Python take longer than a single process?

2 Answers2