
I wrote Python code for the Q-learning algorithm, and I have to run it multiple times since the algorithm has random output, so I use the multiprocessing module. The structure of the code is as follows:

import numpy as np
import scipy as sp
import multiprocessing as mp
# ...import other modules...

# ...define some parameters here...

# using multiprocessing
result = []
num_threads = 3
pool = mp.Pool(num_threads)
for cnt in range(num_threads):
    args = RL_params + phys_params  # concatenate the two parameter tuples
    result.append(pool.apply_async(Q_learning, args))

pool.close()
pool.join()
result = [r.get() for r in result]  # collect the return values

There is no I/O operation in my code, and my workstation has 6 cores (12 threads) and enough memory for this job. When I run the code with num_threads=1, it takes only 13 seconds, and the job occupies only 1 thread at 100% CPU usage (using the top command).

[screenshot: CPU status with num_threads=1]

However, if I run it with num_threads=3 (or more), it takes more than 40 seconds, and the job occupies 3 threads, each using 100% of a CPU core.

[screenshot: CPU status with num_threads=3]

I can't understand this slowdown, because there is no parallelization inside any of my self-defined functions and no I/O operation. It is also interesting to notice that when num_threads=1, CPU usage is always less than 100%, but when num_threads is larger than 1, CPU usage may sometimes be 101% or 102%.

On the other hand, I wrote another simple test script that does not import numpy or scipy, and this problem never shows up there. I have noticed the question "why isn't numpy.mean multithreaded?", and it seems my problem may be due to the automatic parallelization of some methods in numpy (such as dot). But as shown in the pictures, I can't see any parallelization when I run a single job.
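One way to test whether the BLAS library behind numpy is the culprit is to pin it to a single thread before numpy is imported. This is only a sketch; the environment variable that matters depends on which BLAS backend (OpenBLAS, MKL, etc.) your numpy build uses:

```python
import os

# These must be set before numpy is first imported;
# which one takes effect depends on the BLAS backend.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np

a = np.random.rand(1000, 1000)
b = a @ a  # now runs on a single core regardless of pool size
print(b.shape)
```

If the multi-process slowdown disappears with these set, the processes were contending over BLAS worker threads.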

JunjieChen
  • this is normal, because of many factors: context switching and serializing the message passing are not free, nor is working around the GIL, so it is expected that CPU-intensive things will be slower. –  Nov 19 '17 at 18:29

1 Answer


When you use a multiprocessing pool, all the arguments and results get sent through pickle. This can be very processor-intensive and time-consuming. That could be the source of your problem, especially if your arguments and/or results are large. In those cases, Python may spend more time pickling and unpickling the data than it does running computations.
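You can measure this overhead directly. A minimal sketch, using a hypothetical stand-in for large RL/physics parameter arrays:

```python
import pickle
import time

import numpy as np

# Hypothetical stand-in for large argument arrays (~32 MB of float64).
big_args = (np.zeros((2000, 2000)), {"alpha": 0.1, "gamma": 0.99})

t0 = time.perf_counter()
blob = pickle.dumps(big_args, protocol=pickle.HIGHEST_PROTOCOL)
t1 = time.perf_counter()
pickle.loads(blob)
t2 = time.perf_counter()

print(f"payload: {len(blob) / 1e6:.1f} MB")
print(f"dumps: {t1 - t0:.3f} s, loads: {t2 - t1:.3f} s")
```

If the payload is large and the dump/load times are a noticeable fraction of your 13-second run, pickling is a plausible culprit.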

However, numpy releases the global interpreter lock during computations, so if your work is numpy-intensive, you may be able to speed it up by using threading instead of multiprocessing. That would avoid the pickling step. See here for more details: https://stackoverflow.com/a/38775513/3830997
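A sketch of what that could look like, assuming your real Q_learning is dominated by numpy calls that release the GIL (the q_learning function here is just a stand-in):

```python
from multiprocessing.pool import ThreadPool  # threads, not processes

import numpy as np

def q_learning(seed):
    # Stand-in for the real Q_learning: heavy numpy work whose
    # BLAS calls release the GIL, so threads can overlap.
    rng = np.random.default_rng(seed)
    a = rng.random((500, 500))
    return np.linalg.matrix_power(a @ a.T, 3).trace()

with ThreadPool(3) as pool:
    # Arguments and results stay in one process, so nothing is pickled.
    results = pool.map(q_learning, range(3))

print(len(results))
```

If your Q_learning also has significant pure-Python sections, those will still serialize on the GIL, so this only helps when numpy dominates the runtime.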

Matthias Fripp
  • I also wrote a shell script to submit multiple jobs at one time instead of using multiprocessing, and it has the same problem. Is this also due to pickle as you described? @Matthias Fripp – JunjieChen Nov 19 '17 at 18:49
  • When launching multiple separate jobs, I'd expect nearly linear scaling for an "embarrassingly parallel" task like this, e.g., 4 jobs on 4 cores should complete in the same time as 1 job on 1 core. There shouldn't be a pickle problem. Is it possible you're running short of memory? If your jobs all together require more memory than your system has, then your system will begin to swap to disk or use compressed memory (Mac). This can cause severe slowdowns. – Matthias Fripp Nov 19 '17 at 20:08
  • As I show in the figure, each thread occupies less than 1% of memory, so I think the memory is enough for this job. In fact, though there are multiple jobs, they are independent and there is no communication between them. – JunjieChen Nov 20 '17 at 05:40
  • Sorry, I'm drawing a blank on this. From the screenshots, it looks like your code always runs single-threaded. So there doesn't seem to be a possibility of processor contention between parallel tasks, since you have plenty of cores available. Since they don't do any I/O and use relatively little memory, I wouldn't expect contention for disk or memory either. Are you able to provide a minimum example that shows this problem so other people can troubleshoot it? https://stackoverflow.com/help/mcve – Matthias Fripp Nov 21 '17 at 00:25