
I have been trying to use the Python multiprocessing package to speed up some physics simulations I'm doing by taking advantage of the multiple cores of my computer.

I noticed that when I run my simulation, at most 3 of the 12 cores are used. In fact, when I start the simulation it initially uses 3 of the cores, and then after a while it drops to 1 core. Sometimes only one or two cores are used from the start. I have not been able to figure out why; between runs I basically change nothing, except closing a few terminal windows (without any active processes). (The OS is Red Hat Enterprise Linux 6.0; the Python version is 2.6.5.)

I experimented by varying the number of chunks (between 2 and 120) into which the work is split (i.e. the number of processes that are created), but this seems to have no effect.

I looked for info about this problem online and read through most of the related questions on this site (e.g. one, two) but could not find a solution.

(Edit: I just tried running the code under Windows 7 and it uses all available cores just fine. I still want to fix this on the RHEL machine, though.)

Here's my code (with the physics left out):

from multiprocessing import Queue, Process, current_process

def f(q,start,end): #a dummy function to be passed as target to Process
    q.put(mc_sim(start,end))

def mc_sim(start,end): #this is where the 'physics' is 
    p=current_process()
    print "starting", p.name, p.pid        
    sum_=0
    for i in xrange(start,end):
        sum_+=i
    print "exiting", p.name, p.pid
    return sum_

def main():
    NP=0 #number of processes
    total_steps=10**8
    chunk=total_steps/10
    start=0
    queue=Queue()
    subprocesses=[]
    while start<total_steps:
        p=Process(target=f,args=(queue,start,start+chunk))
        NP+=1
        print 'delegated %s:%s to subprocess %s' % (start, start+chunk, NP)
        p.start()
        start+=chunk
        subprocesses.append(p)
    total=0
    for i in xrange(NP):
        total+=queue.get()
    print "total is", total
    #two lines for consistency check:    
    # alt_total=mc_sim(0,total_steps)
    # print "alternative total is", alt_total
    while subprocesses:
        subprocesses.pop().join()

if __name__=='__main__':
    main()

(In fact the code is based on Alex Martelli's answer here.)
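
(For comparison only: one of the commenters below suggests using `multiprocessing.Pool` instead of managing the `Process` objects and the `Queue` by hand. Here is a minimal sketch of what that could look like for the same dummy work; the names and the chunking are just illustrative, and this is not the code I actually ran.)

from multiprocessing import Pool

def mc_sim_chunk(bounds): #takes a (start, end) tuple, since Pool.map passes a single argument
    start, end = bounds
    return sum(xrange(start, end))

def main_with_pool():
    total_steps = 10**8
    chunk = total_steps / 10 #integer division in Python 2
    bounds = [(s, s + chunk) for s in xrange(0, total_steps, chunk)]
    pool = Pool() #defaults to one worker process per CPU core
    partial_sums = pool.map(mc_sim_chunk, bounds) #blocks until all chunks are done
    pool.close()
    pool.join()
    print "total is", sum(partial_sums)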

Edit 2: eventually the problem resolved itself without me understanding how. I did not change the code nor am I aware of having changed anything related to the OS. In spite of that, now all cores are used when I run the code. Perhaps the problem will reappear later on, but for now I choose to not investigate further, as it works. Thanks to everyone for the help.

the.real.gruycho
  • I would try to join the processes first and calculate the sum afterwards. Your solution looks strange to me. If that does not help, provide more debug output so that we can see where your processes get blocked. Have you checked how many processes are started? If only 3 cores are used, you could have only 3 processes, or many more but with most of them sleeping. That's a difference which would be helpful to know. – Achim Oct 03 '12 at 14:40
  • if `mc_sim` is implemented in pure python, rather than calling out to some non-python C code, then you're probably running into the [GIL](http://wiki.python.org/moin/GlobalInterpreterLock) (Global Interpreter Lock.) If you do drop into some C or Cython which doesn't require the GIL, you'll need to ensure the routine is properly flagged as e.g. `cdef void func(int a) nogil` (that's how you do it for Cython, anyway.) – tehwalrus Oct 03 '12 at 14:50
  • @Achim, do you mean you would first join the processes (two lines starting with `while subprocesses: ...`) before calculating the final sum (the two lines starting with `for i in range(NP):`)? I just tried that and it does not seem to make a difference. – the.real.gruycho Oct 03 '12 at 15:58
  • @tehwalrus I just tried running my code without any modification on a windows7 machine and it used all 8 cores. Doesn't this indicate that the problem is not in Python's GIL? – the.real.gruycho Oct 03 '12 at 16:02
  • @tehwalrus The GIL is only relevant when dealing with threads of the same process. Multiprocessing avoids using threads specifically for that reason. – tylerl Oct 03 '12 at 16:13
  • The problem can't be in `multiprocessing` itself, if only because the [documentation](http://docs.python.org/library/multiprocessing.html) says it doesn't use the GIL. I can't find an issue in your code, so if you can't find one either, I suggest you take a look at the [parallel python](http://www.parallelpython.com/) module. It's easy, it's fast, it doesn't use the GIL and it can probably help you (again, only if you can't find the issue). – aga Oct 03 '12 at 16:13
  • The problem may be with the O/S. The `multiprocessing` module only guarantees that you'll be working with multiple processes. It's up to the O/S to distribute those processes between the available cores, AFAIK. – mpenkov Oct 03 '12 at 16:22
  • @tylerl thanks for the clarification - apologies for clouding the issue. – tehwalrus Oct 03 '12 at 16:48
  • @Tropcho: Yes, that's what I had in mind. Please check the maximum number of processes running at the same time. You could also write out the current time at the end of both loops. That will show you if there is some (currently unexplainable) locking going on. – Achim Oct 03 '12 at 17:17
  • As a general response, assuming you're not using locks, you may wish to examine your memory and disk requirements and availability, particularly the capacity, speed, and latency. You didn't say if you tried with both RHEL 6 and Windows 7 on the same or different machines. Only a comparison on the same machine is relevant. – Asclepius Oct 04 '12 at 16:36
  • @Achim Finally, I think that it's a bug somewhere. I used `multiprocessing.current_process()` to check which processes start (I'll edit the code above to show that). All processes did start immediately, but nevertheless only a few of the cores (max. 3, but sometimes 1 or 2) were active. A few hours later I ran the same code, and all 12 cores were used. I then put the computer to sleep and immediately back on, and ran the code again. This time again only a few of the cores were used. Confusing. – the.real.gruycho Oct 04 '12 at 17:35
  • @A-B-B I tried Windows 7 and RHEL 6 on different machines. – the.real.gruycho Oct 04 '12 at 17:40
  • @A-B-B Memory is about 62GiB and ~75% of that is used with total_steps=10**9. The `range(start,end)` list can use a lot of memory if `end-start` is a big number. To fix that one can use `while (start – the.real.gruycho Oct 04 '12 at 17:56
  • @Tropcho: Measure the time required from start to end in each subprocess (a minimal timing sketch is shown after these comments). The figures will probably confirm that some processes are blocked/sleeping and take much longer than the average. If that's confirmed, try to remove all possibilities for logging: I would remove the queue, hardcode the values and check how the code behaves. My guess would be that the access to the queue is somehow blocking. – Achim Oct 04 '12 at 18:04
  • try to run it on python 2.7. Add mp.log_to_stderr().setLevel(logging.DEBUG), use mp.Pool() to avoid managing processes by hand, try mp.Manager().Queue() (in case it is a bug in mp.Queue()) – jfs Oct 04 '12 at 18:19
  • @J.F. Sebastian: I've found no significant bug in `multiprocessing.Queue` in Python 2.6 on RHEL 5 or 6. I've used it just fine in multiple projects. @Tropcho: Of course you shouldn't be using `range` in Python 2.x - use `xrange` instead. When using many processes, use `ps` or `top` to examine their typical run state. – Asclepius Oct 04 '12 at 19:05
  • @Achim Yes, I forgot to mention this, I timed the processes and when only one core was used they all took approximately the same amount of time. So probably they are concurrently running on the same core. – the.real.gruycho Oct 05 '12 at 18:10
  • @J.F.Sebastian: I tried python 2.7, same thing as with python 2.6. I will try the rest of the stuff you suggest a bit later, there's something more urgent at the moment. – the.real.gruycho Oct 05 '12 at 18:10
  • Hey all, pardon me for not providing an update earlier: eventually the problem resolved itself without me understanding how. I did not change the code nor am I aware of having changed anything related to the OS. In spite of that, now all cores are used when I run the code. Perhaps the problem will reappear later on, but for now I choose to not investigate further, as it works. Thanks to everyone for the help. – the.real.gruycho Feb 07 '13 at 09:33
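
A minimal sketch of the per-process timing discussed in the comments above (wrapping the dummy work in `time.time()` calls; the function name `mc_sim_timed` is illustrative and not part of the original code):

import time
from multiprocessing import current_process

def mc_sim_timed(start, end): #same dummy work as mc_sim, but reports wall-clock time per worker
    p = current_process()
    t0 = time.time()
    sum_ = 0
    for i in xrange(start, end):
        sum_ += i
    print p.name, p.pid, "took %.2f s" % (time.time() - t0)
    return sum_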

1 Answer


I ran your example on Ubuntu 12.04 x64 (kernel 3.2.0-32-generic) with Python 2.7.3 x64 on an i7 processor, and all 8 cores reported by the system were fully loaded (based on watching htop). So your problem, Sir, is with the OS setup; the code is fine.
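
If you want to rule out CPU-affinity restrictions on the RHEL machine, one quick check is to have each worker report which CPUs it is allowed to run on. This is only a sketch, and it assumes a Linux /proc filesystem that exposes the Cpus_allowed_list field in /proc/<pid>/status, which recent kernels do:

import os

def report_affinity():
    #print which CPUs the current process is allowed to run on (Linux only)
    pid = os.getpid()
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            if line.startswith("Cpus_allowed_list"):
                print "pid", pid, "may run on CPUs", line.split(":")[1].strip()

Calling something like this at the top of `mc_sim` in every worker would show whether the workers are actually allowed to use all 12 cores, or whether something (e.g. `taskset` or a cgroup) has restricted them.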

WBAR
  • Right. It looks like a bug that has since been fixed. See the Original Poster's comment (Tropcho, Feb 7 '13 at 9:33) saying that it started working for him on RHEL as well. – nealmcb Oct 30 '15 at 16:57