
I have a machine with 24 physical cores (at least I was told so) running Debian: Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u1 x86_64 GNU/Linux. It seems to be correct:

usr@machine:~/$ cat /proc/cpuinfo  | grep processor
processor   : 0
processor   : 1
<...>
processor   : 22
processor   : 23

I had some issues trying to load all cores with Python's multiprocessing.pool.Pool. I used Pool(processes=None); the docs say that Python uses cpu_count() if None is provided.

Alas, only 8 cores were 100% loaded; the others remained idle (I used htop to monitor CPU load). I assumed I was misusing Pool, so I tried starting 24 processes "manually":

from multiprocessing import Process

print 'Starting processes...'
procs = []
for param_set in all_params:  # 24 items
    # One process per parameter set
    p = Process(target=_wrap_test, args=(param_set,))
    p.start()
    procs.append(p)

print 'Now waiting for them.'
# Block until every worker has finished
for p in procs:
    p.join()

I had 24 "greeting" messages from the processes I started:

Starting processes...
Executing combination: Session len: 15, delta: 10, ratio: 0.1, eps_relabel: 0.5, min_pts_lof: 5, alpha: 0.01, reduce: 500
< ... 22 more messages ... >
Executing combination: Session len: 15, delta: 10, ratio: 0.1, eps_relabel: 0.5, min_pts_lof: 7, alpha: 0.01, reduce: 2000
Now waiting for them.

But still only 8 cores were loaded:

[Screenshot: htop output with 8 CPU load bars at 100% and the remaining cores idle]

I've read here on SO that there may be issues with numpy, OpenBLAS and multicore execution. This is how I start my code:

OPENBLAS_MAIN_FREE=1 python -m tests.my_module

And after all imports I do:

os.system("taskset -p 0xff %d" % os.getpid())

So, here is the question: what should I do to get 100% load on all cores? Is this just my poor Python usage, or does it have something to do with OS limitations on multicore machines?

UPDATED: one more interesting thing is an inconsistency within the htop output. If you look at the image above, you'll see that the table below the CPU load bars shows 30-50% load for many more than 8 cores, which clearly differs from what the load bars say. top, on the other hand, agrees with the bars: 8 cores 100% loaded, the others idle.

UPDATED ONCE AGAIN:

I used this rather popular post on SO when I added the os.system("taskset -p 0xff %d" % os.getpid()) line after all imports. I have to admit that I didn't think too much when I did that, especially after reading this:

With this line pasted in after the module imports, my example now runs on all cores

I'm a simple man. I see "works like a charm", I copy and paste. Anyway, while playing with my code I eventually removed this line. After that my code began executing on all 24 cores for the "manual" Process starting scenario. For the Pool scenario the same problem remained, no matter whether the affinity trick was used or not.

I don't think it's a real answer 'cause I don't know what the issue is with Pool, but at least I managed to get all cores fully loaded. Thank you!

oopcode
  • Are you sure this is 1 processor board? I've heard the rumor that python can't do multiprocessing (use more than 1 CPU) – deathangel908 Jul 09 '15 at 14:22
  • @deathangel908 I suppose it has 4 CPU with 6 cores each. But it already uses more than 6 cores, so it's not the issue I guess. – oopcode Jul 09 '15 at 14:24
  • @deathangel908 that is mistaken: there are problems getting threading to use all machine resources, but multiprocessing, using separate unix processes, is not limited by Python. My guess is there is some kernel setting that isn't set properly as the OP guessed too. – msw Jul 09 '15 at 14:26
  • Have you tried using `top` and hitting `1`? It will show each individual core. I've experienced bugs in `htop` before, but `top` is always reliable. – notbad.jpeg Jul 09 '15 at 15:02
  • @notbad.jpeg I already checked that. See the update at the end of the post: `top` tells me that 8 cores are 100%-loaded, the rest are idle. – oopcode Jul 09 '15 at 15:04
  • I think you need an independent check. Can you write a shell script which runs a single threaded program, but starts 24 instances in different processes? – quamrana Jul 09 '15 at 15:07
  • @quamrana Have a look at the latest update, please :) – oopcode Jul 09 '15 at 16:11

2 Answers


Even though you solved the issue I'll try to explain it to clarify the ideas.

From what I have read, numpy does a lot of "magic" to improve performance. One of the magic tricks is to set the CPU affinity of the process.

CPU affinity is an OS scheduler setting. It restricts a process to a given set of CPU cores, in the extreme case always the same core.

This improves performance by reducing the number of times the CPU cache is invalidated and by increasing the benefits of locality of reference. For computationally heavy tasks these factors do matter.
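To check whether a process has been restricted this way, one can compare the machine's CPU count with the process's current affinity. The following is a minimal sketch, assuming Python 3.3+ on Linux, where `os.sched_getaffinity` is available; the helper name `usable_cpus` is made up for illustration.

```python
import multiprocessing
import os


def usable_cpus():
    """Return the set of CPU indices the current process may run on.

    Falls back to every CPU reported by cpu_count() on platforms
    where os.sched_getaffinity does not exist (assumption: non-Linux).
    """
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))  # 0 means "the calling process"
    return set(range(multiprocessing.cpu_count()))


if __name__ == "__main__":
    total = multiprocessing.cpu_count()
    allowed = usable_cpus()
    print("CPUs in the machine: %d" % total)
    print("CPUs this process may use: %d %s" % (len(allowed), sorted(allowed)))
    if len(allowed) < total:
        print("Affinity is restricted; child processes will inherit this.")
```

Running this once before and once after `import numpy` would show whether the import changed the affinity on a given setup.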

What I don't like about numpy is that it does all this implicitly, which often puzzles developers.

Your processes were not running on all the cores because numpy sets the affinity of the parent process when you import the module. When you then spawn new processes, they inherit that affinity, so all of them fight over a few cores instead of efficiently using all the available ones.
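The inheritance itself can be observed without numpy. The sketch below pins the parent to a single CPU, starts a child, and lets the child report its own affinity. It assumes Python 3.3+ on Linux and the `fork` start method; the function names are hypothetical.

```python
import multiprocessing
import os


def report_affinity(queue):
    """Runs in the child: send back the CPUs the child may run on."""
    queue.put(sorted(os.sched_getaffinity(0)))


def child_affinity_after_pinning(cpus):
    """Pin this process to `cpus`, start a child, return the child's affinity.

    The parent's original affinity is restored before returning.
    """
    ctx = multiprocessing.get_context("fork")  # assumption: Unix with fork
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)
    try:
        queue = ctx.Queue()
        child = ctx.Process(target=report_affinity, args=(queue,))
        child.start()
        result = queue.get(timeout=30)
        child.join()
        return result
    finally:
        os.sched_setaffinity(0, original)


if __name__ == "__main__":
    first_cpu = min(os.sched_getaffinity(0))
    # The child should report only the CPU the parent was pinned to.
    print("Child sees CPUs:", child_affinity_after_pinning({first_cpu}))
```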

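The os.system("taskset -p 0xff %d" % os.getpid()) command is meant to instruct the OS to set the affinity back to all the cores. Note, however, that the mask 0xff covers only cores 0-7, so on a 24-core machine the mask needs to be 0xffffff.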

If you want to see it working on the Pool you can do the following trick.

import os
from multiprocessing import Pool, cpu_count


def set_affinity_on_worker():
    """When a new worker process is created, the affinity is set to all CPUs"""
    print("I'm the process %d, setting affinity to all CPUs." % os.getpid())
    # One mask bit per CPU; a fixed 0xff would cover only the first 8 cores.
    mask = (1 << cpu_count()) - 1
    os.system("taskset -p 0x%x %d" % (mask, os.getpid()))


if __name__ == '__main__':
    p = Pool(initializer=set_affinity_on_worker)
    ...
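As an alternative to shelling out to taskset, the same idea can be sketched with `os.sched_setaffinity`. This is only an illustration, assuming Python 3.3+ on Linux with the `fork` start method; `reset_affinity` and `run_pool` are made-up names, and the pool size of 2 is arbitrary.

```python
import multiprocessing
import os


def reset_affinity():
    """Pool initializer: allow this worker to run on every CPU."""
    os.sched_setaffinity(0, range(multiprocessing.cpu_count()))


def run_pool(n_tasks=4):
    """Start a small pool and return each task's view of its worker's affinity."""
    ctx = multiprocessing.get_context("fork")  # assumption: Unix with fork
    with ctx.Pool(processes=2, initializer=reset_affinity) as pool:
        # Each task calls os.sched_getaffinity(0) inside a worker,
        # where pid 0 refers to that worker process itself.
        return pool.map(os.sched_getaffinity, [0] * n_tasks)


if __name__ == "__main__":
    for cpus in run_pool():
        print("Worker may run on %d CPUs" % len(cpus))
```

Because the initializer runs once in every worker, each worker widens its own affinity regardless of what it inherited from the parent.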
noxdafox

In os.system("taskset -p 0xff %d" % os.getpid()), 0xff is a hexadecimal bitmask, corresponding to binary 1111 1111. Each bit in the bitmask corresponds to a CPU core, and a bit value of 1 means that the process may be executed on that core. 0xff therefore allows only 8 cores; to run on 24 cores you should use a mask of 0xffffff instead.

Correct command:

os.system("taskset -p 0xffffff %d" % os.getpid())
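Rather than hard-coding the mask, it can be derived from the number of cores. A small sketch of the arithmetic (the helper names are hypothetical):

```python
def affinity_mask(n_cpus):
    """Return a hex bitmask string with the lowest n_cpus bits set."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return hex((1 << n_cpus) - 1)


def cpus_in_mask(mask):
    """Count how many CPUs a hex bitmask string allows."""
    return bin(int(mask, 16)).count("1")


if __name__ == "__main__":
    print(affinity_mask(8))          # 0xff
    print(affinity_mask(24))         # 0xffffff
    print(cpus_in_mask("0xff"))      # 8
```

With 24 cores, `affinity_mask(24)` gives the 0xffffff value used in the command above.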
yraghu