My numpy build doesn't use multiple CPU cores

Question

Note: Read this question carefully. I understand that CPython has the GIL. Numpy is normally not limited by the GIL for most functions.

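Update: This turns out to be the same issue described in this question. If you link numpy against OpenBLAS, OpenBLAS will set the CPU affinity of the whole process as soon as you import numpy, which restricts every thread in the process to the same core. This can be fixed by rebuilding OpenBLAS with NO_AFFINITY=1 or, in recent OpenBLAS versions, by setting the environment variable OPENBLAS_MAIN_FREE=1 at run time.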

My app uses numpy, which I build from source (i.e. without easy_install, etc.). Normally, my custom build works just fine. Recently though, I did something (to my build? to my OS?) that is preventing numpy from using multiple CPU cores.

Consider this simple program, which does the following:

1. Run a silly workload in a worker thread.
2. Run the same workload twice more, in two parallel threads.

On a properly working numpy install, the second (parallel) step is almost as fast as the first step. But in my special build, the second step takes twice as long! Only 1 CPU is used. It's acting as if numpy.sqrt doesn't release the GIL, but I know it should.

Man, I don't know how to break a numpy build like this even if I wanted to. It refuses to use more than 1 CPU core! How did I do this? How do I fix it?

Edit: More details: numpy-1.7.0, gcc, Linux (Fedora 16), but I don't think those specifics are too important. I've built with this configuration before without running into this problem. I guess I'm wondering if there's a particular OS or python setting that can cause behavior like this.

import numpy, threading, time

# Three independent float32 arrays (about 200 MB each), one per worker thread.
a1 = numpy.random.random((500,500,200)).astype(numpy.float32)
a2 = numpy.random.random((500,500,200)).astype(numpy.float32)
a3 = numpy.random.random((500,500,200)).astype(numpy.float32)

def numpy_workload(name, a):
    # print() with a single argument works on both Python 2 and Python 3.
    print("starting numpy_workload " + name)
    for _ in range(10):
        numpy.sqrt(a)  # expected to release the GIL while it computes
    print("finished numpy_workload " + name)

t1 = threading.Thread(target=lambda: numpy_workload("1", a1))
t2 = threading.Thread(target=lambda: numpy_workload("2", a2))
t3 = threading.Thread(target=lambda: numpy_workload("3", a3))

# Step 1: run the workload once, in a single worker thread.
start = time.time()
t1.start()
t1.join()
stop = time.time()
print("Single thread done after {} seconds\n".format(stop - start))

# Step 2: run the same workload in two threads at the same time.
start = time.time()
t2.start()
t3.start()
t2.join()
t3.join()
stop = time.time()
print("Two threads done after {} seconds\n".format(stop - start))

I guess the first question is how did you build it? Which numpy version? Which compilers? Which platform? Did you disable BLAS, LAPACK or ATLAS for the build? — John Lyon, Oct 22 '13 at 21:15
I built using OpenBLAS, and then tried again using ATLAS. Do you think that the BLAS implementation I'm using could affect numpy's multithreading behavior? I wondered about that, but my best guess is no. Do you suspect otherwise? — Stuart Berg, Oct 22 '13 at 21:20
It could, I'm not sure. I know the default Ubuntu installation of numpy uses ATLAS. There's an answer that [details some of the options](http://stackoverflow.com/questions/5260068/multithreaded-blas-in-python-numpy), but I also doubt that your install of OpenBLAS is the issue here. I was more suspecting a compilation flag of some kind. — John Lyon, Oct 22 '13 at 21:25
[See this too](http://stackoverflow.com/questions/11787657/supposed-automatically-threaded-scipy-and-numpy-functions-arent-making-use-of-m) - what does `numpy.show_config()` say? Can you check your `.bashrc` for the presence of `MKL_NUM_THREADS=1` or similar? — John Lyon, Oct 22 '13 at 21:28
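
The check suggested in this comment can be sketched in a few lines. This is only an illustration: the helper name `thread_env_report` and the list of variables are assumptions made for the example, not part of numpy. It assumes Python 3 and only calls `numpy.show_config()` if numpy can be imported.

```python
import os

# Environment variables commonly read by threaded BLAS / OpenMP runtimes.
# (Illustrative list; other libraries may read other variables.)
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")

def thread_env_report(environ=None):
    """Return {variable: value or None} for each thread-count variable."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in THREAD_VARS}

if __name__ == "__main__":
    for name, value in thread_env_report().items():
        print("{} = {}".format(name, "(unset)" if value is None else value))
    try:
        import numpy
    except ImportError:
        print("numpy is not installed")
    else:
        # Prints the BLAS/LAPACK build information numpy was configured with.
        numpy.show_config()
```

A value of 1 in any of these variables would limit the corresponding library to one thread, although that would not by itself explain a plain `numpy.sqrt` workload being confined to one core.
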
Whoops, that would have been for `MKL`, an alternative BLAS implementation. Try doing an `export OPENBLAS_NUM_THREADS=4` and re-running your test. Or, if you compiled OpenBLAS with `USE_OPENMP=1`, you should set `OMP_NUM_THREADS` instead of `OPENBLAS_NUM_THREADS`. — John Lyon, Oct 22 '13 at 21:35
Crap, I'm on a bus now, won't be home to try that for 30 minutes. Thanks for your help. — Stuart Berg, Oct 22 '13 at 21:37
I am skeptical that BLAS has much to do with it, though. The sqrt function in my example shouldn't rely on BLAS. In fact, I suppose there's a chance that I'm having some general issue with python, not numpy specifically. — Stuart Berg, Oct 22 '13 at 21:40
OpenBLAS could be [setting your CPU affinity](https://stat.ethz.ch/pipermail/r-sig-hpc/2012-April/001353.html), potentially disabling the use of multiple cores across all of Python. This can be fixed by rebuilding OpenBLAS with `NO_AFFINITY=1`. Again, I'm not sure this is the cause, but it's certainly possible. — John Lyon, Oct 22 '13 at 22:08
Sounds an awful lot like [this issue](http://stackoverflow.com/questions/15639779/what-determines-whether-different-python-processes-are-assigned-to-the-same-or-d). Can you run processes on multiple cores if you don't import numpy (or any other module that imports numpy functions into its namespace)? Does the problem go away if you call `os.system("taskset -p 0xff %d" % os.getpid())` after importing numpy? — ali_m, Oct 22 '13 at 22:24
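
The same check can be done without shelling out to `taskset`. The sketch below assumes Python 3.3+ on a platform that provides `os.sched_getaffinity` and `os.sched_setaffinity` (Linux does). The helper names are made up for this example. It records which CPUs the process may run on before and after an import, and restores the earlier set if the import narrowed it.

```python
import importlib
import os

def current_affinity():
    """Return the sorted list of CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))  # 0 means "this process"
    # Fallback for platforms without the affinity API: assume all CPUs.
    return list(range(os.cpu_count() or 1))

def affinity_around_import(module_name):
    """Import a module and return (affinity_before, affinity_after)."""
    before = current_affinity()
    importlib.import_module(module_name)
    after = current_affinity()
    return before, after

def restore_affinity(cpus):
    """Allow this process to run on the given CPUs; return the new set."""
    os.sched_setaffinity(0, cpus)
    return current_affinity()

if __name__ == "__main__":
    try:
        before, after = affinity_around_import("numpy")
    except ImportError:
        print("numpy is not installed")
    else:
        print("allowed CPUs before import:", before)
        print("allowed CPUs after import: ", after)
        if len(after) < len(before) and hasattr(os, "sched_setaffinity"):
            print("restored to:", restore_affinity(before))
```

Restoring the set that was recorded before the import avoids hard-coding a mask such as `0xff`, which only matches a machine with eight CPUs.
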
@ali_m, Ah-ha! Using that line of code, the problem disappears. The strange thing is that I switched my build from OpenBLAS to ATLAS. (I rebuilt everything from scratch, including Python and numpy.) And the problem didn't go away. As far as you know, does ATLAS also mess with CPU affinity? — Stuart Berg, Oct 22 '13 at 23:40
Also, thanks, @jozzas, it looks like your hunch about CPU affinity was correct. Tomorrow, I will try compiling OpenBLAS with that flag as you suggested. — Stuart Berg, Oct 22 '13 at 23:41
You don't necessarily have to re-compile OpenBLAS - if you're using a recent version you can also disable the CPU affinity-resetting behaviour at run-time by setting the environment variable `OPENBLAS_MAIN_FREE=1` — ali_m, Oct 23 '13 at 08:31
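
The variable can also be set from inside the program, as long as that happens before numpy (and therefore OpenBLAS) is loaded. The sketch below is a minimal illustration under that assumption. The function name is hypothetical, and whether the variable has any effect depends on the OpenBLAS version in use.

```python
import os
import sys

def request_openblas_main_free(environ=None, modules=None):
    """Set OPENBLAS_MAIN_FREE=1 unless the variable is already defined.

    Returns True if numpy had not been imported yet when this was called.
    Returns False if numpy was already imported, in which case the setting
    is assumed to come too late for the current process.
    """
    env = os.environ if environ is None else environ
    mods = sys.modules if modules is None else modules
    already_imported = "numpy" in mods
    env.setdefault("OPENBLAS_MAIN_FREE", "1")
    return not already_imported

if __name__ == "__main__":
    in_time = request_openblas_main_free()
    print("OPENBLAS_MAIN_FREE =", os.environ["OPENBLAS_MAIN_FREE"])
    print("set before numpy was imported:", in_time)
    # Any "import numpy" should come after this point.
```

Exporting the variable in the shell before starting Python achieves the same thing and does not depend on import order inside the program.
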
Thanks for the assistance, guys. I thought I had eliminated OpenBLAS as a potential problem because I built ATLAS, but I was wrong. My numpy build found libraries for both OpenBLAS and ATLAS, and apparently chose to link against OpenBLAS. Then OpenBLAS manipulated the CPU affinity as you suspected. I can fix it by either: (1) properly linking against ATLAS instead of OpenBLAS, (2) Rebuilding OpenBLAS with NO_AFFINITY=1, or (3) using the OPENBLAS_MAIN_FREE env variable as described above. — Stuart Berg, Oct 23 '13 at 13:33
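
One way to see which BLAS library a numpy build actually loaded, rather than which one it was meant to use, is to look at the shared objects mapped into the running process. The sketch below is Linux-specific and relies on the `/proc/self/maps` format. The function name and keyword list are assumptions chosen for this example, and matching on file names is only a heuristic.

```python
import os

# Substrings that suggest a BLAS/LAPACK shared library (heuristic only).
BLAS_KEYWORDS = ("openblas", "atlas", "mkl", "blas", "lapack")

def loaded_blas_libraries(maps_text):
    """Return sorted, unique paths of mapped .so files that look like BLAS.

    maps_text is the content of /proc/<pid>/maps, where each line is:
    address perms offset dev inode [pathname]
    """
    found = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5].strip()
        name = os.path.basename(path).lower()
        if ".so" in name and any(key in name for key in BLAS_KEYWORDS):
            found.add(path)
    return sorted(found)

if __name__ == "__main__":
    try:
        import numpy  # noqa: F401  (imported only so its libraries get loaded)
    except ImportError:
        print("numpy is not installed")
    maps_path = "/proc/self/maps"
    if os.path.exists(maps_path):
        with open(maps_path) as handle:
            for path in loaded_blas_libraries(handle.read()):
                print(path)
```

If a path containing `openblas` shows up in a build that was meant to use ATLAS, the build has linked against the wrong library, which is the situation described in this comment.
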
It looks like this would be a good opportunity for you to answer your own question, or else mark it as a strict duplicate of 15639779. (I can't tell which would be more appropriate, or I'd do it myself.) — Air, Nov 15 '13 at 16:46
