Note: Please read this question carefully. I understand that CPython has the GIL, but numpy releases it in most of its functions, so it is normally not the limiting factor.
Update: This turns out to be the same issue described in this question. If you link numpy against OpenBLAS, it will set the CPU affinity of the whole process as soon as you import numpy. This can be fixed with a flag to the OpenBLAS build.
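For anyone who wants to check whether the same thing is happening on their machine, here is a minimal sketch that compares the process's CPU affinity before and after importing numpy. It assumes Linux and Python 3.3+ (where `os.sched_getaffinity` exists); the helper names `allowed_cpus` and `restore_affinity` are made up for this example and are not part of numpy or OpenBLAS.

```python
import os

def allowed_cpus():
    """Sorted CPU ids this process may run on, or None where the OS call is missing."""
    getter = getattr(os, "sched_getaffinity", None)  # Linux only, Python 3.3+
    return sorted(getter(0)) if getter is not None else None

def restore_affinity(cpus):
    """Re-apply a previously recorded CPU set; does nothing if unsupported."""
    setter = getattr(os, "sched_setaffinity", None)
    if setter is not None and cpus:
        setter(0, cpus)

before = allowed_cpus()          # snapshot taken before numpy (and its BLAS) load
try:
    import numpy                 # an affected OpenBLAS build would change the mask here
except ImportError:
    numpy = None                 # assumption: keep the sketch runnable without numpy
after = allowed_cpus()

if before is None:
    print("CPU affinity cannot be queried on this platform")
elif after != before:
    print("affinity changed on import: {} -> {}".format(before, after))
    restore_affinity(before)     # workaround: put the original CPU set back
else:
    print("affinity unchanged: {}".format(after))
```

Restoring the mask at runtime is only a workaround for a build you cannot change; rebuilding OpenBLAS with the relevant flag, as described above, addresses the cause.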
My app uses numpy, which I build from source (i.e. without easy_install, etc.). Normally, my custom build works just fine. Recently though, I did something (to my build? to my OS?) that is preventing numpy from using multiple CPU cores.
Consider this simple program, which does the following:
- Run a silly workload in a worker thread.
- Run the same workload twice more, in two parallel threads.
On a properly working numpy install, the second (parallel) step is almost as fast as the first step. But in my special build, the second step takes twice as long! Only 1 CPU is used. It's acting as if numpy.sqrt doesn't release the GIL, but I know it should.
Man, I don't know how to break a numpy build like this even if I wanted to. It refuses to use more than 1 CPU core! How did I do this? How do I fix it?
Edit: More details: numpy-1.7.0, gcc, Linux (Fedora 16), but I don't think those specifics are too important. I've built with this configuration before without running into this problem. I guess I'm wondering if there's a particular OS or python setting that can cause behavior like this.
import numpy, threading, time

# Three independent arrays, one per worker thread
a1 = numpy.random.random((500,500,200)).astype(numpy.float32)
a2 = numpy.random.random((500,500,200)).astype(numpy.float32)
a3 = numpy.random.random((500,500,200)).astype(numpy.float32)

def numpy_workload(name, a):
    print("starting numpy_workload " + name)
    for _ in range(10):
        numpy.sqrt(a)
    print("finished numpy_workload " + name)

t1 = threading.Thread(target=lambda: numpy_workload("1", a1))
t2 = threading.Thread(target=lambda: numpy_workload("2", a2))
t3 = threading.Thread(target=lambda: numpy_workload("3", a3))

# Step 1: one worker thread on its own
start = time.time()
t1.start()
t1.join()
stop = time.time()
print("Single thread done after {} seconds\n".format(stop - start))

# Step 2: two worker threads in parallel
start = time.time()
t2.start()
t3.start()
t2.join()
t3.join()
stop = time.time()
print("Two threads done after {} seconds\n".format(stop - start))