I got some speedup in my code when I linked NumPy against MKL. It's still not fast enough, so we are considering using Cython. The approach I have in mind is to use CythonGSL to perform the expensive operations in Cython using GSL's BLAS functions. However, there's a chance this is a waste of time, because NumPy is already handing some of its work to MKL.
The problem is that I don't know how much of the work, or exactly which parts, MKL is doing. The expensive bits of my code are calls to np.sum and np.dot. I suspect that with MKL linked the code is already about as optimized as it can be, but I'm not sure. Can someone who knows how NumPy + MKL behaves tell me whether I'm probably wasting my time with a Cython implementation?
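For reference, here is a minimal sketch of how one could check which BLAS library a NumPy build is linked against and time the two operations in isolation. The array size `n` and the helper `time_call` are hypothetical, chosen only for illustration; they are not taken from my actual code.

```python
import timeit
import numpy as np

# Print the BLAS/LAPACK libraries this NumPy build was linked against.
np.show_config()

# Hypothetical problem size, for illustration only.
n = 500
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
b = rng.standard_normal((n, n))

def time_call(fn, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))

# For 2-D float64 arrays, np.dot is dispatched to the linked BLAS (gemm).
t_dot = time_call(lambda: np.dot(a, b))
# np.sum uses NumPy's own reduction loop rather than a BLAS routine.
t_sum = time_call(lambda: np.sum(a))

print(f"np.dot: {t_dot:.6f} s")
print(f"np.sum: {t_sum:.6f} s")
```

Running this with the real array shapes should indicate whether the time is dominated by the BLAS-backed np.dot calls or by the np.sum reductions, which the linked BLAS library does not handle.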