I've been using numpy for quite some time now and am fond of just how much faster it is for simple operations on vectors and matrices, compared to e.g. looping over elements of the same array.
My understanding is that it uses SIMD CPU extensions, but according to some, at least part of its functionality makes use of multiprocessing (via OpenMP?). On the other hand, there are lots of questions here on SO (example) about speeding up operations on numpy arrays by using multiprocessing.
I have never clearly seen numpy use multiple cores at once, although it sometimes looks as if two cores (on an 8-core machine) are in use. But I may have been using the "wrong" functions for that, or using them in the wrong way, or maybe my matrices are too small to make it worth it?
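One rough way to check this, as a sketch only: compare the CPU time the process consumes with the wall-clock time for a large matrix product. A ratio near 1 suggests a single core, while a clearly higher ratio suggests several threads were busy. The function name `cpu_to_wall_ratio` and the matrix size are illustrative assumptions, not anything numpy provides.

```python
import time

import numpy as np


def cpu_to_wall_ratio(n=2000, repeats=3):
    """Return (CPU time) / (wall time) for repeated n x n matrix products."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))

    a @ b  # warm-up, so any thread-pool start-up cost is not timed

    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time summed over all threads of this process
    for _ in range(repeats):
        a @ b
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall


if __name__ == "__main__":
    print(f"CPU/wall ratio: {cpu_to_wall_ratio():.2f}")
```

The result depends on the BLAS library numpy was built against and on the matrix size, so a ratio close to 1 for small inputs does not by itself prove that multithreading is unavailable.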
The question therefore:
Are there numpy functions which can use multiple processes on a shared-memory machine, either via OpenMP or some other means?
If yes, is there some place in the numpy documentation with a definite list of those functions?
And in that case, is there some documentation on what a user of numpy would have to do to make sure they use all available CPU cores, or some specific predetermined number of cores?
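As far as I can tell, the usual knob for a fixed number of cores is a set of environment variables read by the underlying BLAS library rather than by numpy itself. The sketch below assumes numpy is linked against an OpenMP-based BLAS, OpenBLAS, or MKL. Which variable is honoured depends on that backend, and the value `"4"` is just a hypothetical choice.

```python
import os

# Assumption: these must be set before numpy is first imported,
# because the BLAS thread pool typically reads them at load time.
N_THREADS = "4"  # hypothetical thread count
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = N_THREADS

import numpy as np

# A matrix product is the kind of call that is delegated to BLAS.
a = np.random.default_rng(0).standard_normal((500, 500))
product = a @ a
```

The same variables can be exported in the shell before starting Python, which avoids the import-order constraint.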
I'm aware that there are libraries which permit splitting numpy arrays across multiple machines or compute nodes, but I suspect the use case for those is either handling more data than fits into local RAM, or speeding up processing beyond what a single multi-core machine can achieve. That, however, is not what this question is about.
Update
Given the comment by @talonmies (who states that by default there's no such functionality in numpy, and it would depend on LAPACK and BLAS): What's the easiest way to obtain a suitably-compiled numpy version which makes use of multiple CPU cores (and hopefully also SIMD extensions)?
Or is the reason numpy doesn't usually use multiple cores that most people for whom that matters have already switched to multiprocessing or things like dask, handling multiple cores explicitly rather than having only the numpy bits accelerated implicitly?
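Since the answer apparently depends on the BLAS and LAPACK libraries numpy was built against, a first step is to look at what a given install is linked to. The sketch below captures the text printed by `np.show_config()`. The helper name `numpy_build_info` is an assumption for illustration, and the exact layout of the output differs between numpy versions.

```python
import contextlib
import io

import numpy as np


def numpy_build_info():
    """Return the text that np.show_config() prints, as a single string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()  # prints build details, including BLAS/LAPACK info
    return buffer.getvalue()


if __name__ == "__main__":
    print(numpy_build_info())
```

Names such as OpenBLAS or MKL in that output would indicate a multithreaded BLAS is available.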