First, a SO search on [numpy] simd turns up a number of answers. One, https://stackoverflow.com/a/45798012/901925, found a `src/umath/simd.inc.src` file. It describes itself as "Currently contains sse2 functions that are built on amd64, x32 or non-generic builds (CFLAGS=-march=...)". This is low-level code that may, depending on the build, be incorporated into the numpy binaries. It's not something that you, as a Python-level user, will be able to detect or control.
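That said, you can at least inspect the build configuration from Python. `np.show_config()` prints the libraries numpy was built against, and recent numpy versions also include a section listing the SIMD extensions detected at build time (older versions omit that section):

```python
import numpy as np

# Prints BLAS/LAPACK build details; on recent numpy versions the output
# also includes a "SIMD Extensions" section showing the baseline and
# detected CPU features this particular build can use.
np.show_config()
```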
A more recent question with lots of hits is How is numpy so fast?. But the answers mostly deal with the c++ comparison code and its memory use, so they don't really address the numpy usage.
But for your purposes the real question is whether the operation uses compiled numpy methods, or does Python-level iteration over Python objects.
First, do you understand how numpy arrays are stored, and how that differs from lists? Without knowing that difference, a lot of numpy speed discussions will be hard to follow. As a general rule, using arrays as though they were lists, with iterations and list comprehensions, will be slower. And using lists in numpy functions incurs a speed penalty, as the list first has to be converted to an array.
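A rough timing sketch of those three cases (the exact numbers will vary by machine and array size, but the ordering is typical):

```python
import timeit
import numpy as np

a = np.arange(10_000)
lst = a.tolist()

# whole-array operation on the compiled data buffer
t_arr = timeit.timeit(lambda: a * 2, number=1000)
# element-by-element iteration at Python speed
t_list = timeit.timeit(lambda: [x * 2 for x in lst], number=1000)
# list passed to a numpy function: pays the list-to-array conversion
t_mixed = timeit.timeit(lambda: np.multiply(lst, 2), number=1000)

print(t_arr, t_list, t_mixed)
```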
Also, object dtype arrays store their data as object references, so their calculations operate at list-comprehension speeds. Fast numpy methods only work for numeric dtypes, ones that map onto c native types - floats, integers, etc.
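To illustrate, the same values stored both ways; the object version still supports `+`, but only by calling each element's Python-level methods:

```python
import numpy as np

num = np.arange(5, dtype=float)    # one buffer of native c doubles
obj = np.arange(5).astype(object)  # a buffer of references to Python ints

print(num.dtype, obj.dtype)   # float64 object
print(num + num)              # fast compiled ufunc loop
print(obj + obj)              # dispatches to each element's __add__
```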
As for your sample expressions:
`A + B`
Operators like this are implemented as ufuncs, which take full advantage of the array data storage. Since a ufunc can work with multidimensional arrays, and uses broadcasting, the underlying code is quite complex, and not something you or I can readily read. At some low level it may take advantage of processor caching and special instructions, but that's more a function of the c code macros and compiler options.
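A small example of what the `+` ufunc handles for you, including broadcasting a smaller array across a larger one:

```python
import numpy as np

A = np.arange(6).reshape(3, 2)   # shape (3, 2)
B = np.array([10, 100])          # shape (2,), broadcast across the rows

print(A + B)          # the + operator is sugar for the add ufunc
print(np.add(A, B))   # same result, called explicitly
```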
`np.sum(A, axis=0)`
`np.sum` is actually `np.add.reduce`, so the above comments apply. But for lists, the native Python `sum` is no slouch.
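For example, the two spellings give identical column sums:

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

print(np.sum(A, axis=0))         # column sums
print(np.add.reduce(A, axis=0))  # the same thing, one layer lower

lst = list(range(10_000))
print(sum(lst))                  # builtin sum on a plain list is fine too
```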
`np.where`
With a single (condition) argument this is the same as `np.nonzero`, which is one of the simpler compiled functions. It first uses `np.count_nonzero` to find out how many nonzero elements there are. It uses this count to allocate the tuple of index arrays that it will return, and then loops through the argument again to fill in the indices. It's reasonably fast because it loops through the array's data buffer in clean c code.
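Here's a rough Python-level sketch of that two-pass strategy for the 1d case. `nonzero_1d_sketch` is a made-up name, and the real implementation is compiled c over the raw data buffer; this toy version only shows the logic:

```python
import numpy as np

def nonzero_1d_sketch(a):
    # pass 1: count the nonzeros so the result can be allocated up front
    n = np.count_nonzero(a)
    out = np.empty(n, dtype=np.intp)
    # pass 2: walk the array again, filling in the indices
    k = 0
    for i, v in enumerate(a):
        if v:
            out[k] = i
            k += 1
    return (out,)

a = np.array([0, 3, 0, 5, 0, 0, 7])
print(nonzero_1d_sketch(a))  # (array([1, 3, 6]),)
print(np.nonzero(a))         # same result from the compiled version
```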
`np.apply_along_axis`
This is slow, even compared to list comprehensions. It has to call your function once for each 1d slice. That repeated call to a Python function is what takes the most time, more so than the actual iteration mechanism. Functions like this do not compile your function, so one way or another they are just covers for Python-level iteration. The Python code is available to be studied.
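A quick comparison, using a toy row function (the whole-array version avoids the per-row Python calls entirely):

```python
import timeit
import numpy as np

A = np.random.rand(1000, 10)

def f(row):                      # toy 1d -> scalar function
    return row.max() - row.min()

t_apply = timeit.timeit(lambda: np.apply_along_axis(f, 1, A), number=100)
t_comp = timeit.timeit(lambda: np.array([f(row) for row in A]), number=100)
t_whole = timeit.timeit(lambda: A.max(axis=1) - A.min(axis=1), number=100)

print(t_apply, t_comp, t_whole)
```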
`np.einsum`
This is a complex function, working in different ways depending on the inputs. For simpler cases it just uses `np.matmul`/`@`, which can be very fast, depending on what BLAS-like libraries you have. Years ago, when I wrote a patch for it, `einsum` used `nditer` in cython.
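For instance, a plain matrix product written both ways (when the `@` path is taken, the work lands in whatever BLAS your numpy links against):

```python
import numpy as np

A = np.random.rand(4, 5)
B = np.random.rand(5, 3)

C1 = np.einsum('ij,jk->ik', A, B)  # matrix product in einsum notation
C2 = A @ B                         # matmul, typically dispatched to BLAS

print(np.allclose(C1, C2))  # True
```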
`np.vstack`
This is a cover for `np.concatenate`. The Python code is easy to read; `concatenate` itself is compiled. But these functions should be used correctly, with a whole list of arrays. Repeated use in a loop is worse than list `append`.
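To show the difference, growing an array one row at a time versus collecting rows in a list and stacking once (sizes are illustrative):

```python
import timeit
import numpy as np

rows = [np.arange(5) for _ in range(1000)]

def grow_in_loop():
    out = np.empty((0, 5), dtype=int)
    for r in rows:
        out = np.vstack([out, r])  # reallocates and copies every pass
    return out

def stack_once():
    return np.vstack(rows)         # one concatenate over the whole list

print(timeit.timeit(grow_in_loop, number=10))
print(timeit.timeit(stack_once, number=10))
```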