First, a SO search on [numpy] simd turns up a number of answers. One, https://stackoverflow.com/a/45798012/901925, found a `src/umath/simd.inc.src` file. It describes itself as "Currently contains sse2 functions that are built on amd64, x32 or non-generic builds (CFLAGS=-march=...)". This is low-level code that may, depending on the build, be incorporated into the numpy binaries. It's not something that you, as a Python-level user, will be able to detect or control.
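That said, you can at least inspect the build configuration from Python. `np.show_config()` prints the libraries numpy was built against, and recent numpy versions also include a section listing the SIMD extensions detected at build time (older versions omit that section):

```python
import numpy as np

# Prints BLAS/LAPACK build details; on recent numpy versions the output
# also includes a "SIMD Extensions" section showing the baseline and
# detected CPU features this particular build can use.
np.show_config()
```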
A more recent question with lots of hits is How is numpy so fast?. But the answers mostly deal with the c++ comparison code and its memory use, so they don't really address the numpy usage.
But for your purposes the real question is whether the operation uses compiled numpy methods, or does Python-level iteration over Python objects.
First, do you understand how numpy arrays are stored, and how that differs from lists? Without knowing that difference, a lot of numpy speed discussions will be hard to follow. As a general rule, using arrays as though they were lists, with iterations and list comprehensions, will be slower. And using lists in numpy functions incurs a speed penalty, as the list first has to be converted to an array.
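A rough timing sketch of those three cases (the exact numbers will vary by machine and array size, but the ordering is typical):

```python
import timeit
import numpy as np

a = np.arange(10_000)
lst = a.tolist()

# whole-array operation on the compiled data buffer
t_arr = timeit.timeit(lambda: a * 2, number=1000)
# element-by-element iteration at Python speed
t_list = timeit.timeit(lambda: [x * 2 for x in lst], number=1000)
# list passed to a numpy function: pays the list-to-array conversion
t_mixed = timeit.timeit(lambda: np.multiply(lst, 2), number=1000)

print(t_arr, t_list, t_mixed)
```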
Also, object dtype arrays store their data as object references, so their calculations operate at list-comprehension speeds. Fast numpy methods only work for numeric dtypes, ones that map onto c native types - floats, integers, etc.
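To illustrate, the same values stored both ways; the object version still supports `+`, but only by calling each element's Python-level methods:

```python
import numpy as np

num = np.arange(5, dtype=float)    # one buffer of native c doubles
obj = np.arange(5).astype(object)  # a buffer of references to Python ints

print(num.dtype, obj.dtype)   # float64 object
print(num + num)              # fast compiled ufunc loop
print(obj + obj)              # dispatches to each element's __add__
```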
As for your sample expressions:
`A + B`
Operators like this are implemented as ufuncs, which take full advantage of the array data storage. Since a ufunc can work with multidimensional arrays, and uses broadcasting, the underlying code is quite complex, and not something you or I can readily read. At some low level it may take advantage of processor caching and special instructions, but that's more a function of the c code macros and compiler options.
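A small example of what the `+` ufunc handles for you, including broadcasting a smaller array across a larger one:

```python
import numpy as np

A = np.arange(6).reshape(3, 2)   # shape (3, 2)
B = np.array([10, 100])          # shape (2,), broadcast across the rows

print(A + B)          # the + operator is sugar for the add ufunc
print(np.add(A, B))   # same result, called explicitly
```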
`np.sum(A, axis=0)`
`np.sum` is actually `np.add.reduce`, so the above comments apply. But for lists, the native Python `sum` is no slouch.
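For example, the two spellings give identical column sums:

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

print(np.sum(A, axis=0))         # column sums
print(np.add.reduce(A, axis=0))  # the same thing, one layer lower

lst = list(range(10_000))
print(sum(lst))                  # builtin sum on a plain list is fine too
```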
`np.where`
With a single (condition) argument this is the same as `np.nonzero`, which is one of the simpler compiled functions. It first uses `np.count_nonzero` to find out how many nonzero elements there are. It uses this count to allocate the tuple of index arrays that it will return, and then loops through the argument again to fill in the indices. It's reasonably fast because it loops through the array's data buffer in clean c code.
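Here's a rough Python-level sketch of that two-pass strategy for the 1d case. `nonzero_1d_sketch` is a made-up name, and the real implementation is compiled c over the raw data buffer; this toy version only shows the logic:

```python
import numpy as np

def nonzero_1d_sketch(a):
    # pass 1: count the nonzeros so the result can be allocated up front
    n = np.count_nonzero(a)
    out = np.empty(n, dtype=np.intp)
    # pass 2: walk the array again, filling in the indices
    k = 0
    for i, v in enumerate(a):
        if v:
            out[k] = i
            k += 1
    return (out,)

a = np.array([0, 3, 0, 5, 0, 0, 7])
print(nonzero_1d_sketch(a))  # (array([1, 3, 6]),)
print(np.nonzero(a))         # same result from the compiled version
```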
`np.apply_along_axis`
This is slow, even compared to list comprehensions. It has to call your function once for each 1d slice. That repeated call to a Python function is what takes the most time, more so than the actual iteration mechanism. Functions like this do not compile your function, so one way or another they are just covers for Python-level iteration. The Python code is available to be studied.
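A quick comparison, using a toy row function (the whole-array version avoids the per-row Python calls entirely):

```python
import timeit
import numpy as np

A = np.random.rand(1000, 10)

def f(row):                      # toy 1d -> scalar function
    return row.max() - row.min()

t_apply = timeit.timeit(lambda: np.apply_along_axis(f, 1, A), number=100)
t_comp = timeit.timeit(lambda: np.array([f(row) for row in A]), number=100)
t_whole = timeit.timeit(lambda: A.max(axis=1) - A.min(axis=1), number=100)

print(t_apply, t_comp, t_whole)
```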
`np.einsum`
This is a complex function, working in different ways depending on the inputs. For simpler cases it just uses `np.matmul`/`@`, which can be very fast, depending on what BLAS-like libraries you have. Years ago, when I wrote a patch for it, `einsum` used `nditer` in cython.
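For instance, a plain matrix product written both ways (when the `@` path is taken, the work lands in whatever BLAS your numpy links against):

```python
import numpy as np

A = np.random.rand(4, 5)
B = np.random.rand(5, 3)

C1 = np.einsum('ij,jk->ik', A, B)  # matrix product in einsum notation
C2 = A @ B                         # matmul, typically dispatched to BLAS

print(np.allclose(C1, C2))  # True
```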
`np.vstack`
This is a cover for `np.concatenate`. The Python code is easy to read; `concatenate` itself is compiled. But these functions should be used correctly, with a whole list of arrays. Repeated use in a loop is worse than list `append`.
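To show the difference, growing an array one row at a time versus collecting rows in a list and stacking once (sizes are illustrative):

```python
import timeit
import numpy as np

rows = [np.arange(5) for _ in range(1000)]

def grow_in_loop():
    out = np.empty((0, 5), dtype=int)
    for r in rows:
        out = np.vstack([out, r])  # reallocates and copies every pass
    return out

def stack_once():
    return np.vstack(rows)         # one concatenate over the whole list

print(timeit.timeit(grow_in_loop, number=10))
print(timeit.timeit(stack_once, number=10))
```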