In a helpful but somewhat dated November 2006 article on vectorizing code with vDSP, the author makes the statement:
"Important to keep in mind is the fact that only operations with strides equal to one will deliver blazingly fast vectorized code."
Is this still true today, even on newer Intel processors with their more capable SIMD instruction sets?
I ask because I am in the process of writing some matrix math routines, and I have just started switching them all to Fortran-like column-major ordering to be more readily compatible with MATLAB, BLAS, and LAPACK. But now I find that some of my vDSP calls need to work on vectors that are no longer contiguous…
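To make the layout issue concrete, here is a minimal sketch with made-up sizes and values: in column-major order, element (i, j) of an M-by-N matrix of floats lives at a[i + j*M], so a column is contiguous (stride 1) but the elements of a row are M floats apart.

    #include <stdio.h>

    int main(void)
    {
        enum { M = 4, N = 3 };
        float a[M * N];

        /* Fill the matrix column by column: element (i, j) is a[i + j*M]. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < M; i++)
                a[i + j * M] = 10.0f * i + j;

        /* Column 2 is contiguous in memory (stride 1)... */
        for (int i = 0; i < M; i++)
            printf("a(%d,2) = %g\n", i, a[i + 2 * M]);

        /* ...but walking row 1 means hopping M floats at a time (stride M). */
        for (int j = 0; j < N; j++)
            printf("a(1,%d) = %g\n", j, a[1 + j * M]);

        return 0;
    }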
At present these vDSP calls are the bottleneck in my code. That may not always be the case, but for now at least I would hate to slow them down just to make calling those other libraries simpler.
My most frequently called vDSP routine right now is vDSP_distancesq, in case that makes a difference.
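The call pattern I'm worried about looks roughly like this (the wrapper function and matrix names are mine, just for illustration; only vDSP_distancesq itself is from Accelerate). A column, or any contiguous vector, gets a stride of 1, while a row of a column-major matrix needs a stride of M:

    #include <Accelerate/Accelerate.h>

    /* Sum of squared differences between row i of two column-major
       M-by-N matrices a and b (hypothetical names, for illustration).
       Row elements are M floats apart, so the stride is M, not 1. */
    static float row_distance_sq(const float *a, const float *b,
                                 vDSP_Length i, vDSP_Length M, vDSP_Length N)
    {
        float result = 0.0f;
        vDSP_distancesq(a + i, (vDSP_Stride)M,   /* row i of a, stride M */
                        b + i, (vDSP_Stride)M,   /* row i of b, stride M */
                        &result, N);             /* N elements per row   */
        return result;
    }

    /* By contrast, a whole column stays contiguous:
       vDSP_distancesq(a + j*M, 1, b + j*M, 1, &result, M); */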