[For row-major data, SIMD access to a row is fast, but to a column slow]
Yes, that is the nature of the x86-64 and similar architectures. Accessing consecutive data in memory is fast, but accessing scattered data (whether randomly or in a regular pattern) is slow. It is a consequence of having processor caches.
There are two basic approaches: copy the data into a new order that facilitates better access patterns, or do the computations in an order that allows better access patterns.
No, there are no rules of thumb or golden tricks that make it all just work. In fact, even comparing different implementations is tricky, because there are so many complex interactions (from cache latencies to operation interleaving, to cache and memory access patterns) that the results depend heavily on the particular hardware and the dataset at hand.
Let's look at the typical example case, matrix-matrix multiplication. Let's say we multiply two 5×5 matrices (c = a × b), using standard C row-major data order:
c00 c01 c02 c03 c04 a00 a01 a02 a03 a04 b00 b01 b02 b03 b04
c05 c06 c07 c08 c09 a05 a06 a07 a08 a09 b05 b06 b07 b08 b09
c10 c11 c12 c13 c14 = a10 a11 a12 a13 a14 × b10 b11 b12 b13 b14
c15 c16 c17 c18 c19 a15 a16 a17 a18 a19 b15 b16 b17 b18 b19
c20 c21 c22 c23 c24 a20 a21 a22 a23 a24 b20 b21 b22 b23 b24
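Here, c00…c24 are the 25 consecutive elements of the row-major array holding c, and likewise for a and b. As a scalar baseline (a sketch of mine, not part of the original argument), the textbook triple loop reads b down a column in its innermost loop, which is exactly the slow, strided access pattern:

/* Naive c = a × b, n×n row-major matrices.
   The innermost loop reads b[k*n + j] with stride n, i.e. it walks
   down a column of b: the scattered access pattern that hurts. */
void matmul_naive(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float s = 0.0f;
            for (int k = 0; k < n; k++)
                s += a[i*n + k] * b[k*n + j];
            c[i*n + j] = s;
        }
}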
If we write the result as vertical SIMD vector registers with five components, we have
c00 a00 b00 a01 b05 a02 b10 a03 b15 a04 b20
c01 a00 b01 a01 b06 a02 b11 a03 b16 a04 b21
c02 = a00 × b02 + a01 × b07 + a02 × b12 + a03 × b17 + a04 × b22
c03 a00 b03 a01 b08 a02 b13 a03 b18 a04 b23
c04 a00 b04 a01 b09 a02 b14 a03 b19 a04 b24
c05 a05 b00 a06 b05 a07 b10 a08 b15 a09 b20
c06 a05 b01 a06 b06 a07 b11 a08 b16 a09 b21
c07 = a05 × b02 + a06 × b07 + a07 × b12 + a08 × b17 + a09 × b22
c08 a05 b03 a06 b08 a07 b13 a08 b18 a09 b23
c09 a05 b04 a06 b09 a07 b14 a08 b19 a09 b24
and so on. In other words, if c has the same order as b, we can use SIMD registers with consecutive memory contents for both c and b, and only gather a. Furthermore, the SIMD registers for a have all components the same value.
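As a rough scalar sketch of that access pattern (my code, with hypothetical names; sizes hard-coded to 5×5 to match the figures): one row of c at a time, broadcasting one element of a and multiplying it by a whole row of b, so that both b and c are accessed consecutively:

/* c = a × b for 5×5 row-major matrices, in the order shown above.
   a[i*5 + k] is a broadcast value (all SIMD lanes equal); the rows
   of b and the row accumulator for c are consecutive in memory. */
void matmul_5x5_rowwise(float c[25], const float a[25], const float b[25])
{
    for (int i = 0; i < 5; i++) {
        float row[5] = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
        for (int k = 0; k < 5; k++) {
            const float aik = a[i*5 + k];       /* the broadcast value */
            for (int j = 0; j < 5; j++)
                row[j] += aik * b[k*5 + j];     /* consecutive b */
        }
        for (int j = 0; j < 5; j++)             /* consecutive c */
            c[i*5 + j] = row[j];
    }
}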
Note, however, that the b registers repeat for all five rows of c. So, it might be better to initialize c to zero, then do the additions with products having the same b SIMD registers:
c00 a00 b00 c05 a05 b00 c10 a10 b00 c15 a15 b00 c20 a20 b00
c01 a00 b01 c06 a05 b01 c11 a10 b01 c16 a15 b01 c21 a20 b01
c02 += a00 × b02, c07 += a05 × b02, c12 += a10 × b02, c17 += a15 × b02, c22 += a20 × b02
c03 a00 b03 c08 a05 b03 c13 a10 b03 c18 a15 b03 c23 a20 b03
c04 a00 b04 c09 a05 b04 c14 a10 b04 c19 a15 b04 c24 a20 b04
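A sketch of that variant (again mine, with hypothetical names): clear c first, then for each row of b apply it to every row of c before moving on, so each b row is loaded once per pass instead of once per row of c:

/* c = a × b for 5×5 row-major matrices: c is cleared first; for each k
   the same row of b is reused for all five rows of c, each scaled by
   the broadcast value a[i*5 + k]. */
void matmul_5x5_breuse(float c[25], const float a[25], const float b[25])
{
    for (int i = 0; i < 25; i++)
        c[i] = 0.0f;

    for (int k = 0; k < 5; k++)              /* one row of b per pass */
        for (int i = 0; i < 5; i++) {        /* every row of c */
            const float aik = a[i*5 + k];    /* broadcast; a column of a */
            for (int j = 0; j < 5; j++)
                c[i*5 + j] += aik * b[k*5 + j];
        }
}

Note that a[i*5 + k] is now read down a column of a, which is what the transpose copy discussed next addresses.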
If we transposed a first, then the SIMD vector registers for a would also get values from consecutive memory locations. In fact, if a is large enough, linearizing the memory access pattern for a too gives a large enough speed boost that it is faster to do a transpose copy first (using uint32_t for floats, and uint64_t for doubles; i.e. not using SIMD or floating point at all for the transpose, but just copying the storage in transposed order).
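A minimal sketch of such a transpose copy (my code, for the float/uint32_t case; the double/uint64_t version is analogous):

#include <stdint.h>
#include <string.h>

/* Transpose-copy a rows×cols row-major float matrix into cols×rows order.
   Elements are moved as raw uint32_t storage (via memcpy, to stay within
   the aliasing rules), so no SIMD or floating-point unit is involved. */
void transpose_copy_float(float *dst, const float *src, int rows, int cols)
{
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++) {
            uint32_t bits;
            memcpy(&bits, &src[i*cols + j], sizeof bits);
            memcpy(&dst[j*rows + i], &bits, sizeof bits);
        }
}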
Note that the situation with column-major data order, i.e. the data order transposed compared to the above, is very similar. There is a deep symmetry here. For example, if c and b have the same data order, and a the opposite data order, you can SIMD-vectorize the matrix product efficiently without having to copy any data. Only the summing differs, as that depends on the data order, and matrix multiplication is not commutative (a×b != b×a).
Obviously, a major wrinkle is that the SIMD vector registers have a fixed size, so instead of using a complete row as a register as in the example above, you can only use partial rows. (If the number of columns in the result is not a multiple of the SIMD register width, you have that partial vector to worry about as well.)
SSE and AVX have a relatively large number of registers (8, 16, or 32, depending on the set of extensions used), and depending on the particular processor type, may be able to execute several vector operations simultaneously, or at least with lower overall latency if unrelated vector operations are interleaved. So even the choice of how wide a chunk to operate on at once, and whether that chunk looks more like an extended vector or more like a block submatrix, is a matter of testing and comparison.
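For illustration (a sketch of mine, assuming AVX and FMA are available; compile with e.g. -mavx -mfma or -march=native on GCC or Clang), here is the row-oriented formulation above using a fixed eight-wide register, with a scalar loop for the leftover columns:

#include <immintrin.h>    /* also pulled in by <x86intrin.h> */

/* c = a × b, n×n row-major single-precision matrices, eight columns of c
   at a time. The broadcast of a[i*n + k] and the consecutive load from
   row k of b match the register layout described above; the final scalar
   loop handles the partial vector when n is not a multiple of 8. */
void matmul_avx(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++) {
        int j = 0;
        for (; j + 8 <= n; j += 8) {
            __m256 acc = _mm256_setzero_ps();
            for (int k = 0; k < n; k++) {
                const __m256 aik  = _mm256_set1_ps(a[i*n + k]);   /* broadcast */
                const __m256 brow = _mm256_loadu_ps(&b[k*n + j]); /* consecutive */
                acc = _mm256_fmadd_ps(aik, brow, acc);
            }
            _mm256_storeu_ps(&c[i*n + j], acc);
        }
        for (; j < n; j++) {                   /* leftover columns, scalar */
            float s = 0.0f;
            for (int k = 0; k < n; k++)
                s += a[i*n + k] * b[k*n + j];
            c[i*n + j] = s;
        }
    }
}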
So how do I do the matrix-matrix multiplication most efficiently using SIMD?
Like I said, that depends on the data set. No easy answers, I'm afraid.
The main parameters for choosing the most efficient approach are the sizes and memory ordering of the multiplicand and result matrices.
It gets even more interesting if you calculate the product of more than two matrices of different sizes, because the number of scalar operations then depends on the order in which the products are evaluated.
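For a sense of scale (standard textbook numbers, not taken from anything above): with A of size 10×100, B of size 100×5, and D of size 5×50, computing (A×B)×D costs 10·100·5 + 10·5·50 = 7500 scalar multiplications, whereas A×(B×D) costs 100·5·50 + 10·100·50 = 75000; the same result, but ten times the work.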
Why are you so discouraging?
I'm not, actually. All of the above means that not too many people can handle this kind of complexity and stay sane and productive, so there are a lot of undiscovered approaches, and a lot to gain in real-world performance.
Even if we ignore the SIMD intrinsics compilers provide (<x86intrin.h> in this case), we can apply the logic above when designing internal data structures, so that the C compiler we use has the best opportunities for vectorizing the calculations for us. (Compilers are not very good at it yet, though. Like I said, this is complex stuff. Some people like Fortran better than C, because its expressions and rules make it easier for Fortran compilers to optimize and vectorize them.)
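For instance (my sketch, not something prescribed above), writing the accumulation loop with restrict-qualified pointers and purely consecutive inner accesses gives GCC and Clang a fair chance to vectorize it on their own at -O3:

/* Same computation as the earlier sketches, written so the compiler can
   see that c, a and b do not alias and auto-vectorize the inner j loop. */
void matmul_autovec(float *restrict c, const float *restrict a,
                    const float *restrict b, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            c[i*n + j] = 0.0f;
        for (int k = 0; k < n; k++) {
            const float aik = a[i*n + k];
            for (int j = 0; j < n; j++)
                c[i*n + j] += aik * b[k*n + j];
        }
    }
}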
If this were simple or easy, the solutions would be well known by now. They aren't, because this is not. But that does not mean it is impossible or out of our reach; it only means that smart enough developers haven't yet put enough effort into unraveling it.