
Is there any way to optimize this simple `for` loop? I need the fastest possible execution time.

#include <stdlib.h>

typedef struct
{
  float real;
  float img;
} my_complex;

my_complex* Q = (my_complex*)malloc(8 * 8 * 1960 * sizeof(my_complex));

float sq = 0;
int i, j, k;
for (i = 0; i < 1960; i++)
  for (j = 0; j < 8; j++)
    for (k = 0; k < 8; k++)
    {
      sq += Q[i * 8 * 8 + k * 8 + j].real * Q[i * 8 * 8 + k * 8 + j].real
          + Q[i * 8 * 8 + k * 8 + j].img * Q[i * 8 * 8 + k * 8 + j].img;
    }
Krush
  • `i * 8 * 8 + k * 8 + j` is calculated 4 times. But it is likely optimized by the compiler. – Eugene Sh. Jun 07 '22 at 14:00
  • `for (int i = 0; i < 1960 * 8 * 8; i++) sq += Q[i].real * Q[i].real + Q[i].img * Q[i].img;`. Many fewer loops; many fewer complex index calculations. – Jonathan Leffler Jun 07 '22 at 14:06 (a compilable sketch of this appears after the comments)
  • @JonathanLeffler You mean fewer loops in the code, right? Same number of iterations, though. – Eugene Sh. Jun 07 '22 at 14:08
  • This is not a matrix multiplication, BTW. It is merely calculating the sum of squared magnitudes of the matrix elements. – Eugene Sh. Jun 07 '22 at 14:10
  • @EugeneSh.: I mean one loop instead of three loops. Yes, the total number of operations in terms of multiplication and addition of floating-point numbers is the same — but there is far less computation in the indexes. Indeed, you could go so far as to use `my_complex *vp = Q; for (int i = 0; i < 1960 * 8 * 8; i++, vp++) sq += vp->real * vp->real + vp->img * vp->img;` which avoids the subscripting altogether. (I observe that the array initialization is not shown — I am assuming that the code does exist so that there isn't undefined behaviour. And if it were my code, I'd use `imag`, not `img`.) – Jonathan Leffler Jun 07 '22 at 14:12
  • @JonathanLeffler Given the relative positions of the `j` and `k` loops and the indexing calculation in the posted code, it should also be much more cache friendly. – Bob__ Jun 07 '22 at 14:37
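
Putting the comment suggestions together, a compilable sketch of the flattened loop (the function name is illustrative) might be:

float sum_sq_magnitudes(const my_complex* Q)
{
  float sq = 0.f;
  /* One flat loop over all 1960*8*8 elements: same arithmetic,
     much simpler indexing, and purely sequential memory access. */
  for (int i = 0; i < 1960 * 8 * 8; i++)
    sq += Q[i].real * Q[i].real + Q[i].img * Q[i].img;
  return sq;
}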

1 Answer


First of all, mainstream compilers are able to automatically unroll the loops (assuming optimizations are enabled), so index calculation is clearly not an issue. For example, here is the assembly code generated by Clang (see it on Godbolt):

compute:                                # @compute
        add     rdi, 452
        xorps   xmm0, xmm0
        xor     eax, eax
.LBB0_1:                                # =>This Loop Header: Depth=1
        mov     rcx, -8
.LBB0_2:                                #   Parent Loop BB0_1 Depth=1
        movss   xmm1, dword ptr [rdi + 8*rcx - 388] # xmm1 = mem[0],zero,zero,zero
        movss   xmm2, dword ptr [rdi + 8*rcx - 384] # xmm2 = mem[0],zero,zero,zero
        mulss   xmm2, xmm2
        mulss   xmm1, xmm1
        addss   xmm1, xmm2
        addss   xmm1, xmm0
        movss   xmm0, dword ptr [rdi + 8*rcx - 324] # xmm0 = mem[0],zero,zero,zero
        movss   xmm2, dword ptr [rdi + 8*rcx - 320] # xmm2 = mem[0],zero,zero,zero
        mulss   xmm2, xmm2
        mulss   xmm0, xmm0
        addss   xmm0, xmm2
        addss   xmm0, xmm1
        movss   xmm1, dword ptr [rdi + 8*rcx - 260] # xmm1 = mem[0],zero,zero,zero
        movss   xmm2, dword ptr [rdi + 8*rcx - 256] # xmm2 = mem[0],zero,zero,zero
        mulss   xmm2, xmm2
        mulss   xmm1, xmm1
        addss   xmm1, xmm2
        addss   xmm1, xmm0
        movss   xmm0, dword ptr [rdi + 8*rcx - 196] # xmm0 = mem[0],zero,zero,zero
        movss   xmm2, dword ptr [rdi + 8*rcx - 192] # xmm2 = mem[0],zero,zero,zero
        mulss   xmm2, xmm2
        mulss   xmm0, xmm0
        addss   xmm0, xmm2
        addss   xmm0, xmm1
        movss   xmm1, dword ptr [rdi + 8*rcx - 132] # xmm1 = mem[0],zero,zero,zero
        movss   xmm2, dword ptr [rdi + 8*rcx - 128] # xmm2 = mem[0],zero,zero,zero
        mulss   xmm2, xmm2
        mulss   xmm1, xmm1
        addss   xmm1, xmm2
        addss   xmm1, xmm0
        movss   xmm0, dword ptr [rdi + 8*rcx - 68] # xmm0 = mem[0],zero,zero,zero
        movss   xmm2, dword ptr [rdi + 8*rcx - 64] # xmm2 = mem[0],zero,zero,zero
        mulss   xmm2, xmm2
        mulss   xmm0, xmm0
        addss   xmm0, xmm2
        addss   xmm0, xmm1
        movss   xmm1, dword ptr [rdi + 8*rcx - 4] # xmm1 = mem[0],zero,zero,zero
        movss   xmm2, dword ptr [rdi + 8*rcx]   # xmm2 = mem[0],zero,zero,zero
        mulss   xmm2, xmm2
        mulss   xmm1, xmm1
        addss   xmm1, xmm2
        addss   xmm1, xmm0
        movss   xmm0, dword ptr [rdi + 8*rcx + 60] # xmm0 = mem[0],zero,zero,zero
        movss   xmm2, dword ptr [rdi + 8*rcx + 64] # xmm2 = mem[0],zero,zero,zero
        mulss   xmm2, xmm2
        mulss   xmm0, xmm0
        addss   xmm0, xmm2
        addss   xmm0, xmm1
        inc     rcx
        jne     .LBB0_2
        add     rax, 1
        add     rdi, 512
        cmp     rax, 1960
        jne     .LBB0_1
        ret

Modern x86-64 processors can execute many instructions in parallel, out of order, and they can pipeline them efficiently, but there is a catch: they cannot execute dependent instructions in parallel. In fact, neither the compiler nor the processor can reorder/optimize the chain of additions into `sq`, because floating-point operations are not associative (see this post for more information). As a result, the accumulation is done serially. You can enable compiler optimizations (like `-ffast-math` for GCC/Clang) to address this issue (but `-ffast-math` can be unsafe if the code contains special values like NaN or if it relies on associativity). Alternatively, you can unroll the loop yourself using multiple accumulators to break the dependency chain (which is latency-bound).
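Here is a sketch of the manual unrolling with four independent accumulators (the unroll factor and function name are illustrative; 1960 * 8 * 8 = 125440 is divisible by 4, so no tail loop is needed):

float compute_unrolled(const my_complex* Q)
{
  float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
  for (int i = 0; i < 1960 * 8 * 8; i += 4)
  {
    /* The four partial sums do not depend on each other, so the
       processor can overlap their additions instead of waiting
       for a single serial chain. */
    s0 += Q[i + 0].real * Q[i + 0].real + Q[i + 0].img * Q[i + 0].img;
    s1 += Q[i + 1].real * Q[i + 1].real + Q[i + 1].img * Q[i + 1].img;
    s2 += Q[i + 2].real * Q[i + 2].real + Q[i + 2].img * Q[i + 2].img;
    s3 += Q[i + 3].real * Q[i + 3].real + Q[i + 3].img * Q[i + 3].img;
  }
  /* Combine the partial sums once at the end. */
  return (s0 + s1) + (s2 + s3);
}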

Additionally, the code is not vectorized using SIMD instructions: `mulss` and `addss` are scalar instructions. Again, fast-math helps to address this issue, since allowing the compiler to reassociate the additions is what lets it pack them into SIMD lanes.

Moreover, compilers cannot generate code using wider SIMD instruction sets like AVX/AVX2/AVX-512 on x86-64 by default, since those are not supported by all processors (typically old ones). You can use `-mavx`/`-mavx2`/`-mavx512f` on GCC/Clang to address this, though it requires the target processor to support the chosen set (most processors from the last 10 years support at least AVX, and most recent ones support AVX2). You can also enable the fused multiply-add instruction set with `-mfma`.
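For instance, with GCC or Clang one might compile with `gcc -O3 -mavx2 -mfma -ffast-math ...`, or simply use `-march=native`, which enables every instruction set the build machine supports (at the cost of portability of the resulting binary).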

Last but not least, the data structure is not SIMD-friendly, so compilers can hardly vectorize it efficiently. It is better to operate on a structure of arrays rather than an array of structures here. This problem is known as "AoS vs SoA", where AoS means array of structures and SoA means structure of arrays. You can find some information about it here.
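A minimal SoA sketch of the same data (the type name, field names, and function name are illustrative) could look like this:

typedef struct
{
  float* real; /* all real parts, contiguous */
  float* img;  /* all imaginary parts, contiguous */
} my_complex_soa;

float compute_soa(const my_complex_soa* q, int n)
{
  float sq = 0.f;
  for (int i = 0; i < n; i++)
    sq += q->real[i] * q->real[i] + q->img[i] * q->img[i];
  return sq;
}

With this layout, a SIMD load fills a whole register with useful values (e.g. 8 consecutive real parts with AVX) instead of gathering them from interleaved real/img pairs, so the compiler can vectorize the loop much more efficiently (with fast-math enabled for the reduction).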

Jérôme Richard