First of all, mainstream compilers are able to automatically unroll loops (assuming optimizations are enabled), so index calculation is clearly not an issue. Here is, for example, the assembly code generated by Clang (see it on Godbolt):
compute: # @compute
add rdi, 452
xorps xmm0, xmm0
xor eax, eax
.LBB0_1: # =>This Loop Header: Depth=1
mov rcx, -8
.LBB0_2: # Parent Loop BB0_1 Depth=1
movss xmm1, dword ptr [rdi + 8*rcx - 388] # xmm1 = mem[0],zero,zero,zero
movss xmm2, dword ptr [rdi + 8*rcx - 384] # xmm2 = mem[0],zero,zero,zero
mulss xmm2, xmm2
mulss xmm1, xmm1
addss xmm1, xmm2
addss xmm1, xmm0
movss xmm0, dword ptr [rdi + 8*rcx - 324] # xmm0 = mem[0],zero,zero,zero
movss xmm2, dword ptr [rdi + 8*rcx - 320] # xmm2 = mem[0],zero,zero,zero
mulss xmm2, xmm2
mulss xmm0, xmm0
addss xmm0, xmm2
addss xmm0, xmm1
movss xmm1, dword ptr [rdi + 8*rcx - 260] # xmm1 = mem[0],zero,zero,zero
movss xmm2, dword ptr [rdi + 8*rcx - 256] # xmm2 = mem[0],zero,zero,zero
mulss xmm2, xmm2
mulss xmm1, xmm1
addss xmm1, xmm2
addss xmm1, xmm0
movss xmm0, dword ptr [rdi + 8*rcx - 196] # xmm0 = mem[0],zero,zero,zero
movss xmm2, dword ptr [rdi + 8*rcx - 192] # xmm2 = mem[0],zero,zero,zero
mulss xmm2, xmm2
mulss xmm0, xmm0
addss xmm0, xmm2
addss xmm0, xmm1
movss xmm1, dword ptr [rdi + 8*rcx - 132] # xmm1 = mem[0],zero,zero,zero
movss xmm2, dword ptr [rdi + 8*rcx - 128] # xmm2 = mem[0],zero,zero,zero
mulss xmm2, xmm2
mulss xmm1, xmm1
addss xmm1, xmm2
addss xmm1, xmm0
movss xmm0, dword ptr [rdi + 8*rcx - 68] # xmm0 = mem[0],zero,zero,zero
movss xmm2, dword ptr [rdi + 8*rcx - 64] # xmm2 = mem[0],zero,zero,zero
mulss xmm2, xmm2
mulss xmm0, xmm0
addss xmm0, xmm2
addss xmm0, xmm1
movss xmm1, dword ptr [rdi + 8*rcx - 4] # xmm1 = mem[0],zero,zero,zero
movss xmm2, dword ptr [rdi + 8*rcx] # xmm2 = mem[0],zero,zero,zero
mulss xmm2, xmm2
mulss xmm1, xmm1
addss xmm1, xmm2
addss xmm1, xmm0
movss xmm0, dword ptr [rdi + 8*rcx + 60] # xmm0 = mem[0],zero,zero,zero
movss xmm2, dword ptr [rdi + 8*rcx + 64] # xmm2 = mem[0],zero,zero,zero
mulss xmm2, xmm2
mulss xmm0, xmm0
addss xmm0, xmm2
addss xmm0, xmm1
inc rcx
jne .LBB0_2
add rax, 1
add rdi, 512
cmp rax, 1960
jne .LBB0_1
ret
Modern x86-64 processors can execute many instructions in parallel in an out-of-order way and they can pipeline them efficiently, but there is a catch: they cannot execute dependent instructions in parallel. In fact, neither the compiler nor the processor can reorder/optimize the chain of additions in sq because floating-point operations are not associative (see this post for more information). As a result, the accumulation is done serially. You can enable compiler optimizations (like -ffast-math for GCC/Clang) to address this issue (but -ffast-math can be unsafe if the code contains special values like NaN or if it relies on associativity). Alternatively, you can unroll the loop yourself using multiple accumulators so as to break the dependency chain (which is latency-bound), as shown in the sketch below.
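For illustration, here is a minimal sketch of such a manual unrolling with 4 independent accumulators. The Point structure, the pts pointer and the size n are assumptions made for the example (not necessarily your exact types/names), and note that the result can differ slightly from the serial version since the additions are reordered:

#include <cstddef>

struct Point { float x, y; };   // assumed layout: two packed floats per element

// Sum of squared norms with 4 independent accumulators: the 4 addition chains
// do not depend on each other, so the processor can execute them in parallel.
float compute(const Point* pts, std::size_t n)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += pts[i + 0].x * pts[i + 0].x + pts[i + 0].y * pts[i + 0].y;
        s1 += pts[i + 1].x * pts[i + 1].x + pts[i + 1].y * pts[i + 1].y;
        s2 += pts[i + 2].x * pts[i + 2].x + pts[i + 2].y * pts[i + 2].y;
        s3 += pts[i + 3].x * pts[i + 3].x + pts[i + 3].y * pts[i + 3].y;
    }
    for (; i < n; ++i)                       // handle the remaining elements
        s0 += pts[i].x * pts[i].x + pts[i].y * pts[i].y;
    return (s0 + s1) + (s2 + s3);
}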
Additionally, the code is not vectorized using SIMD instructions. Indeed, mulss and addss are scalar instructions. Again, fast-math helps to address this issue, since vectorizing the reduction requires the compiler to reorder the additions.
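If you prefer not to rely on fast-math, you can also vectorize the reduction manually with SSE intrinsics (available on all x86-64 processors). Here is a sketch assuming the same Point layout as above (two packed floats per element); each 128-bit load reads two points and the squares are accumulated 4 lanes at a time:

#include <immintrin.h>   // SSE intrinsics
#include <cstddef>

struct Point { float x, y; };                 // assumed layout

float compute_sse(const Point* pts, std::size_t n)
{
    __m128 acc = _mm_setzero_ps();            // 4 partial sums
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2)
    {
        __m128 v = _mm_loadu_ps(&pts[i].x);   // loads x0 y0 x1 y1
        acc = _mm_add_ps(acc, _mm_mul_ps(v, v));
    }
    // Horizontal sum of the 4 lanes
    __m128 hi = _mm_movehl_ps(acc, acc);
    __m128 t  = _mm_add_ps(acc, hi);
    t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 0x55));
    float s = _mm_cvtss_f32(t);
    if (i < n)                                // n odd: last element
        s += pts[i].x * pts[i].x + pts[i].y * pts[i].y;
    return s;
}

Using two or more __m128 accumulators inside the loop would additionally hide the addps latency, for the same reason as above.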
Moreover, compilers cannot generate code using wider SIMD instruction sets like AVX/AVX2/AVX-512 by default on x86-64 processors, since they are not supported by all processors (typically older ones). You can use -mavx/-mavx2/-mavx512f on GCC/Clang to address this issue, though this requires the target processor to support it (most processors released in the last 10 years support at least AVX, and most recent ones also support AVX2). You can also enable the fused multiply-add instruction set with -mfma.
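For example, assuming the function lives in a file called compute.cpp (the file name is just for the example), a compilation command enabling these features could look like:

clang++ -O3 -mavx2 -mfma -ffast-math -c compute.cpp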
Last but not least, the data structure is not SIMD-friendly, so compilers can hardly vectorize it efficiently. It is better to operate on a structure of arrays rather than on an array of structures here. This problem is known as "AoS vs. SoA", where AoS means array of structures and SoA means structure of arrays. You can find some information about this here. A minimal sketch of a SoA layout follows.
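Here is what that could look like (the Points name and the use of std::vector are assumptions for the example):

#include <cstddef>
#include <vector>

// Structure of arrays: all x components are contiguous and so are all y
// components, so the compiler can load/square/add several of them per
// SIMD instruction without any shuffling.
struct Points {
    std::vector<float> x;
    std::vector<float> y;
};

float compute(const Points& p)
{
    float s = 0.f;
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i)
        s += p.x[i] * p.x[i] + p.y[i] * p.y[i];
    return s;   // vectorizes well with -ffast-math or manual unrolling
}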