I generate a high-performance loop at runtime that, for example, sums two arrays. I want to unroll the loop. Which order of operations inside the loop body should I choose?
- Variant 1:
  a. Load as much data as possible (limited by the number of ymm registers).
  b. Perform the sum for all ymm registers.
  c. Store all ymm registers back to memory.
- Variant 2:
  a. Load the data for one sum operation.
  b. Perform the sum.
  c. Store the result.
  d. Repeat several times (limited by the number of ymm registers).
For example:
# rcx - size of the arrays, in float elements
# rsi - pointer to array 1
# rdi - pointer to array 2
# rax - pointer to result array
We will unroll 4 iterations at a time (4 ymm registers x 8 floats = 32 elements per pass).
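For reference, here is what both variants are meant to compute, as a minimal plain-C sketch. The function name `sum_arrays_ref` and the constants `LANES` and `UNROLL` are only illustrative and are not part of my generator; `a`, `b`, `result` and `n` stand for rsi, rdi, rax and rcx. It assumes the element count is a multiple of 32, which the loops also rely on because they exit only when the counter reaches exactly zero.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants: 8 floats fit in one ymm register,
   and 4 registers are used per unrolled pass. */
enum { LANES = 8, UNROLL = 4 };

/* Hypothetical scalar reference: result[i] = a[i] + b[i].
   a ~ rsi, b ~ rdi, result ~ rax, n ~ rcx in the listings. */
void sum_arrays_ref(const float *a, const float *b, float *result, size_t n)
{
    /* Assumption: n is a multiple of 32 elements, matching the
       "sub rcx, 4 * 8" step and the exact-zero exit test. */
    assert(n % (LANES * UNROLL) == 0);

    /* Outer loop: one unrolled pass of 32 elements. */
    for (size_t i = 0; i < n; i += LANES * UNROLL) {
        /* Inner loop: the work done by the four load/add/store groups. */
        for (size_t j = 0; j < LANES * UNROLL; ++j) {
            result[i + j] = a[i + j] + b[i + j];
        }
    }
}
```

The two variants differ only in how the loads, adds and stores of the inner block are ordered, not in the result.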
VARIANT 1
:StartLoop
test rcx, rcx
jz EndLoop
# load data from array 1
vmovups ymm0, ptr[rsi]
vmovups ymm1, ptr[rsi + 0x20]
vmovups ymm2, ptr[rsi + 0x40]
vmovups ymm3, ptr[rsi + 0x60]
# perform sum
vaddps ymm0, ymm0, ptr[rdi]
vaddps ymm1, ymm1, ptr[rdi + 0x20]
vaddps ymm2, ymm2, ptr[rdi + 0x40]
vaddps ymm3, ymm3, ptr[rdi + 0x60]
# store result
vmovups ptr[rax], ymm0
vmovups ptr[rax + 0x20], ymm1
vmovups ptr[rax + 0x40], ymm2
vmovups ptr[rax + 0x60], ymm3
# update pointers and counter
sub rcx, 4 * 8
lea rsi, ptr[rsi + 0x80]
lea rdi, ptr[rdi + 0x80]
lea rax, ptr[rax + 0x80]
jmp StartLoop
:EndLoop
VARIANT 2
:StartLoop
test rcx, rcx
jz EndLoop
# Iteration 1
vmovups ymm0, ptr[rsi]
vaddps ymm0, ymm0, ptr[rdi]
vmovups ptr[rax], ymm0
# Iteration 2
vmovups ymm1, ptr[rsi + 0x20]
vaddps ymm1, ymm1, ptr[rdi + 0x20]
vmovups ptr[rax + 0x20], ymm1
# Iteration 3
vmovups ymm2, ptr[rsi + 0x40]
vaddps ymm2, ymm2, ptr[rdi + 0x40]
vmovups ptr[rax + 0x40], ymm2
# Iteration 4
vmovups ymm3, ptr[rsi + 0x60]
vaddps ymm3, ymm3, ptr[rdi + 0x60]
vmovups ptr[rax + 0x60], ymm3
# update pointers and counter
sub rcx, 4 * 8
lea rsi, ptr[rsi + 0x80]
lea rdi, ptr[rdi + 0x80]
lea rax, ptr[rax + 0x80]
jmp StartLoop
:EndLoop
Which variant makes better use of the SIMD pipelines and the memory bus, and why? Thanks!