
I generate high-performance loops at runtime which, for example, sum two arrays. I want to unroll my loop. Which sequence of operations inside the loop should I choose:

  1. a. Load as much data as possible (constrained by the number of ymm registers).
    b. Perform the sum operation on all loaded ymm registers.
    c. Store all ymm registers back to memory.
  2. a. Load data for one sum operation.
    b. Perform the sum operation.
    c. Store the result.
    d. Repeat several times (constrained by the number of ymm registers).
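Both orderings compute the same result; as a baseline, here is the operation in plain C (a sketch — the function name `sum_arrays` is mine, not from the generated code):

```c
#include <stddef.h>

/* Scalar reference for what both unrolled variants compute:
   out[i] = a[i] + b[i] for n float elements. */
static void sum_arrays(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Each ymm-based variant below vectorizes this body 8 floats at a time.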

For example:

# rcx - size of array
# rsi - pointer to array 1
# rdi - pointer to array 2
# rax - pointer to result array

We will unroll 4 iterations at a time (4 ymm registers x 8 floats = 32 floats per pass):

VARIANT 1

StartLoop:
test rcx, rcx
jz EndLoop
# load data from array 1
vmovups ymm0, ymmword ptr [rsi]
vmovups ymm1, ymmword ptr [rsi + 0x20]
vmovups ymm2, ymmword ptr [rsi + 0x40]
vmovups ymm3, ymmword ptr [rsi + 0x60]

# perform sum
vaddps ymm0, ymm0, ymmword ptr [rdi]
vaddps ymm1, ymm1, ymmword ptr [rdi + 0x20]
vaddps ymm2, ymm2, ymmword ptr [rdi + 0x40]
vaddps ymm3, ymm3, ymmword ptr [rdi + 0x60]

# store result
vmovups ymmword ptr [rax], ymm0
vmovups ymmword ptr [rax + 0x20], ymm1
vmovups ymmword ptr [rax + 0x40], ymm2
vmovups ymmword ptr [rax + 0x60], ymm3

# update pointers and counter
sub rcx, 4 * 8            # 4 ymm registers x 8 floats = 32 elements
lea rsi, [rsi + 0x80]
lea rdi, [rdi + 0x80]
lea rax, [rax + 0x80]
jmp StartLoop
EndLoop:

VARIANT 2

StartLoop:
test rcx, rcx
jz EndLoop
# Iteration 1
vmovups ymm0, ymmword ptr [rsi]
vaddps ymm0, ymm0, ymmword ptr [rdi]
vmovups ymmword ptr [rax], ymm0

# Iteration 2
vmovups ymm1, ymmword ptr [rsi + 0x20]
vaddps ymm1, ymm1, ymmword ptr [rdi + 0x20]
vmovups ymmword ptr [rax + 0x20], ymm1

# Iteration 3
vmovups ymm2, ymmword ptr [rsi + 0x40]
vaddps ymm2, ymm2, ymmword ptr [rdi + 0x40]
vmovups ymmword ptr [rax + 0x40], ymm2

# Iteration 4
vmovups ymm3, ymmword ptr [rsi + 0x60]
vaddps ymm3, ymm3, ymmword ptr [rdi + 0x60]
vmovups ymmword ptr [rax + 0x60], ymm3

# update pointers and counter
sub rcx, 4 * 8            # 4 ymm registers x 8 floats = 32 elements
lea rsi, [rsi + 0x80]
lea rdi, [rdi + 0x80]
lea rax, [rax + 0x80]
jmp StartLoop
EndLoop:
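For comparison, variant 1 can also be written with AVX intrinsics and handed to a compiler, as a cross-check on the generated assembly. This is a sketch under assumptions not in the original: the function name is mine, `n` must be a multiple of 32 floats, and the GCC/Clang `target("avx")` attribute stands in for building with `-mavx`.

```c
#include <immintrin.h>
#include <stddef.h>

/* Variant 1 in intrinsics: load 4 ymm registers, do 4 adds, then 4 stores.
   Assumes n is a multiple of 32 (4 ymm registers x 8 floats). */
__attribute__((target("avx")))
static void sum_arrays_avx(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 32) {
        /* load data from array 1 */
        __m256 v0 = _mm256_loadu_ps(a + i);
        __m256 v1 = _mm256_loadu_ps(a + i + 8);
        __m256 v2 = _mm256_loadu_ps(a + i + 16);
        __m256 v3 = _mm256_loadu_ps(a + i + 24);
        /* perform sum with array 2 */
        v0 = _mm256_add_ps(v0, _mm256_loadu_ps(b + i));
        v1 = _mm256_add_ps(v1, _mm256_loadu_ps(b + i + 8));
        v2 = _mm256_add_ps(v2, _mm256_loadu_ps(b + i + 16));
        v3 = _mm256_add_ps(v3, _mm256_loadu_ps(b + i + 24));
        /* store result */
        _mm256_storeu_ps(out + i, v0);
        _mm256_storeu_ps(out + i + 8, v1);
        _mm256_storeu_ps(out + i + 16, v2);
        _mm256_storeu_ps(out + i + 24, v3);
    }
}
```

Comparing the compiler's output for this function against both hand-written variants is a cheap way to spot scheduling or addressing-mode choices you might have missed.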

Which variant keeps the SIMD pipelines and the memory bus better utilized, and why? Thanks!

Yuriy
  • Instruction scheduling over short distances like this usually doesn't make a measurable difference; all CPUs with AVX have at least some OoO exec (even Alder Lake E-cores). Doing stores after independent loads is usually not a bad thing, though. But why are you using LEA instead of ADD to increment pointers? (Or better, `sub -0x80` to use an imm8.) – Peter Cordes Feb 03 '22 at 22:05
  • And what's up with having a test at the *top*? It seems to be inside the loop, which is obviously bad since you already need some kind of jump at the bottom: [Why are loops always compiled into "do...while" style (tail jump)?](https://stackoverflow.com/q/47783926). A normal loop might have a test/jcc to skip it entirely if it needs to run zero iterations, otherwise just the sub/jnz, or sub/jg or jge at the bottom. (Together so they can macro-fuse) – Peter Cordes Feb 03 '22 at 22:05
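The loop shape this comment recommends — one guard before the loop plus a single test at the bottom — can be sketched in C (names are illustrative, not from the original code):

```c
#include <stddef.h>

/* "Rotated" loop: a one-time check to skip zero iterations, then a
   do-while whose bottom test compiles to a macro-fusable sub/jnz,
   leaving exactly one branch inside the loop body. */
static void sum_rotated(const float *a, const float *b, float *out, size_t n)
{
    if (n == 0)             /* test/jz executed once, before the loop */
        return;
    do {
        *out++ = *a++ + *b++;
    } while (--n != 0);     /* sub/jnz at the bottom of each iteration */
}
```

Contrast this with the question's loops, which pay for both a `test`/`jz` at the top and an unconditional `jmp` at the bottom on every iteration.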
  • As a third option, you might try encoding it using intrinsics and let the compiler take a crack at optimizing the code. Compare the perf with your existing approaches. It's possible that interleaving instructions might give better perf than doing things in a human-logical way. BTW: Does RCX contain the number of elements? Or the size of the buffer? And can you safely assume that the number of elements divides evenly by 4? – David Wohlferd Feb 04 '22 at 05:35
  • @PeterCordes About test/jz... I forgot to insert back jump at the end of the loop. I use test instruction at the beginning of the loop because I think that it helps instruction prefetcher to prefetch most probable branch. – Yuriy Feb 04 '22 at 07:12
  • @DavidWohlferd I tried both approaches but I cant measure any significant increasing of performance of either of variant. About value in RCX... It is just an example, in real code I of course check all possible cases. – Yuriy Feb 04 '22 at 07:16
  • @Yuriy: You're mistaken about branch prediction. There's no reason to expect a normally-not-taken branch to predict better than a normally-taken branch at the bottom, even on old simple CPUs with static prediction heuristics. There is a concrete benefit to only having one total jump in the loop, though, at the bottom. (fewer total uops for the front-end, letting OoO exec see farther). See the link in my earlier comment. This is one of the cases where compilers are getting it right, not a missed optimization. – Peter Cordes Feb 04 '22 at 07:29
  • I'd second David's recommendation to let clang unroll for you (or maybe GCC with the right options), as long as they avoid indexed addressing modes like you're doing (which partially defeat the purpose of unrolling on Intel CPUs where they unlaminate with AVX operations: [Micro fusion and addressing modes](https://stackoverflow.com/q/26046634)) – Peter Cordes Feb 04 '22 at 07:32
