GCC auto-vectorized to do 4 additions in parallel with SSE2 paddd, because x86-64 implies that SSE2 is available. (It would use AVX2 if you use -march=haswell or later, also allowing a possibly-unaligned memory source operand.)

It does this because -O3 includes -ftree-vectorize for GCC; if you want simpler scalar asm, use -fno-tree-vectorize. (See also How to remove "noise" from GCC/clang assembly output?) GCC 12 and newer behave like clang, enabling auto-vectorization even at -O2.
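For reference, a minimal C sketch consistent with the asm being discussed (the actual function is in the question, so the name testFunction is only taken from the asm label and the parameter names are assumptions), with the relevant compile options noted in comments:

// gcc -O3 -S test.c                       # auto-vectorizes (SSE2 baseline)
// gcc -O3 -fno-tree-vectorize -S test.c   # simpler scalar loop instead
int testFunction(int *input, int length) {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += input[i];                   // GCC -O3 turns this into a paddd loop
    return sum;
}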
The stuff after the main loop is an inefficient horizontal sum (see Fastest way to do horizontal SSE vector sum (or other reduction)) - using movdqa/psrldq to bring the high half down so it can add the high pair of elements to the low pair of the vector sum. The efficient way to do that shuffle is pshufd to copy-and-shuffle, since the upper half of the register is don't-care at that point. But GCC generates the horizontal-sum pattern in target-independent logic, and apparently that makes it inconvenient to adapt the pattern to the shuffles the target (in this case x86-64 without AVX) can do efficiently. Clang doesn't have this problem and uses pshufd like a normal human would.
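For comparison, here's what the efficient copy-and-shuffle horizontal sum looks like with intrinsics (a sketch of the standard SSE2 idiom, not code from the question):

#include <immintrin.h>

// Horizontal sum of 4 int32 elements using pshufd copy-and-shuffle,
// avoiding GCC's extra movdqa+psrldq pairs.
static int hsum_epi32_sse2(__m128i v) {
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));     // swap 64-bit halves
    __m128i sum64 = _mm_add_epi32(v, hi64);                            // v[0]+=v[2]; v[1]+=v[3]
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1)); // swap within pairs
    __m128i sum32 = _mm_add_epi32(sum64, hi32);                        // v[0]+=v[1]
    return _mm_cvtsi128_si32(sum32);                                   // movd the low element
}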
And the rest is fully-unrolled scalar cleanup, spending more code-size on that than on the actual SIMD loop, which is typical for GCC. The movslq in that cleanup is there because you used int length, but pointers are 64-bit on x86-64 and GCC decides it needs to sign-extend something.
Clang prefers to unroll tiny loops by 4, spending more code-size there to maybe sustain two load+add operations per clock cycle on Intel since Sandybridge or AMD since Zen, unless you use -fno-unroll-loops to get clang to keep its asm simpler; that's especially useful in more complex cases, when you want to see how it's vectorizing something a little less trivial.
You can see which parts of the asm are relevant to handling length % 4 != 0 by changing the source to do length &= -4; or even length = 1024, to let GCC know that case can't happen so it can omit the logic for it. (Godbolt) Then we get fairly simple asm, just the vectorized part with no cleanup other than a horizontal sum. Clang's is simpler, but uses an indexed addressing mode for the load. (Which would be a missed optimization when clang unrolls with AVX and folds a memory source operand into vpaddd (%rdi, %rcx, 4), %xmm0, %xmm0 - see Micro fusion and addressing modes.)
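That is, something along these lines (a sketch; again the function and parameter names are assumptions matching the asm):

int testFunction(int *input, int length) {
    int sum = 0;
    length &= -4;            // or length = 1024;  promises a multiple of 4 elements
    for (int i = 0; i < length; i++)
        sum += input[i];     // vector loop only, no scalar cleanup needed
    return sum;
}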
# GCC11.2 -O3
testFunction:
andl $-4, %esi
jle .L4
shrl $2, %esi
pxor %xmm0, %xmm0 # sum_vec = {0,0,0,0}
leal -1(%rsi), %eax # calculate a last-vector end-pointer
salq $4, %rax # perhaps complicated by it being 32-bit
leaq 16(%rdi,%rax), %rax # and/or by missed optimizations?
.L3: # do {
movdqu (%rdi), %xmm2 # vtmp = p[0..3]
addq $16, %rdi # p += 4
paddd %xmm2, %xmm0 # sum_vec += vtmp
cmpq %rax, %rdi
jne .L3 # }while( p != endp )
movdqa %xmm0, %xmm1 # horizontal sum
psrldq $8, %xmm1
paddd %xmm1, %xmm0 # v[0] += v[2] ; v[1] += v[3]; high half = don't care
movdqa %xmm0, %xmm1
psrldq $4, %xmm1
paddd %xmm1, %xmm0 # v[0] += v[1] high 3 elements = don't-care
movd %xmm0, %eax # retval = v[0]
ret
.L4:
xorl %eax, %eax # length & -4 == 0 return path
ret
The other instructions in the full version (not present here) handle adding the last 0 to 3 leftover elements into the scalar return value.
Older GCC would go scalar until reaching an alignment boundary, but this is typically not worth it, especially if the data happens to be aligned anyway. But if your data really is always aligned, you'd also want to simplify the asm with input = __builtin_assume_aligned(input, 16);
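A sketch of that (only valid if callers really do pass 16-byte-aligned pointers; names are assumptions as before):

int testFunction(int *input, int length) {
    input = __builtin_assume_aligned(input, 16);  // GCC/clang extension: promise alignment
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += input[i];     // lets the compiler use aligned loads / memory operands instead of movdqu
    return sum;
}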