GCC auto-vectorized to do 4 additions in parallel with SSE2 paddd, because x86-64 implies that SSE2 is available. (It would use AVX2 if you use -march=haswell or later, also allowing a possibly-unaligned memory source operand.)

It does this because -O3 includes -ftree-vectorize for GCC; if you want simpler scalar asm, use -fno-tree-vectorize. (See also How to remove "noise" from GCC/clang assembly output?) GCC 12 and newer behave like clang, enabling auto-vectorization even at -O2.
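For reference, a minimal C sketch consistent with the asm being discussed (the actual function is in the question, so the name testFunction is only taken from the asm label and the parameter names are assumptions), with the relevant compile options noted in comments:

// gcc -O3 -S test.c                       # auto-vectorizes (SSE2 baseline)
// gcc -O3 -fno-tree-vectorize -S test.c   # simpler scalar loop instead
int testFunction(int *input, int length) {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += input[i];                   // GCC -O3 turns this into a paddd loop
    return sum;
}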
The stuff after the main loop is an inefficient horizontal sum (see Fastest way to do horizontal SSE vector sum (or other reduction)) - using movdqa/psrldq to bring the high half down so it can add the high pair of elements to the low pair of the vector sum. The efficient way to do that shuffle is pshufd to copy-and-shuffle, since the upper half of the register is don't-care at that point. But GCC generates the horizontal-sum pattern in target-independent logic, and apparently that makes it inconvenient to adapt the pattern to the shuffles the target (in this case x86-64 without AVX) can do efficiently. Clang doesn't have this problem and uses pshufd like a normal human would.
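For comparison, here's what the efficient copy-and-shuffle horizontal sum looks like with intrinsics (a sketch of the standard SSE2 idiom, not code from the question):

#include <immintrin.h>

// Horizontal sum of 4 int32 elements using pshufd copy-and-shuffle,
// avoiding GCC's extra movdqa+psrldq pairs.
static int hsum_epi32_sse2(__m128i v) {
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));     // swap 64-bit halves
    __m128i sum64 = _mm_add_epi32(v, hi64);                            // v[0]+=v[2]; v[1]+=v[3]
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1)); // swap within pairs
    __m128i sum32 = _mm_add_epi32(sum64, hi32);                        // v[0]+=v[1]
    return _mm_cvtsi128_si32(sum32);                                   // movd the low element
}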
And the rest is fully-unrolled scalar cleanup, spending more code-size on that than on the actual SIMD loop, which is typical for GCC. The movslq in that cleanup is there because you used int length, but pointers are 64-bit on x86-64 and GCC decides it needs to sign-extend something.
Clang prefers to unroll tiny loops by 4, spending more code-size there to maybe sustain two load+add operations per clock cycle on Intel since Sandybridge or AMD since Zen, unless you use -fno-unroll-loops to get clang to keep its asm simpler; that's especially useful in more complex cases, when you want to see how it's vectorizing something a little less trivial.
You can see which parts of the asm are relevant to handling length % 4 != 0 by changing the source to do length &= -4; or even length = 1024, to let GCC know that case can't happen so it can omit the logic for it. (Godbolt) Then we get fairly simple asm, just the vectorized part with no cleanup other than a horizontal sum. Clang's is simpler, but uses an indexed addressing mode for the load. (Which would be a missed optimization when clang unrolls with AVX and folds a memory source operand into vpaddd (%rdi, %rcx, 4), %xmm0, %xmm0 - see Micro fusion and addressing modes.)
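That is, something along these lines (a sketch; again the function and parameter names are assumptions matching the asm):

int testFunction(int *input, int length) {
    int sum = 0;
    length &= -4;            // or length = 1024;  promises a multiple of 4 elements
    for (int i = 0; i < length; i++)
        sum += input[i];     // vector loop only, no scalar cleanup needed
    return sum;
}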
# GCC11.2 -O3
testFunction:
andl $-4, %esi
jle .L4
shrl $2, %esi
pxor %xmm0, %xmm0 # sum_vec = {0,0,0,0}
leal -1(%rsi), %eax # calculate a last-vector end-pointer
salq $4, %rax # perhaps complicated by it being 32-bit
leaq 16(%rdi,%rax), %rax # and/or by missed optimizations?
.L3: # do {
movdqu (%rdi), %xmm2 # vtmp = p[0..3]
addq $16, %rdi # p += 4
paddd %xmm2, %xmm0 # sum_vec += vtmp
cmpq %rax, %rdi
jne .L3 # }while( p != endp )
movdqa %xmm0, %xmm1 # horizontal sum
psrldq $8, %xmm1
paddd %xmm1, %xmm0 # v[0] += v[2] ; v[1] += v[3]; high half = don't care
movdqa %xmm0, %xmm1
psrldq $4, %xmm1
paddd %xmm1, %xmm0 # v[0] += v[1] high 3 elements = don't-care
movd %xmm0, %eax # retval = v[0]
ret
.L4:
xorl %eax, %eax # length & -4 == 0 return path
ret
The other instructions in the full version (not present here) handle adding the last 0 to 3 leftover elements into the scalar return value.
Older GCC would go scalar until reaching an alignment boundary, but this is typically not worth it, especially if the data happens to be aligned anyway. But if your data really is always aligned, you'd also want to simplify the asm with input = __builtin_assume_aligned(input, 16);
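A sketch of that (only valid if callers really do pass 16-byte-aligned pointers; names are assumptions as before):

int testFunction(int *input, int length) {
    input = __builtin_assume_aligned(input, 16);  // GCC/clang extension: promise alignment
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += input[i];     // lets the compiler use aligned loads / memory operands instead of movdqu
    return sum;
}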