This is a silly missed optimization by ICC. It's not specific to AVX512; it still happens with default/generic arch settings.
`lea ecx, DWORD PTR [16+rax]` is computing `i+16` as part of the unroll, with truncation to 32 bits (32-bit operand-size) and zero-extension to 64 bits (implicit in x86-64 when writing a 32-bit register). This explicitly implements the semantics of unsigned wrap-around at the type width.
gcc and clang have no problem proving that unsigned `i` won't wrap, so they can optimize away the zero-extension from 32-bit unsigned to 64-bit pointer width for use in an addressing mode, because the loop's upper bound is known¹.
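To spell that out in C terms, here is a rough sketch of the index math each compiler ends up implementing; the casts and helper names are mine, standing in for register widths:

```c
#include <stdint.h>

/* Sketch (not from the original answer): what the index computation looks like.
 * ICC materializes the wrap-at-32-bits semantics; gcc/clang drop it after
 * proving i+16 can never exceed UINT32_MAX for this loop's bounds. */
uint64_t next_index_icc(uint64_t i) {
    return (uint32_t)(i + 16);   /* truncate to 32 bits, then zero-extend: lea ecx, [16+rax] */
}
uint64_t next_index_gcc_clang(uint64_t i) {
    return i + 16;               /* plain 64-bit add, usable directly in an addressing mode */
}
```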
Recall that unsigned wrap-around is well-defined in C and C++, but signed overflow is undefined behaviour. That means signed variables can be promoted to pointer width, and the compiler doesn't have to redo sign-extension to pointer width every time they're used as an array index. (`a[i]` is equivalent to `*(a+i)`, and the rules for adding integers to pointers mean that sign-extension is necessary for narrow values whose upper register bits might not match.)
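For example (my own toy function, not from the question), this is why a signed counter is easy for the compiler to widen:

```c
/* Sketch: with a signed counter, signed-overflow UB means the compiler may
 * keep i in a 64-bit register (or turn it into a pointer increment) without
 * re-doing sign-extension on every use of a[i]. */
double sum_signed(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```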
Signed-overflow UB is why ICC is able to optimize properly for a signed counter even though it fails to use range info. See also http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html (about undefined behaviour). Notice that it's using `add rax, 64` and `cmp` with 64-bit operand-size (RAX instead of EAX).
I made your code into an MCVE to test with other compilers. `__assume_aligned` is ICC-only, so I used the GNU C `__builtin_assume_aligned`.
#define COUNTER_TYPE unsigned

double sum(const double *a) {
    a = __builtin_assume_aligned(a, 64);
    double s = 0.0;
    for ( COUNTER_TYPE i = 0; i < 1024*1024; i++ )
        s += a[i];
    return s;
}
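If you want one source file that keeps the alignment hint with both compilers, something along these lines should work; the macro name is made up for this sketch, and `__INTEL_COMPILER` / `__GNUC__` are the usual predefined macros (checked in that order because ICC also defines `__GNUC__`):

```c
/* Sketch of a portable 64-byte alignment hint: ICC's __assume_aligned is used
 * as a statement, while GNU C's __builtin_assume_aligned returns the pointer. */
#if defined(__INTEL_COMPILER)
  #define ASSUME_ALIGNED_64(p)  __assume_aligned((p), 64)
#elif defined(__GNUC__)
  #define ASSUME_ALIGNED_64(p)  ((p) = __builtin_assume_aligned((p), 64))
#else
  #define ASSUME_ALIGNED_64(p)  ((void)0)   /* no hint available */
#endif
```

Then the function body would use `ASSUME_ALIGNED_64(a);` in place of the assignment above.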
clang compiles your function like this (Godbolt compiler explorer):
# clang 7.0 -O3
sum: # @sum
xorpd xmm0, xmm0
xor eax, eax
xorpd xmm1, xmm1
.LBB0_1: # =>This Inner Loop Header: Depth=1
addpd xmm0, xmmword ptr [rdi + 8*rax]
addpd xmm1, xmmword ptr [rdi + 8*rax + 16]
addpd xmm0, xmmword ptr [rdi + 8*rax + 32]
addpd xmm1, xmmword ptr [rdi + 8*rax + 48]
addpd xmm0, xmmword ptr [rdi + 8*rax + 64]
addpd xmm1, xmmword ptr [rdi + 8*rax + 80]
addpd xmm0, xmmword ptr [rdi + 8*rax + 96]
addpd xmm1, xmmword ptr [rdi + 8*rax + 112]
add rax, 16 # 64-bit loop counter
cmp rax, 1048576
jne .LBB0_1
addpd xmm1, xmm0
movapd xmm0, xmm1 # horizontal sum
movhlps xmm0, xmm1 # xmm0 = xmm1[1],xmm0[1]
addpd xmm0, xmm1
ret
I didn't enable AVX; that doesn't change the loop structure. Note that clang only uses 2 vector accumulators, so it will bottleneck on FP-add latency on most recent CPUs if the data is hot in L1d cache. Skylake can keep up to 8 `addpd` in flight at once (2-per-clock throughput with 4-cycle latency). So ICC does a much better job for cases where (some of) the data is hot in L2 or especially L1d cache.
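If you want the extra accumulators from the source side rather than hoping the compiler's unroller provides them, the usual trick is to split the sum by hand. A rough sketch (the function name is mine); note that this changes the order of the FP additions, so the result can differ in the last few bits from a strictly sequential sum:

```c
/* Sketch: 8 independent partial sums hide FP-add latency (roughly matching
 * Skylake's 4-cycle latency x 2-per-clock addpd throughput); the compiler
 * can still vectorize each chain. */
double sum_multi_acc(const double *a) {
    a = __builtin_assume_aligned(a, 64);
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (unsigned i = 0; i < 1024*1024; i += 8) {
        s0 += a[i+0];  s1 += a[i+1];  s2 += a[i+2];  s3 += a[i+3];
        s4 += a[i+4];  s5 += a[i+5];  s6 += a[i+6];  s7 += a[i+7];
    }
    return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}
```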
It's strange that clang didn't use a pointer increment, if it's going to add/cmp anyway. It would only take a couple of extra instructions ahead of the loop, and would simplify the addressing modes, allowing micro-fusion of the load even on Sandybridge. (But this isn't AVX code, so Haswell and later can keep the load micro-fused anyway; see Micro fusion and addressing modes.) GCC does use a pointer increment, but doesn't unroll at all, which is GCC's default without profile-guided optimization.
Anyway, ICC's AVX512 code will un-laminate into separate load and add uops in the issue/rename stage (or before being added to the IDQ, I'm not sure). So it's pretty silly that it doesn't use a pointer increment to save front-end bandwidth, consume less ROB space for a larger out-of-order window, and be more hyperthreading-friendly.
Footnote 1:

(And even if it wasn't, an infinite loop with no side effects such as a `volatile` or atomic access is undefined behaviour, so even with `i <= n` and a runtime-variable `n`, the compiler would be allowed to assume the loop wasn't infinite and thus that `i` didn't wrap. See Is while(1); undefined behavior in C?)
In practice gcc and clang don't take advantage of this: they make a loop that really is potentially infinite, and don't auto-vectorize because of that possible weirdness. So avoid `i <= n` with a runtime-variable `n`, especially for unsigned compares; use `i < n` instead.
If unrolling, `i += 2` can have a similar effect.
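A quick sketch of the patterns to avoid (my own toy example, not from the question):

```c
/* Sketch: with a runtime-variable unsigned n these loops genuinely can be
 * infinite, so the compiler has to allow for i wrapping. */
void clear(double *a, unsigned n) {
    for (unsigned i = 0; i <= n; i++)     /* never terminates if n == UINT_MAX */
        a[i] = 0.0;
    for (unsigned i = 0; i < n; i += 2)   /* wraps past n and never terminates if n == UINT_MAX */
        a[i] = 0.0;
}
```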
So writing the end-pointer and pointer-increment into the source is often good, because that's often optimal for the asm.
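In source that would look something like this (a sketch; the function name is mine, and it computes the same thing as the MCVE above):

```c
/* Sketch: end pointer computed once before the loop, pointer increment inside.
 * This gives simple [reg] addressing modes that stay micro-fused, as discussed above. */
double sum_ptr(const double *a) {
    a = __builtin_assume_aligned(a, 64);
    const double *end = a + 1024*1024;
    double s = 0.0;
    for (const double *p = a; p != end; p++)
        s += *p;
    return s;
}
```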