
I was very surprised by the difference between using a signed and an unsigned loop counter in this simple example:

double const* a;
__assume_aligned(a, 64);
double s = 0.0;

//for ( unsigned int i = 0; i < 1024*1024; i++ )
for ( int i = 0; i < 1024*1024; i++ )
{
    s += a[i];
}

In the signed case, icc 19.0.0 produced this (I am showing only the unrolled part of the loop):

..B1.2:
    vaddpd    zmm7, zmm7, ZMMWORD PTR [rdi+rax*8]
    vaddpd    zmm6, zmm6, ZMMWORD PTR [64+rdi+rax*8]
    vaddpd    zmm5, zmm5, ZMMWORD PTR [128+rdi+rax*8]
    vaddpd    zmm4, zmm4, ZMMWORD PTR [192+rdi+rax*8]
    vaddpd    zmm3, zmm3, ZMMWORD PTR [256+rdi+rax*8]
    vaddpd    zmm2, zmm2, ZMMWORD PTR [320+rdi+rax*8]
    vaddpd    zmm1, zmm1, ZMMWORD PTR [384+rdi+rax*8]
    vaddpd    zmm0, zmm0, ZMMWORD PTR [448+rdi+rax*8]
    add       rax, 64
    cmp       rax, 1048576
    jb        ..B1.2        # Prob 99%

In the unsigned case, icc used extra registers to address memory, with corresponding LEAs:

..B1.2:
    lea       edx, DWORD PTR [8+rax]
    vaddpd    zmm6, zmm6, ZMMWORD PTR [rdi+rdx*8]
    lea       ecx, DWORD PTR [16+rax]
    vaddpd    zmm5, zmm5, ZMMWORD PTR [rdi+rcx*8]
    vaddpd    zmm7, zmm7, ZMMWORD PTR [rdi+rax*8]
    lea       esi, DWORD PTR [24+rax]
    vaddpd    zmm4, zmm4, ZMMWORD PTR [rdi+rsi*8]
    lea       r8d, DWORD PTR [32+rax]
    vaddpd    zmm3, zmm3, ZMMWORD PTR [rdi+r8*8]
    lea       r9d, DWORD PTR [40+rax]
    vaddpd    zmm2, zmm2, ZMMWORD PTR [rdi+r9*8]
    lea       r10d, DWORD PTR [48+rax]
    vaddpd    zmm1, zmm1, ZMMWORD PTR [rdi+r10*8]
    lea       r11d, DWORD PTR [56+rax]
    add       eax, 64
    vaddpd    zmm0, zmm0, ZMMWORD PTR [rdi+r11*8]
    cmp       eax, 1048576
    jb        ..B1.2        # Prob 99%

To me, it is surprising that it did not produce the same code (given the compile-time loop count). Is this a compiler optimization problem?

Compile options: -O3 -march=skylake-avx512 -mtune=skylake-avx512 -qopt-zmm-usage=high

user2052436
  • It seems that the compiler cannot deduce that truncation to 32 bits is actually not necessary. How does it work with `size_t`? – zch Oct 17 '18 at 20:23
  • It's a missed optimization because it could prove that `i` won't wrap, but fails to. Signed overflow is UB, which lets it promote the 32-bit loop counter to 64-bit without redoing the sign-extension each time, the way it's redoing the zero-extension here (well, not every time, but every 8). – Peter Cordes Oct 17 '18 at 20:25
  • @zch You are right, `size_t` produces the same code as signed int. – user2052436 Oct 17 '18 at 20:27

1 Answer


This is a silly missed optimization by ICC. It's not specific to AVX512; it still happens with default/generic arch settings.

lea ecx, DWORD PTR [16+rax] is computing i+16 as part of the unroll, with truncation to 32-bit (32-bit operand-size) and zero-extension to 64-bit (implicit in x86-64 when writing a 32-bit register). This explicitly implements the semantics of unsigned wrap-around at the type width.

gcc and clang have no problem proving that unsigned i won't wrap, so they can optimize away the zero-extension from 32-bit unsigned to 64-bit pointer width for use in an addressing mode, because the loop upper bound is known.¹

Recall that unsigned wrap-around is well-defined in C and C++, but signed-overflow is undefined behaviour. That means that signed variables can be promoted to pointer width, and that the compiler doesn't have to redo sign-extension to pointer width every time they're used as an array index. (a[i] is equivalent to *(a+i), and the rules for adding integers to pointers mean that sign-extension is necessary for narrow values where the upper bits of the register might not match.)
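To make that concrete, here is a hand-written sketch (my own illustration, not anything either compiler emits as source) of what each counter type allows:

#include <stdint.h>

/* Unsigned 32-bit counter: the index wraps at 2^32, so every use has to
 * behave like a zero-extension of a 32-bit value to pointer width. */
double sum_unsigned_semantics(const double *a) {
    double s = 0.0;
    for (uint32_t i = 0; i < 1024*1024; i++)
        s += *(a + (uint64_t)i);    /* zero-extend on every use */
    return s;
}

/* Signed 32-bit counter: overflow is UB, so the compiler may legally rewrite
 * it as a single 64-bit induction variable and never re-extend it. */
double sum_signed_promoted(const double *a) {
    double s = 0.0;
    for (int64_t i = 0; i < 1024*1024; i++)
        s += a[i];
    return s;
}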

Signed-overflow UB is why ICC is able to optimize properly for a signed counter even though it fails to use range info. See also http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html (about undefined behaviour). Notice that the signed version uses add rax, 64 and cmp with 64-bit operand-size (RAX instead of EAX).


I made your code into an MCVE to test with other compilers. __assume_aligned is ICC-only, so I used the GNU C __builtin_assume_aligned instead.

#define COUNTER_TYPE unsigned

double sum(const double *a) {
    a = __builtin_assume_aligned(a, 64);
    double s = 0.0;

    for ( COUNTER_TYPE i = 0; i < 1024*1024; i++ )
        s += a[i];
    return s;
}
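
(As an aside, if you want one source file that builds with both ICC and gcc/clang, a small wrapper macro is enough. This is just a sketch, assuming C; the ASSUME_ALIGNED name is mine, and in C++ the GNU branch would need a cast.)

#if defined(__INTEL_COMPILER)
  /* ICC form: a statement, no return value */
  #define ASSUME_ALIGNED(p, n) __assume_aligned((p), (n))
#elif defined(__GNUC__)    /* gcc and clang */
  /* GNU form: returns the pointer, so assign it back */
  #define ASSUME_ALIGNED(p, n) ((p) = __builtin_assume_aligned((p), (n)))
#else
  #define ASSUME_ALIGNED(p, n) ((void)0)
#endif

You would then replace the __builtin_assume_aligned line above with ASSUME_ALIGNED(a, 64);.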

clang compiles your function like this (Godbolt compiler explorer):

# clang 7.0 -O3
sum:                                    # @sum
    xorpd   xmm0, xmm0
    xor     eax, eax
    xorpd   xmm1, xmm1
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
    addpd   xmm0, xmmword ptr [rdi + 8*rax]
    addpd   xmm1, xmmword ptr [rdi + 8*rax + 16]
    addpd   xmm0, xmmword ptr [rdi + 8*rax + 32]
    addpd   xmm1, xmmword ptr [rdi + 8*rax + 48]
    addpd   xmm0, xmmword ptr [rdi + 8*rax + 64]
    addpd   xmm1, xmmword ptr [rdi + 8*rax + 80]
    addpd   xmm0, xmmword ptr [rdi + 8*rax + 96]
    addpd   xmm1, xmmword ptr [rdi + 8*rax + 112]
    add     rax, 16                                  # 64-bit loop counter
    cmp     rax, 1048576
    jne     .LBB0_1

    addpd   xmm1, xmm0
    movapd  xmm0, xmm1         # horizontal sum
    movhlps xmm0, xmm1              # xmm0 = xmm1[1],xmm0[1]
    addpd   xmm0, xmm1
    ret

I didn't enable AVX; that doesn't change the loop structure. Note that clang only uses 2 vector accumulators, so it will bottleneck on FP add latency on most recent CPUs if data is hot in L1d cache. Skylake can keep up to 8 addpd in flight at once (2-per-clock throughput with 4-cycle latency). So ICC does a much better job for cases where (some of) the data is hot in L2 or especially L1d cache.
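If that FP-add latency bottleneck matters and you don't want to rely on the compiler, one source-level workaround is to split the reduction over several accumulators yourself. This is only a sketch of the idea (my code, not from the question or the answer); how many accumulators a given compiler then maps onto vector registers still varies:

/* Multiple accumulators written out in the source, so consecutive adds are
 * independent and can overlap.  1024*1024 is divisible by 4, so no cleanup
 * loop is needed here. */
double sum_4acc(const double *a) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (unsigned i = 0; i < 1024*1024; i += 4) {
        s0 += a[i];
        s1 += a[i+1];
        s2 += a[i+2];
        s3 += a[i+3];
    }
    return (s0 + s1) + (s2 + s3);
}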

It's strange that clang didn't use a pointer increment, if it's going to add/cmp anyway. It would only take a couple of extra instructions ahead of the loop, and would simplify the addressing modes, allowing micro-fusion of the load even on Sandybridge. (But since it's not AVX, Haswell and later can keep the load micro-fused anyway; see Micro fusion and addressing modes.) GCC does use a pointer increment, but doesn't unroll at all, which is GCC's default without profile-guided optimization.

Anyway, ICC's AVX512 code will un-laminate into separate load and add uops in the issue/rename stage (or before being added to the IDQ, I'm not sure). So it's pretty silly that it doesn't use a pointer increment to save front-end bandwidth, consume less ROB space for a larger out-of-order window, and be more hyperthreading-friendly.


Footnote 1:

(And even if it weren't, an infinite loop with no side effects such as a volatile or atomic access is undefined behaviour, so even with i <= n and a runtime-variable n, the compiler would be allowed to assume the loop isn't infinite and thus that i doesn't wrap. See Is while(1); undefined behavior in C?)

In practice gcc and clang don't take advantage of this: they make a loop that actually is potentially infinite, and they don't auto-vectorize because of that possible weirdness. So avoid i <= n with a runtime-variable n, especially for unsigned compares; use i < n instead.
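For example (my illustration, not code from the answer), with a runtime n the compiler can't rule out the maximum value, and n == UINT_MAX makes the condition always true:

double sum_to_n(const double *a, unsigned n) {
    double s = 0.0;
    for (unsigned i = 0; i <= n; i++)   /* never exits if n == UINT_MAX: i wraps to 0 */
        s += a[i];
    return s;
}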

If unrolling, i += 2 can have a similar effect.

So writing the loop with an end pointer and a pointer increment in the source is often a good idea, because that's typically what's optimal for the asm.
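A sketch of that style for this loop (again my code, not from the answer):

/* End-pointer form: no separate index to sign- or zero-extend, just a
 * pointer increment and a pointer compare. */
double sum_ptr(const double *a) {
    double s = 0.0;
    const double *end = a + 1024*1024;
    for (const double *p = a; p != end; p++)
        s += *p;
    return s;
}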

Peter Cordes