
TL;DR: I have a loop that executes in 1 cycle per iteration on Skylake (it does 3 additions + 1 fused inc/jump).

When I unroll it more than 2 times (no matter by how much), my program runs about 25% slower. It might have something to do with alignment, but I can't see what exactly.

EDIT: this question used to also ask why uops were delivered by the DSB rather than the MITE. That part has now been moved to this question.


I was trying to benchmark a loop which does 3 additions on my Skylake. This loop should execute in one cycle per iteration, since the 3 adds plus the increment, which macro-fuses with the conditional jump, can all execute in one cycle. And it does, as expected.

However, at some point, my C compiler decided to unroll that loop, yielding worse performance. I'm now trying to understand why the unrolled loop performs worse than the non-unrolled one: I expected both to have the same performance, or at worst the unrolled one to be less than 15% slower.

Here is my C code:

    int main() {
      // Values don't matter (the full version initialized them randomly);
      // initializing avoids reading uninitialized variables.
      int a = 0, b = 0, c = 0, d = 0;

      #pragma unroll(2)
      for (unsigned long i = 0; i < 2000000000; i++) {
        // Empty asm with "+r" constraints: forces a, b, c, d into
        // registers and stops the compiler from folding the additions.
        asm volatile("" : "+r" (a), "+r" (b), "+r" (c), "+r" (d));
        a = a + d;
        b = b + d;
        c = c + d;
      }

      // Prevent the results from being optimized out
      asm volatile("" : "+r" (a), "+r" (b), "+r" (c));
    }
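
For reference, a minimal way to reproduce this (the file and binary names main.c and bench are just my choices):

    clang -O3 -S main.c -o main.s    # emit the assembly shown below
    clang -O3 main.c -o bench        # build the benchmark binary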

Compiling with Clang 7.0.0 -O3 produces the following (cleaned) assembly (called v1 from now on):

    movl    $2000000000, %esi
    .p2align    4, 0x90
.LBB0_1:
    addl    %edi, %edx
    addl    %edi, %ecx
    addl    %edi, %eax
    addl    %edi, %edx
    addl    %edi, %ecx
    addl    %edi, %eax
    addq    $-2, %rsi
    jne .LBB0_1

And benchmarking with perf stat -e cycles shows that it runs at about 2 cycles per iteration of the unrolled loop, i.e. 1 cycle per original iteration, as expected.
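
Concretely, the measurement is just a whole-program cycle count (using the bench binary from above):

    perf stat -e cycles ./bench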

However, replacing any of the registers with a "new" 64-bit register (r8 through r15) causes the loop to execute in 3 cycles per iteration instead of 2 (let's call this code v2):

    movl    $2000000000, %esi
    .p2align    4, 0x90
.LBB0_1:
    addl    %edi, %r14d
    addl    %edi, %ecx
    addl    %edi, %eax
    addl    %edi, %r14d
    addl    %edi, %ecx
    addl    %edi, %eax
    addq    $-2, %rsi
    jne .LBB0_1

This is not a contrived example: Clang actually produces this loop if I add some stuff to my program and get unlucky. My initial version was the same C code plus random initialization of the variables, a warmup phase, and rdtscp to time the loop; there, Clang used r14d in the loop. This loop executes at about 3 cycles/iteration.
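
For completeness, here is a minimal sketch of what such an rdtscp-timed version could look like. This is my reconstruction, not the original code: the warmup length and initial values are placeholders, and note that rdtscp counts reference cycles rather than core clock cycles.

    #include <stdio.h>
    #include <x86intrin.h>

    int main() {
      int a = 1, b = 2, c = 3, d = 4;   // stand-ins for the random init

      // Hypothetical warmup phase
      for (unsigned long i = 0; i < 100000000; i++) {
        asm volatile("" : "+r" (a), "+r" (b), "+r" (c), "+r" (d));
        a += d; b += d; c += d;
      }

      unsigned aux;
      unsigned long long start = __rdtscp(&aux);
      for (unsigned long i = 0; i < 2000000000; i++) {
        asm volatile("" : "+r" (a), "+r" (b), "+r" (c), "+r" (d));
        a += d; b += d; c += d;
      }
      unsigned long long stop = __rdtscp(&aux);

      asm volatile("" : "+r" (a), "+r" (b), "+r" (c));
      printf("%llu reference cycles\n", stop - start);
    }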

Further testing shows that unrolling the loop by any factor greater than 2 makes the program execute in about 2.5 billion cycles (vs 2 billion for the non-unrolled one), i.e. 1.25 cycles per original iteration instead of 1: the 25% slowdown from the TL;DR.
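
For instance, unrolled 3 times the loop would look like this (a sketch extrapolated from v1, not Clang's verbatim output; the remainder iterations are peeled off and omitted):

        movl    $1999999998, %esi   # multiple of 3; 2 leftover iterations peeled off
        .p2align    4, 0x90
    .LBB0_1:
        addl    %edi, %edx
        addl    %edi, %ecx
        addl    %edi, %eax
        addl    %edi, %edx
        addl    %edi, %ecx
        addl    %edi, %eax
        addl    %edi, %edx
        addl    %edi, %ecx
        addl    %edi, %eax
        addq    $-3, %rsi
        jne .LBB0_1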

The number of uops in the loop is 3*n+1 (where n is the unrolling factor and the 1 represents the fused add/jne), which means the loop unrolled 3 times has 10 uops, 4 times has 13 uops, etc. These are small enough uop counts to easily fit in the DSB (the uop cache). I'm on a Skylake whose microcode includes the fix for erratum SKL150, so the LSD (loop buffer) is disabled.

Furthermore, unrolling 3, 4, 10, or 50 times doesn't change the performance at all: my code always runs in 2.5 billion cycles (whereas the non-unrolled one runs in 2 billion). This is somewhat surprising: the 3 additions should still execute in 1 cycle, so if an extra cycle were lost at the end of each loop iteration for some reason, its overhead should be amortized as the unrolling factor increases, and the asymptotic (in the unrolling factor) performance should approach 2 billion cycles.

Both llvm-mca and IACA predict that the loop unrolled n times executes in n cycles per iteration (which would make the whole program run in 2 billion cycles).
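
For reference, a plausible llvm-mca invocation for such a prediction (the file name loop.s and the flags are my choice; IACA works similarly via its marker macros):

    llvm-mca -mcpu=skylake -iterations=1000 loop.s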

To sum up, the question is: why is my loop 25% slower as soon as I unroll more than 2 times?

Dada
    Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/206633/discussion-on-question-by-dada-unrolling-1-cycle-loop-reduces-performance-by-25). – Samuel Liew Jan 25 '20 at 14:11

0 Answers