
Looking at the ICC 17 generated code for iterating over a std::unordered_map<> (using https://godbolt.org) left me very confused.

I distilled down the example to this:

long count(void** x)
{
  long i = 0;
  while (*x)
  {
    ++i;
    x = (void**)*x;
  }
  return i;
}

Compiling this with ICC 17, with the -O3 flag, leads to the following disassembly:

count(void**):
        xor       eax, eax                                      #6.10
        mov       rcx, QWORD PTR [rdi]                          #7.11
        test      rcx, rcx                                      #7.11
        je        ..B1.6        # Prob 1%                       #7.11
        mov       rdx, rax                                      #7.3
..B1.3:                         # Preds ..B1.4 ..B1.2
        inc       rdx                                           #7.3
        mov       rcx, QWORD PTR [rcx]                          #7.11
        lea       rsi, QWORD PTR [rdx+rdx]                      #9.7
        lea       rax, QWORD PTR [-1+rdx*2]                     #9.7
        test      rcx, rcx                                      #7.11
        je        ..B1.6        # Prob 18%                      #7.11
        mov       rcx, QWORD PTR [rcx]                          #7.11
        mov       rax, rsi                                      #9.7
        test      rcx, rcx                                      #7.11
        jne       ..B1.3        # Prob 82%                      #7.11
..B1.6:                         # Preds ..B1.3 ..B1.4 ..B1.1
        ret                                                     #12.10

Compared to the obvious implementation (which gcc and clang use, even for -O3), it seems to do a few things differently:

  1. It unrolls the loop, with two increments before looping back - however, there is a conditional jump in the middle of it all.
  2. It uses lea for some of the arithmetic.
  3. It keeps a counter (inc rdx) for every two iterations of the while loop, and immediately computes the corresponding per-iteration counters (into rax and rsi).

What are the potential benefits to doing all this? I assume it may have something to do with scheduling?

Just for comparison, this is the code generated by gcc 6.2:

count(void**):
        mov     rdx, QWORD PTR [rdi]
        xor     eax, eax
        test    rdx, rdx
        je      .L4
.L3:
        mov     rdx, QWORD PTR [rdx]
        add     rax, 1
        test    rdx, rdx
        jne     .L3
        rep ret
.L4:
        rep ret
  • Advantages of `lea` include: (1) It allows two source operands, both of which may be different from the result, while `add` requires one source operand to be identical to the result; using `lea` may avoid an additional `mov` to preserve the shared source operand. (2) It allows simple multiplication by means of the built-in scale factor. (3) It doesn't affect flags, allowing greater flexibility in scheduling the instruction. – njuffa Dec 10 '16 at 23:00
  • `lea` has been used for arithmetic since the beginning of time. Basically, if it is more complicated than `inc`/`dec` and `lea` can do it, then `lea` is the most efficient way to do it. For that reason, it is not clear what prompted your question about `lea`. If you can read assembly, then you should know about `lea` and its role already. – AnT stands with Russia Dec 11 '16 at 04:06

2 Answers


This isn't a great example because the loop trivially bottlenecks on pointer-chasing latency, not uop throughput or any other kind of loop-overhead. But there can be cases where having fewer uops helps an out-of-order CPU see farther ahead, maybe. Or we can just talk about the optimizations to the loop structure and pretend they matter, e.g. for a loop that did something else.


Unrolling is potentially useful in general, even when the loop trip-count is not computable ahead of time. (e.g. in a search loop like this one, which stops when it finds a sentinel). A not-taken conditional branch is different from a taken branch, since it doesn't have any negative impact on the front-end (when it predicts correctly).

Basically ICC just did a bad job unrolling this loop. The way it uses LEA and MOV to handle i is pretty braindead, since it uses more uops than two inc rax instructions would. (Although it does make the critical path shorter, on IvB and later which have zero-latency mov r64, r64, so out-of-order execution can get ahead on running those uops).

Of course, since this particular loop bottlenecks on the latency of pointer-chasing, you're getting at best a long-chain throughput of one per 4 clocks (L1 load-use latency on Skylake, for integer registers), or one per 5 clocks on most other Intel microarchitectures. (I didn't double-check these latencies; don't trust those specific numbers, but they're about right).

IDK if ICC analyses loop-carried dependency chains to decide how to optimize. If so, it should probably have just not unrolled at all, if it knew it was doing a poor job when it did try to unroll.

For a short chain, out-of-order execution might be able to get started on running something after the loop, if the loop-exit branch predicts correctly. In that case, it is useful to have the loop optimized.

Unrolling also throws more branch-predictor entries at the problem. Instead of one loop-exit branch with a long pattern (e.g. not-taken after 15 taken), you have two branches. For the same example, one that's never taken, and one that's taken 7 times then not-taken the 8th time.


Here's what a hand-written unrolled-by-two implementation looks like:

Fix up i in the loop-exit path for one of the exit points, so you can handle it cheaply inside the loop.

count(void**):
    xor       eax, eax              # counter
    mov       rcx, QWORD PTR [rdi]  # *x
    test      rcx, rcx
    je        ..B1.6
.p2align 4   # mostly to make it more likely that the previous test/je doesn't decode in the same block as the following test/je, so it doesn't interfere with macro-fusion on pre-HSW
.loop:
    mov       rcx, QWORD PTR [rcx]
    test      rcx, rcx
    jz        .plus1

    mov       rcx, QWORD PTR [rcx]
    add       rax, 2
    test      rcx, rcx
    jnz       .loop
..B1.6:
    ret

.plus1:           # exit path for odd counts
    inc       rax
    ret

This makes the loop body 5 fused-domain uops if both TEST/JCC pairs macro-fuse. Haswell can make two fusions in a single decode group, but earlier CPUs can't.

gcc's implementation is only 3 uops, which is less than the issue width of the CPU. See this Q&A about small loops issuing from the loop buffer. No CPU can actually execute/retire more than one taken branch per clock, so it's not easy to test how CPUs issue loops of fewer than 4 uops, but apparently Haswell can issue a 5-uop loop at one per 1.25 cycles. Earlier CPUs might only issue it at one per 2 cycles.

Peter Cordes
  • Do I understand correctly that the "more branch predictor entries" point means that if I commonly have a linked list of exactly one element, it would branch predict better by marking the first branch as usually-taken and the second one as usually-not-taken? – Aristid Breitkreuz Dec 11 '16 at 11:20
  • @AristidBreitkreuz: Yes, exactly. Branch predictors update themselves based on what actually happened, so after a few calls with lists of length 1, the unrolled version would have settled into that very simple prediction pattern. (But note that they'd both predict strongly *not*-taken: the first to stay in the loop, the second to fall out of the loop.) For longer lists, modern branch predictors can "lock on" to patterns like alternating taken/not-taken, and stuff like that. (Not much is published about exactly what they can do; it's part of the CPU vendors' secret sauce.) – Peter Cordes Dec 11 '16 at 19:29
  1. There's no definite answer to why it does this, as ICC is a proprietary compiler. Only Intel knows why. That said, the Intel compiler is often more aggressive in loop optimization. That does not mean it is better. I have seen situations where Intel's aggressive inlining led to worse performance than clang/gcc. In those cases, I had to explicitly forbid inlining at some call sites. Similarly, it is sometimes necessary to forbid unrolling via pragmas in Intel C++ to get better performance.

  2. lea is a particularly useful instruction. It allows one shift, two additions, and a move, all in just one instruction. It is much faster than doing those four operations separately. However, it does not always make a difference. And if lea is used only for an addition or a move, it may or may not be better. So you see that at 7.11 it uses a plain move, while in the next two lines lea is used to do an addition plus a move, and then an addition, a shift, plus a move.

  3. I don't see any notable benefit here.

Yan Zhou
    ICC is good at auto-vectorizing, but I've often seen worse scalar integer code from it than clang or gcc on godbolt. I haven't benchmarked it, though, and CPUs are often able to plow through a lot of extra instructions, so I don't know how much impact the obviously-worse code will have in the cases I've seen. – Peter Cordes Dec 11 '16 at 03:55
  • I added an answer to point out that ICC really just did a bad job here. LEA is useful, but it shouldn't have been doing that much work *inside* the loop in the first place. – Peter Cordes Dec 11 '16 at 03:59
  • @PeterCordes I agree. It is often too aggressive, always sacrificing code size in favor of some superstitious gain in speed – Yan Zhou Dec 11 '16 at 07:08
  • I didn't mean overly-aggressive code-size problems. I just meant missed-optimizations for transforming integer stuff, compared to the best of either clang or gcc for getting the same thing done with fewer instructions. – Peter Cordes Dec 11 '16 at 08:26
  • @PeterCordes I know. I was talking about ICC in general. Maybe different people have different issues with it because of what we program for – Yan Zhou Dec 11 '16 at 09:09