while(i--) optimization by gcc and clang: why don't they use sub / jnc?

Question

Some people write code like this when they need a loop with no counter, or with a counter that runs n-1, ..., 0:

while (i--) { ... }

A specific example:

volatile int sink;
void countdown_i_used() {
    unsigned i = 1000;
    while (i--) {
         sink = i;  // if i is unused, gcc optimizes it away and uses dec/jnz
    }
}

On GCC 8.2 (on the Godbolt compiler explorer), it's compiled into

# gcc8.2 -O3 -march=haswell
.L2:
    mov     DWORD PTR sink[rip], eax
    dec     eax                      # with tune=generic,  sub eax, 1
    cmp     eax, -1
    jne     .L2

On clang (https://godbolt.org/z/YxYZ95), if the counter is not used it turns into

if(i) do {...} while(--i);

but if used, like GCC it turns into

add esi, -1
cmp esi, -1
jnz lp

However, this seems to be a better idea:

sub esi, 1
jnc lp

Why don't these two compilers do it this way?

Is it because the cmp version is better? Or because this wouldn't save any space and the two are almost the same speed?

Or do they just not consider this option?

Update: Even if I write the code to use the carry flag explicitly (here I use add/jc, but the idea is the same)

bool addcy(unsigned& a, unsigned b) {
    unsigned last_a = a;
    a+=b;
    return last_a+b<last_a;
}
volatile unsigned sink;
void f() {

    for (unsigned i=100; addcy(i, -1); ) {
        sink = i;
    }
}

the compiler still compiles it as a check for equality with -1. However, if the 100 is replaced with a runtime input, the jc version remains.
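As a sanity check on the idiom itself, here is a minimal sketch comparing the question's `addcy` against the GCC/Clang builtin `__builtin_add_overflow`, which reports unsigned carry-out directly. The helper names `addcy_builtin`, `count_iterations` and `check_same_carry` are made up for illustration and are not part of the original post.

```cpp
#include <cassert>

// Carry-out idiom from the question: the wrapped sum is smaller than
// the original operand exactly when the addition carried out.
bool addcy(unsigned& a, unsigned b) {
    unsigned last_a = a;
    a += b;
    return last_a + b < last_a;
}

// Hypothetical variant using the GCC/Clang builtin, which returns true
// when the unsigned addition wraps (i.e. when CF would be set).
bool addcy_builtin(unsigned& a, unsigned b) {
    return __builtin_add_overflow(a, b, &a);
}

// Counts how many times the loop body from the update would run.
unsigned count_iterations(unsigned start) {
    unsigned n = 0;
    for (unsigned i = start; addcy(i, -1); )  // -1 converts to UINT_MAX
        ++n;
    return n;
}

// Asserts that both versions agree on the carry flag and on the sum.
void check_same_carry(unsigned a, unsigned b) {
    unsigned x = a, y = a;
    bool c1 = addcy(x, b);
    bool c2 = addcy_builtin(y, b);
    assert(c1 == c2 && x == y);
}
```

Adding UINT_MAX carries for every nonzero `i`, so a start value of 100 runs the body 100 times (with `i` going 99 down to 0) and then exits on the first add that does not carry. This only checks the source-level semantics; which instructions the compiler picks still has to be inspected in the asm output.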

gcc is surprisingly bad at optimizing loops down to one macro-fused uop of overhead. With `for(int i=0 ; i < size ; i++)` loops, you tend to get `add` / `cmp/jcc` instead of `dec size/jnz` even if `i` is unused inside the loop. But yes `sub` / `jnc` would probably be optimal; that can macro-fuse on Sandybridge-family. — Peter Cordes, Jan 20 '19 at 15:57
It looks like you're using `i` inside the loop, otherwise gcc does better. — Peter Cordes, Jan 20 '19 at 16:07

score 8 · Accepted Answer · answered Jan 20 '19 at 16:26

Yes, this is a missed optimization. Intel Sandybridge-family can macro-fuse sub/jcc into a single uop, so sub/jnc saves code-size, x86 instructions, and uops on those CPUs.

On other CPUs (e.g. AMD which can only fuse test/cmp with jcc), this still saves code size so it's at least slightly better. It's not worse on anything.
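The reason the two exit tests are interchangeable can be sketched in C++: `sub reg, 1` sets the carry flag exactly when the old value was 0, which is the same case in which the decremented value compares equal to -1. The sketch below assumes GCC or Clang (for `__builtin_sub_overflow`); the function names are hypothetical and only illustrate the equivalence, not what either compiler emits.

```cpp
#include <cassert>
#include <vector>

// The loop as written in the question; records each value the body sees.
std::vector<unsigned> countdown_cmp(unsigned i) {
    std::vector<unsigned> seen;
    while (i--)
        seen.push_back(i);
    return seen;
}

// Same loop with an explicit borrow check, mirroring sub / jnc:
// the builtin returns true when i - 1 wraps, i.e. when i was 0.
std::vector<unsigned> countdown_borrow(unsigned i) {
    std::vector<unsigned> seen;
    while (!__builtin_sub_overflow(i, 1u, &i))
        seen.push_back(i);
    return seen;
}

// Asserts that both loop shapes visit the same values in the same order.
void check_equivalent(unsigned n) {
    assert(countdown_cmp(n) == countdown_borrow(n));
}
```

Both functions visit n-1, ..., 0 for any start value n, including n = 0 where the body never runs, so replacing cmp / jne with a borrow check does not change behaviour.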

It would be a good idea to report missed-optimization bugs on https://bugs.llvm.org and https://gcc.gnu.org/bugzilla/.

Peter Cordes

I found that gcc even compiles `jc` code back into `jnz` code, so what's happening? – l4m2 Dec 19 '19 at 05:24
@l4m2: You don't have a literal `jc` in your update, you just have an idiom that compilers *can* recognize as a carry-out check. But with a constant trip-count, the compiler can see when the loop will end and compile it in the sub-optimal way it apparently prefers. Report a missed-optimization bug if you want compiler devs to change gcc and LLVM to not screw this up. – Peter Cordes Dec 19 '19 at 05:30
