More efficient Asm with unconventional for-loop?

Question

I was playing around with Compiler Explorer, trying to learn a little more about ARM assembly. I'm using arm64 msvc v19.latest. I noticed that I get one branch fewer when I write the loop like this:

int main(){
    for(unsigned i = 0; i<8;)
    i++;
    return 0;
}

compared to the "conventional" way of writing a for-loop like this:

int main(){
    for(unsigned i = 0; i<8;i++)
    ;
    return 0;
}

Is it therefore more efficient to write the for-loop in an unconventional way? I'll paste in both asm to compare. First with the unconventional method:

        ;Flags[SingleProEpi] functionLength[52] RegF[0] RegI[0] H[0] frameChainReturn[UnChained] frameSize[16]

|main|  PROC
|$LN6|
        sub         sp,sp,#0x10
        mov         w8,#0
        str         w8,[sp]
|$LN2@main|
        ldr         w8,[sp]
        cmp         w8,#8
        bhs         |$LN3@main|
        ldr         w8,[sp]
        add         w8,w8,#1
        str         w8,[sp]
        b           |$LN2@main|
|$LN3@main|
        mov         w0,#0
        add         sp,sp,#0x10
        ret

        ENDP  ; |main|

and the conventional way:

        ;Flags[SingleProEpi] functionLength[56] RegF[0] RegI[0] H[0] frameChainReturn[UnChained] frameSize[16]

|main|  PROC
|$LN6|
        sub         sp,sp,#0x10
        mov         w8,#0
        str         w8,[sp]
        b           |$LN4@main|
|$LN2@main|
        ldr         w8,[sp]
        add         w8,w8,#1
        str         w8,[sp]
|$LN4@main|
        ldr         w8,[sp]
        cmp         w8,#8
        bhs         |$LN3@main|
        b           |$LN2@main|
|$LN3@main|
        mov         w0,#0
        add         sp,sp,#0x10
        ret

        ENDP  ; |main|

Turning on optimization will improve the performance more. — MikeCAT, Mar 21 '21 at 11:07
Did you compile this unoptimized? The results for that are pretty bad and non-conclusive. — Devolus, Mar 21 '21 at 11:08
Unless this is `-O3` optimized, any conclusions you reach here are completely irrelevant. Also worth using `++i` instead as on some things, like iterators, this can be more efficient, or at least easier to optimize. — tadman, Mar 21 '21 at 11:11
Yes, your code has no side effects, so a good compiler should replace all of it with just `return 0`. — OznOg, Mar 21 '21 at 11:11
@OznOg That's a good point. `int x = 0;` then inside `x += i;` would solve that so long as `x` is later used. — tadman, Mar 21 '21 at 11:12
The entire loop is removed when optimizations are enabled. And both `for(unsigned i=0; i<8; ) { printf("%d\n", i); i++; }` and the "standard" version produce the same unrolled loop with optimizations enabled. — ikegami, Mar 21 '21 at 11:12
@tadman, even that may not be enough: the compiler may unroll the loop completely and add a constant instead of doing the computations... It's hard to trick the compiler with such small pieces of code. — OznOg, Mar 21 '21 at 11:14
@OznOg It's amusing that "optimizing" compilers used to be so dumb, and now we have to be really smart to even guess what they might do. — tadman, Mar 21 '21 at 11:15
Turning on optimization kind of defeats the purpose in this example: the for-loop doesn't get anything done here, so with optimizations turned on, the asm is just: ret 0. That cannot happen in a non-trivial example. — Markus Schmidt, Mar 21 '21 at 11:19
@tadman, The code still has no side effects then. But let's say you also add `printf("%d\n", x);`. Then the compiler generates the same code as `printf("%d\n", 28);` — ikegami, Mar 21 '21 at 11:20
Re "*Turning on optimization kind of defeats the purpose in this example,*", No, *you* missed the point: There's no point in comparing how optimized two unoptimized assemblies are. — ikegami, Mar 21 '21 at 11:21
@ikegami: I think the OP meant that fully optimizing away isn't interesting, and they don't know how to construct an example that will get the compiler to do something interesting while optimizing. Added [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) to the duplicate list for tips on how to do that. e.g. for this case, https://godbolt.org/z/Wer9eE uses a volatile store inside the loop, or correct inline asm to make the compiler forget about the value of `i`. (see my comments on 0___________'s answer.) — Peter Cordes, Mar 21 '21 at 15:34
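
A minimal sketch of the volatile-store idea from the last comment, using only standard C so it should also build with MSVC. The names `sink`, `store_conventional`, and `store_unconventional` are made up for illustration, and this is not the code behind the godbolt link. Because every store to a `volatile` object counts as an observable side effect, the compiler may still unroll or restructure the loop, but it cannot drop the eight stores:

```c
/* Hypothetical names; a sketch, not the code from the linked demo. */
volatile unsigned sink;      /* volatile: each store must actually happen */

void store_conventional(void) {
    for (unsigned i = 0; i < 8; i++)
        sink = i;            /* body first, increment in the for-header */
}

void store_unconventional(void) {
    for (unsigned i = 0; i < 8;) {
        sink = i;
        i++;                 /* increment moved into the loop body */
    }
}
```

Compiling both functions with optimization enabled and comparing their assembly is then a meaningful test of whether the position of `i++` matters.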

ikegami · Accepted Answer · 2021-03-21T11:24:45.957

2

If you want optimized code, ask your compiler for it! There's no point in examining how optimized unoptimized code is.

-O3 completely eliminates the loop.

Compiler Explorer demo: standard
Compiler Explorer demo: non-standard

If we add something with a side-effect to the loop, we get the exact same result from both approaches.

Compiler Explorer demo: standard
Compiler Explorer demo: non-standard

That optimized code is the equivalent of

printf("%d\n", 1);
printf("%d\n", 2);
printf("%d\n", 3);
printf("%d\n", 4);
printf("%d\n", 5);
printf("%d\n", 6);
printf("%d\n", 7);
printf("%d\n", 8);
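
For comparison, here is a sketch of the `x += i` variant discussed in the comments under the question, with hypothetical function names. Both functions compute 0 + 1 + ... + 7 = 28. Since that value is known at compile time, an optimizing compiler is free to replace either loop with the constant, which is what the comments report:

```c
/* Hypothetical names; illustrates the `x += i` idea from the comments. */
unsigned sum_conventional(void) {
    unsigned x = 0;
    for (unsigned i = 0; i < 8; i++)
        x += i;              /* increment in the for-header */
    return x;
}

unsigned sum_unconventional(void) {
    unsigned x = 0;
    for (unsigned i = 0; i < 8;) {
        x += i;
        i++;                 /* increment moved into the loop body */
    }
    return x;
}
```

Passing either result to `printf("%d\n", ...)` makes the value used, but it does not force the loop itself to survive optimization.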

edited Mar 21 '21 at 11:24

answered Mar 21 '21 at 11:17

ikegami


You did not prevent the loop from being optimized away, which makes the answer off-topic. – 0___________ Mar 21 '21 at 12:00
@0___________ Did you not read the answer? – ikegami Mar 21 '21 at 12:44
yes, and none of your examples preserve the loop. – 0___________ Mar 21 '21 at 13:13
@0___________, Despite what you say, you obviously did not, because the fact that the loop was unrolled was mentioned. And the fact that this happens supports the point the answer makes, which you apparently also didn't read. – ikegami Mar 21 '21 at 14:39

score 1 · Answer 2 · answered Mar 21 '21 at 11:59

You have two problems with your example:

1. The compiler does not optimize the code.
2. Triviality.

Ad 1. Unoptimized code is not suitable for any performance or assembly-output comparisons.

Ad 2. The code is so trivial that enabling optimizations removes it entirely. You need to add something that prevents the compiler from removing the code.

I will add some memory barriers (gcc)

void foo(){
    for(unsigned i = 0; i<8;)
    {
        i++;
        asm("":"=r"(i):"m"(i));
    }
}

void bar(){
    for(unsigned i = 0; i<8;i++)
    {
        asm("":"=r"(i):"m"(i));
    }
}

The generated code is exactly the same

foo:
        sub     sp, sp, #16
        mov     w0, 0
.L2:
        add     w0, w0, 1
        str     w0, [sp, 12]
        cmp     w0, 7
        bls     .L2
        add     sp, sp, 16
        ret
bar:
        sub     sp, sp, #16
        str     wzr, [sp, 12]
.L7:
        add     w0, w0, 1
        str     w0, [sp, 12]
        cmp     w0, 7
        bls     .L7
        add     sp, sp, 16
        ret

https://godbolt.org/z/zTjnjK

answered Mar 21 '21 at 11:59

0___________


`asm("":"=r"(i):"m"(i));` is super weird; it tells the compiler you want the input in memory, and that after the asm, the value of `i` will be in a register (using a write-only output operand). But your asm template *doesn't* do that load, so it's only luck that the leftover value the compiler happened to leave in the register it picked for `"=r"` is still `i`. This could totally break with more complex surrounding code. – Peter Cordes Mar 21 '21 at 13:45
Use `asm volatile("" : "+r"(i))` if you want to force the compiler to materialize a value in a register and forget what it knows about the value. (Including `volatile` so it can't decide the value is ultimately unused and optimize away the whole loop, which might in theory be possible in C++ where (unlike C) infinite loops without side effects are UB.) – Peter Cordes Mar 21 '21 at 13:45
@PeterCordes and how is it related to the answer? (hint : it should show that the position of the i++ will not affect the generated code) – 0___________ Mar 21 '21 at 14:59
Buggy inline asm that only happens to work is never a good example; it's hard enough for people to learn without finding misleading examples in SO answers. You call it a "memory barrier" but actually you're telling the compiler that `i` takes a value from a register without forcing the compiler to have `i`'s original value in the same register. Storing `i` to memory is a separate effect which you could get with `asm("" :: "m"(i))` as a separate asm statement. – Peter Cordes Mar 21 '21 at 15:08
IDK why you want to force the compiler to store `i` to memory anyway; `asm("" : "+r"(i))` has the desired effect of making a nice optimized do-while loop structure with the branch at the bottom: https://godbolt.org/z/Wer9eE, avoiding any stack manipulation to make space for spilling a local. – Peter Cordes Mar 21 '21 at 15:10
@PeterCordes no the idea was different and my asm is correct. – 0___________ Mar 21 '21 at 15:21
I'm certain (for reasons described earlier which you haven't refuted) that your asm is buggy, and would break if someone copied it into a more complicated benchmark loop or something to try to use it as a "memory barrier" like you describe. (Therefore it's a bad example of how to use inline asm, and needs to be downvoted until/unless it's fixed.) – Peter Cordes Mar 21 '21 at 15:25
Here's an example of your asm statement breaking https://godbolt.org/z/Y3qhhT - with the starting value of `i` as the 2nd function arg, it starts out in `w1`, but GCC still picks `x0/w0` as the output operand for the asm statement, so ends up using the wrong arg as the loop counter. – Peter Cordes Mar 21 '21 at 15:29
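
A sketch of the `asm volatile("" : "+r"(i))` approach suggested in the comments above, applied to both loop shapes. It assumes GCC or Clang, since GNU-style inline asm will not build with MSVC, and the function names are hypothetical. The empty asm statement emits no instructions; the `"+r"` operand tells the compiler that `i` is read and possibly modified in a register, so it has to keep a real loop counter instead of folding the loop away:

```c
/* Hypothetical names; GNU C extension, GCC/Clang only. */
unsigned loop_conventional(void) {
    unsigned i;
    for (i = 0; i < 8; i++) {
        /* "+r": i is both input and output, held in a register */
        __asm__ volatile("" : "+r"(i));
    }
    return i;
}

unsigned loop_unconventional(void) {
    unsigned i;
    for (i = 0; i < 8;) {
        __asm__ volatile("" : "+r"(i));
        i++;                 /* increment moved into the loop body */
    }
    return i;
}
```

Both functions return 8 at run time, because the empty template never changes the register. The `__asm__` spelling is used here so the code also compiles under strict `-std=` modes, where plain `asm` is not a keyword in C.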
