
Here is a very simple C program:

int main()
{
    int n = 0;

    while (n != 1000000000) {
        n += 1;
    }

    return n;
}

I compiled it with Clang and timed it. It ran in 4.711095243692398e-06 seconds, i.e. about 4.7 microseconds.
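
The timings in this question come from a simple wall-clock harness. Here is a minimal sketch of that kind of measurement, using clock_gettime (the count() wrapper and the exact code are illustrative assumptions, not my literal harness):

#include <stdio.h>
#include <time.h>

/* The loop under test, same as main() above.
   Note: with -O3, Clang folds this to a constant as well. */
static int count(void)
{
    int n = 0;
    while (n != 1000000000)
        n += 1;
    return n;
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int result = count();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("counted to %d in %.9f seconds\n", result, seconds);
    return 0;
}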

Next I converted the C program to Intel-syntax assembly with GCC on the Godbolt compiler explorer (https://godbolt.org), which also strips the .cfi directives:

.file   "Svx.c"
.intel_syntax noprefix
.text
.globl  main
.type   main, @function
main:
    push    rbp
    mov rbp, rsp
    mov DWORD PTR -4[rbp], 0
    jmp .L2
.L3:
    add DWORD PTR -4[rbp], 1
.L2:
    cmp DWORD PTR -4[rbp], 1000000000
    jne .L3
    mov eax, DWORD PTR -4[rbp]
    pop rbp
    ret
    .size   main, .-main
    .ident  "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
    .section    .note.GNU-stack,"",@progbits

I assembled this with GCC and timed it. The result was 1.96 seconds, vastly slower than the Clang version.
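
The build step for this one was presumably along these lines (the file name and exact invocation are assumptions; any equivalent command works, since GCC will assemble a .s file and link it):

gcc Svx.s -o Svx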

Finally, I created my own assembly version:

[BITS 64]
[default rel]

global main:function

section .data align=16

section .text

main:
    xor rax, rax            ; n = 0
l_01:
    cmp rax, 1000000000     ; reached one billion?
    je l_02
    add rax, 5              ; count by 5, not 1
    jmp l_01
l_02:
    ret

I assembled it with NASM and linked it with ld:

sudo nasm -felf64 Svx.asm

sudo ld -shared Svx.o -o Svx.so

and timed it. It ran in 0.14707629615440965 seconds.

Why does the C version run so fast (0.0000047 seconds) when the same program round-tripped through assembly runs vastly slower (1.96 seconds), and my NASM version takes 0.147 seconds? I have a feeling that the 0.0000047-second result for the C version is wrong; it seems impossibly fast. Here is Clang's assembly output for it:

    .text
    .intel_syntax noprefix
    .file   "Svx.c"
    .globl  main                     # -- Begin function main
    .p2align    4, 0x90
    .type   main,@function
main:                                    # @main
    .cfi_startproc
# %bb.0:
    push    rbp
    .cfi_def_cfa_offset 16
    .cfi_offset rbp, -16
    mov rbp, rsp
    .cfi_def_cfa_register rbp
    mov dword ptr [rbp - 4], 0
.LBB0_1:                                # =>This Inner Loop Header:     Depth=1
    cmp dword ptr [rbp - 4], 1000000000
    je  .LBB0_3
# %bb.2:                                #   in Loop: Header=BB0_1   Depth=1
    mov eax, dword ptr [rbp - 4]
    add eax, 1
    mov dword ptr [rbp - 4], eax
    jmp .LBB0_1
.LBB0_3:
    mov eax, dword ptr [rbp - 4]
    pop rbp
    .cfi_def_cfa rsp, 8
    ret
.Lfunc_end0:
    .size   main, .Lfunc_end0-main
    .cfi_endproc
                                        # -- End function
    .ident  "clang version 8.0.0-3~ubuntu18.04.1 (tags/RELEASE_800/final)"
    .section    ".note.GNU-stack","",@progbits
    .addrsig

The listing shows the variable is kept on the stack rather than in a register, which is (usually) slower.

A time of 0.0000047 seconds seems impossibly fast for counting to a billion. If that measurement is correct, what's its secret? Inspecting the assembly doesn't reveal anything, and in fact the Godbolt version is much slower.

  • Pretty sure `clang` is optimizing that useless loop and just doing `return 1000000000` – Marco Bonelli Sep 22 '19 at 15:26
  • In other words, Clang sees that it's just counting to a billion and skips the counting step and loops out reporting the final expected number? – RTC222 Sep 22 '19 at 15:27
  • Exactly. That's a pretty common optimization. You can see the proof [here](https://godbolt.org/z/ANK8Xj). – Marco Bonelli Sep 22 '19 at 15:27
  • It was intended as a speed test, so I should compile it without optimizations so it will loop all the way to a billion. I think that's my mistake -- I used -O3 optimization but that nullified the loop and the speed test will not be done. – RTC222 Sep 22 '19 at 15:30
  • To defeat the loop optimization and force the looping you can mark `n` as volatile like `volatile int n = 0;` – Michael Petch Sep 22 '19 at 15:34
  • In addition to no optimization flag. Your idea could also help. – RTC222 Sep 22 '19 at 15:37
  • No, never benchmark without optimization. Unoptimized code can do crazy things and the timing result is not usable. – phuclv Sep 22 '19 at 15:54
  • Except that if the goal is to time counting to a billion to test speed, then we don't want optimizations. Just the simple loop. In all other cases we want to optimize. – RTC222 Sep 22 '19 at 15:59

2 Answers


Clang realizes that this loop runs 1000000000 times and emits the equivalent of `return 1000000000;`.

My output with -O3, which you mentioned (in the comments) you're using:

        .text
        .file   "test.c"
        .globl  main                    # -- Begin function main
        .p2align        4, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# %bb.0:
        movl    $1000000000, %eax       # imm = 0x3B9ACA00
        retq
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc
                                        # -- End function

        .ident  "clang version 8.0.0 (tags/RELEASE_800/final)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig

Notice the contents of main:

# %bb.0:
        movl    $1000000000, %eax       # imm = 0x3B9ACA00
        retq

This removes the loop entirely and just returns 1000000000.
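
You can reproduce this yourself; assuming clang 8 as in the .ident line above, something like:

clang -O3 -S test.c

and then inspect test.s.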

A trick to get around this is to use volatile:

int main(void)
{
    volatile int n = 0;

    while(n != 1000000000) {
        n += 1;
    }

    return n;
}

Output (again with -O3):

        .text
        .file   "test.c"
        .globl  main                    # -- Begin function main
        .p2align        4, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# %bb.0:
        movl    $0, -4(%rsp)
        movl    -4(%rsp), %ecx
        movl    -4(%rsp), %eax
        cmpl    $1000000000, %ecx       # imm = 0x3B9ACA00
        je      .LBB0_3
        .p2align        4, 0x90
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        addl    $1, %eax
        movl    %eax, -4(%rsp)
        movl    -4(%rsp), %ecx
        movl    -4(%rsp), %eax
        cmpl    $1000000000, %ecx       # imm = 0x3B9ACA00
        jne     .LBB0_1
.LBB0_3:
        retq
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc
                                        # -- End function

        .ident  "clang version 8.0.0 (tags/RELEASE_800/final)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig
– S.S. Anne
  • What confused me is, in the assembly output from Clang (shown above), LBB0_1: cmp dword ptr [rbp - 4], 1000000000; je .LBB0_3 - is testing against a billion. So it didn't appear to be just looping out. – RTC222 Sep 22 '19 at 15:41
  • @RTC222 The conditions in assembly code are often reversed from the C code. This jumps ahead to the return if `n` is a billion and otherwise executes the loop contents. – S.S. Anne Sep 22 '19 at 15:42
  • The OP is using Intel syntax; it's probably clearer if you do the same in your answer. (I think `-masm=intel` works for clang, or just use Godbolt like a normal person to remove the noise of extra directives.) – Peter Cordes Sep 22 '19 at 19:56

You have 3 cases:

  • C with optimization enabled: replace the loop with its result: mov eax, 1000000000 / ret. The time you measured is all overhead. The compiler can easily prove that n must have that value for the loop to exit, and that the loop is not infinite.
  • C with optimization disabled: keep C variables in memory, including the loop counter. The loop bottlenecks on store-forwarding latency at about 6 cycles per iteration on modern x86 CPUs. (https://agner.org/optimize/)
  • Hand-written asm that loops (inefficiently) but does at least keep values in registers. So the loop-carried dependency chain (incrementing n) only has 1 cycle latency.

    And since you use add rax,5, you're only doing 1/5th of the iterations of an n++ loop. You could think of this as unrolling by 5 and then optimizing 5x n++ into n+=5 (see the C sketch just after this list). You can make this factor as large as you want and make the running time arbitrarily small, until you get to mov eax, 1000000000 like the compiler did.
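
In C terms, the add rax,5 loop corresponds to something like this (a sketch of the unrolling argument, not the OP's actual code; note it only terminates because 1000000000 is divisible by 5, so the != test is still hit exactly):

/* The add rax,5 loop, viewed as an n++ loop unrolled by 5.
   Safe only because 1000000000 % 5 == 0, so n lands exactly
   on the != exit condition. */
int main(void)
{
    long n = 0;                 /* mirrors rax */
    while (n != 1000000000) {
        /* five n += 1 steps, folded into one add: */
        n += 5;
    }
    return (int)n;
}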

See the first two cases on Godbolt, where I used clang -O3 and gcc -O0. Note that int n is a 32-bit variable in standard x86-64 ABIs, so there's no need to spend extra code size (REX prefixes) on 64-bit operand-size.

See Why are loops always compiled into "do...while" style (tail jump)? for why efficient simple loops have the conditional branch at the bottom and no unconditional branch. Note that this is what gcc does even at -O0 (entering the loop with a jmp to the loop condition at the bottom).
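
In C terms, that loop rotation looks something like this (a sketch, not actual compiler output; at -O0 gcc instead enters with a jump straight to the bottom-of-loop test):

/* The while loop rewritten in "do...while" style: the condition
   is tested once up front (provably true here), then the body
   runs with a single conditional branch at the bottom. */
int main(void)
{
    int n = 0;
    if (n != 1000000000) {
        do {
            n += 1;
        } while (n != 1000000000);
    }
    return n;
}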

Clang makes even more naive code at -O0, with the same structure as the C source: a break condition at the top and an unconditional jmp at the bottom, like your hand-written asm.


So your asm should be about 6 * 5 = 30 times faster than the anti-optimized C compiler output, or half that (15x) if your CPU doesn't manage to run the NASM loop at 1 clock cycle per iteration. In practice you measured a factor of 1.96 / 0.147 ≈ 13.3, which is pretty close to 15.

So you probably have an Intel CPU from before Haswell, or an AMD from before Ryzen. More recent CPUs have a throughput of 2 branches per clock if at least one of them is not-taken.

Or you have a Skylake (loop buffer disabled by a microcode update), and some front-end effect (like the loop being split across a 64-byte boundary) stopped it from issuing at 1 iteration per clock because it couldn't read from the uop cache fast enough.

– Peter Cordes
  • As this is a test of counting by 1s to 1 billion, I eliminated the loop unrolling (add rax,5 is now add rax,1) so my speed is 0.35 versus 1.87 for unoptimized Clang, or 5.3 times faster (instead of the 12.5 times faster with loop unrolling) due to storing the variables in registers. – RTC222 Sep 22 '19 at 21:26
  • @RTC222: Agner Fog lists memory-destination `add` as 6 cycle latency on Sandybridge. So that's about right for store-forwarding latency vs. 1 cycle for register destination. I assume you also optimized your loop condition so your NASM version can run at 1 clock per iteration. You're measuring wall-clock time not cycles, so there's probably some turbo vs. idle clock effects for the relatively short 0.35 second version. – Peter Cordes Sep 22 '19 at 21:44