
I was running some tests to compare C to Java and ran into something interesting. Running the exact same benchmark code, compiled at optimization level 1 (-O1), in a function called by main rather than in main itself resulted in roughly double the performance. I'm printing out sizeof(size_t) to verify beyond any doubt that the code is being compiled as x64.

I sent the executables to my friend who's running an i7-7700HQ and got similar results. I'm running an i7-6700.


Here's the slower code:

#include <stdio.h>
#include <time.h>
#include <stdint.h>

int main() {
    printf("Size = %I64u\n", sizeof(size_t));
    int start = clock();
    for(int64_t i = 0; i < 10000000000L; i++) {
        
    }
    printf("%ld\n", clock() - start);
    return 0;
}

And the faster:

#include <stdio.h>
#include <time.h>
#include <stdint.h>

void test() {
    printf("Size = %I64u\n", sizeof(size_t));
    int start = clock();
    for(int64_t i = 0; i < 10000000000L; i++) {
        
    }
    printf("%ld\n", clock() - start);
}

int main() {
    test();
    return 0;
}

I'll also provide the assembly code for you to dig into. I don't know assembly. Slower:

    .file   "dummy.c"
    .text
    .def    __main; .scl    2;  .type   32; .endef
    .section .rdata,"dr"
.LC0:
    .ascii "Size = %I64u\12\0"
.LC1:
    .ascii "%ld\12\0"
    .text
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
    pushq   %rbx
    .seh_pushreg    %rbx
    subq    $32, %rsp
    .seh_stackalloc 32
    .seh_endprologue
    call    __main
    movl    $8, %edx
    leaq    .LC0(%rip), %rcx
    call    printf
    call    clock
    movl    %eax, %ebx
    movabsq $10000000000, %rax
.L2:
    subq    $1, %rax
    jne .L2
    call    clock
    subl    %ebx, %eax
    movl    %eax, %edx
    leaq    .LC1(%rip), %rcx
    call    printf
    movl    $0, %eax
    addq    $32, %rsp
    popq    %rbx
    ret
    .seh_endproc
    .ident  "GCC: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0"
    .def    printf; .scl    2;  .type   32; .endef
    .def    clock;  .scl    2;  .type   32; .endef

Faster:

    .file   "dummy.c"
    .text
    .section .rdata,"dr"
.LC0:
    .ascii "Size = %I64u\12\0"
.LC1:
    .ascii "%ld\12\0"
    .text
    .globl  test
    .def    test;   .scl    2;  .type   32; .endef
    .seh_proc   test
test:
    pushq   %rbx
    .seh_pushreg    %rbx
    subq    $32, %rsp
    .seh_stackalloc 32
    .seh_endprologue
    movl    $8, %edx
    leaq    .LC0(%rip), %rcx
    call    printf
    call    clock
    movl    %eax, %ebx
    movabsq $10000000000, %rax
.L2:
    subq    $1, %rax
    jne .L2
    call    clock
    subl    %ebx, %eax
    movl    %eax, %edx
    leaq    .LC1(%rip), %rcx
    call    printf
    nop
    addq    $32, %rsp
    popq    %rbx
    ret
    .seh_endproc
    .def    __main; .scl    2;  .type   32; .endef
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
    subq    $40, %rsp
    .seh_stackalloc 40
    .seh_endprologue
    call    __main
    call    test
    movl    $0, %eax
    addq    $40, %rsp
    ret
    .seh_endproc
    .ident  "GCC: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0"
    .def    printf; .scl    2;  .type   32; .endef
    .def    clock;  .scl    2;  .type   32; .endef

Here's my batch script for compilation:

@echo off
set /p file= File to compile: 
del compiled.exe
gcc -Wall -Wextra -std=c17 -O1 -o compiled.exe %file%.c
compiled.exe
PAUSE

And for compilation to assembly:

@echo off
set /p file= File to compile: 
del %file%.s
gcc -S -Wall -Wextra -std=c17 -O1 %file%.c
PAUSE
Anonymous Noob
  • I don't believe it. I can't see how an empty loop is not optimized out. – SergeyA Jun 07 '21 at 19:54
  • Did you run the tests in random order, several times, discarding outliers? – pmg Jun 07 '21 at 19:55
  • @SergeyA It did get optimized out if I used anything higher than -O1, which is why I used -O1 and not -O3. – Anonymous Noob Jun 07 '21 at 19:58
  • @pmg I ran both tests multiple times with a few minutes in between and with relatively arbitrary/random order whilst discussing this strangeness with my friends. There was some variation, of course, but not very much. The general trend was that the faster code was roughly twice as fast as the slower code. I also tried doubling the maximum value from 10bil to 20bil to the same effect. It's also worth noting that the results were on the order of seconds, so it's not like tiny amounts of overhead or random delays could've caused this. – Anonymous Noob Jun 07 '21 at 19:58
  • It could be due to the address of the branch (which is different in both versions, as a random side effect) combined with [the fix for the JCC erratum](https://stackoverflow.com/q/62305998/555045); if you provide the address of `L2` in both cases then that can be checked – harold Jun 07 '21 at 20:10
  • @harold How would you like me to go about doing that? I'd be happy to provide you with any info you need... I just don't know how. – Anonymous Noob Jun 07 '21 at 20:14
  • You should be able to see the address in a debugger; your choice of tooling confuses me slightly (seems like you're using GCC on Windows? so maybe you don't use Visual Studio as your debugger?) but any of them should be able to tell you this information. Btw ASLR doesn't matter, the alignment relevant for the fix of the JCC erratum is 32 so not affected by ASLR. – harold Jun 07 '21 at 20:24
  • @harold I don't code in C very often... I'm using VS Code and I don't really have a debugger. What's ASLR? – Anonymous Noob Jun 07 '21 at 20:26
  • IDK how to set up debugging in VS Code but surely it is possible somehow? Or I suppose you could use x64dbg. Anyway ASLR randomizes addresses somewhat, so I mentioned that to avoid you getting worried about that, it shouldn't affect this question. – harold Jun 07 '21 at 20:30
  • @AnonymousNoob Probably you ran this program when your computer had a different load. This code will give you almost exactly the same results. Your results make no sense, so I believe that your experiment methodology is invalid. – 0___________ Jun 07 '21 at 20:36
  • I can't reproduce it – 0___________ Jun 07 '21 at 20:37
  • @0___________ I don't want to conclude yet that it is the JCC erratum thing, but *if* it is then it's also difficult to reproduce, because it relies on exact code placement (which your compiler won't necessarily reproduce) and on running it on a particular family of processors – harold Jun 07 '21 at 20:46
  • @harold It will not **double** execution time. It will affect both versions the same. – 0___________ Jun 07 '21 at 20:51
  • @0___________ I sent the executables to my friend who's running an i7-7700HQ and got similar results. I'm running an i7-6700. – Anonymous Noob Jun 07 '21 at 20:53
  • I am running on an i9-something. Maybe you got a virus which affects the execution :) – 0___________ Jun 07 '21 at 20:59
  • @0___________ There's no a priori reason why both versions should be affected the same. It really comes down to where exactly those branches are. If they're nowhere near a 32-byte boundary, sure, it won't matter. But the addresses of those branches are for sure not *equal* in both versions (different offset from the starting label), and if one of them is near a 32-byte boundary in one of the bad ways and the other is in the clear, then that's a big difference. – harold Jun 07 '21 at 21:00
  • @harold I don't really know what I'm looking at here... It's a lot more complicated than Java debuggers I've used! Hopefully this helps. This is the fast version: https://i.imgur.com/UWrGVVw.png And the slow version: https://i.imgur.com/e7RJi28.png – Anonymous Noob Jun 07 '21 at 21:00
  • I'm running an i7-920 and I do _not_ get your results. I'm running under Linux, so I'd suggest that you boot Linux and see if you get similar results on the same processor/system. That will tell you something. (You could probably get the results by booting a shell off a "live" USB installer if you don't want to install to your hard disk(s).) I get a max variation of 100,000 clocks over a few hundred trials, nowhere near 2x. And I take the _minimum_ time of N trials for each program (usually 3-10 trials) to eliminate cache variations and swappiness. – Craig Estey Jun 07 '21 at 21:01
  • @AnonymousNoob please press F9 once, to skip out of this system code into your own code. – harold Jun 07 '21 at 21:02
  • @harold Fast: https://i.imgur.com/ddlUwLk.png Slow: https://i.imgur.com/wdZQS2Q.png (ayy figured out dark theme) – Anonymous Noob Jun 07 '21 at 21:06
  • @AnonymousNoob OK that wasn't it either, closer I suppose, but the loop is not in sight, you could look for it but if you somehow give me the executable I'd be willing to take a look (E: btw I won't be able to reproduce it, my CPU is from the Haswell family so not affected by the JCC erratum, but I can find out if the branch could be affected by that) – harold Jun 07 '21 at 21:08
  • @harold Sure. Slow: https://www.file.io/download/NEYMbrxFIivX Fast: https://file.io/mnFha7kqY72n Thank you for taking the time to help me try to understand this. My friend and I will be very happy once we get to the bottom of this mystery. – Anonymous Noob Jun 07 '21 at 21:13
  • @harold (This is in another comment to make sure you see it) I added a getchar() to the end of both executables because I was sending them to my friend and wanted him to just be able to run them without any trouble. It doesn't seem to affect the results. – Anonymous Noob Jun 07 '21 at 21:18
  • @0___________ I did get my answer... but you had no reason to invalidate my methodology just because my results didn't make sense. Apparently, they DID make sense. :) – Anonymous Noob Jun 07 '21 at 21:24
  • @0___________: *Maybe you got a virus which affects the execution* - Actually a microcode update from Intel which introduces a performance pothole to work around a bug they missed while testing their CPUs. Quality control in Skylake [was worse than in previous generations](https://www.extremetech.com/computing/312181-ex-intel-engineer-claims-skylake-qa-problems-drove-apple-away), and there have been at least a couple of new performance potholes introduced by microcode updates since release. (Disabling the LSD, and the JCC erratum workaround, are the most notable.) – Peter Cordes Jun 08 '21 at 00:47
  • And Ice Lake recently disabled mov-elimination to work around another design bug. :( Anyway, it's normal for weird perf effects in very small-scale microbenchmarks to be specific to a CPU version. @0___________, if you'd said which i9 version you tested on, your comment wouldn't have been totally useless. – Peter Cordes Jun 08 '21 at 00:48

1 Answer


The slow version:

[screenshot: disassembly of the slow version, showing the loop's sub/jne pair straddling a ...80 address]

Note that the sub rax, 1 / jne pair goes right across the ...80 boundary (which is a 32-byte boundary). This is one of the cases mentioned in Intel's document regarding this issue, namely the one shown in this diagram:

[diagram from Intel's JCC-erratum document showing the affected instruction placements]

So this op/branch pair is affected by the fix for the JCC erratum (which would cause it to not be cached in the µop cache). I'm not sure if that is the whole reason (there are other things at play too), but it's a factor.
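
If you want to check your own build for this: my reading of Intel's mitigation guidance is that a (macro-fused) op/jcc pair is affected when its bytes cross a 32-byte boundary or end exactly on one; treat that exact condition as an assumption on my part. Here is a minimal C sketch of the check, using made-up offsets (0x7c and 0x60) purely for illustration; the real addresses have to come from a disassembly of your own binary:

#include <stdint.h>
#include <stdio.h>

/* Returns 1 if an instruction (or macro-fused pair) occupying the bytes
   [start, start + len) crosses a 32-byte boundary or ends on one, which is
   the affected case as I understand Intel's JCC-erratum description. */
static int jcc_affected(uint64_t start, uint64_t len) {
    uint64_t last = start + len - 1;             /* address of the last byte */
    int crosses = (start / 32) != (last / 32);   /* straddles a 32-byte chunk */
    int ends_on = ((last + 1) % 32) == 0;        /* next insn starts a new chunk */
    return crosses || ends_on;
}

int main(void) {
    /* The fused "sub rax, 1 / jne" pair is 6 bytes (4-byte sub + 2-byte short
       jne). An offset of 0x7c crosses the 0x80 boundary; 0x60 is well clear.
       These offsets are illustrative, not the actual addresses from the
       question's binaries. */
    printf("pair at 0x7c: %s\n", jcc_affected(0x7c, 6) ? "affected" : "clear");
    printf("pair at 0x60: %s\n", jcc_affected(0x60, 6) ? "affected" : "clear");
    return 0;
}

Feeding it the real addresses of the pair in each executable (taken from a disassembler, as in the screenshots above) tells you which binary lands in the bad case.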

In the fast version, the branch is not "touching" a 32-byte boundary, so it is not affected.

[screenshot: disassembly of the fast version, with the loop's sub/jne pair clear of any 32-byte boundary]

There may be other effects that apply. Still, due to crossing a 32-byte boundary, in the slow case the loop is spread across two chunks in the µop cache; even without the fix for the JCC erratum, that may cause it to run at 2 cycles per iteration if the loop cannot execute from the Loop Stream Detector (which is disabled on some processors by another fix for another erratum, SKL150). See e.g. this answer about loop performance.
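
To put rough numbers on the 2x: the countdown loop is a single macro-fused sub/jne, so in the good case it can run at about one iteration per cycle, while being limited to one µop-cache line fetch per cycle in the bad case halves that to about one iteration every two cycles. Assuming the core runs somewhere near 4 GHz (an assumption about the i7-6700 under turbo, not a measured clock), the 10 billion iterations work out to roughly 10^10 / (4×10^9) ≈ 2.5 s versus about 5 s, which is consistent with both the "order of seconds" timings and the roughly 2x gap reported in the question.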

To address the various comments saying they cannot reproduce this: yes, there are several ways that could happen:

  • Whichever effect was responsible for the slowdown, it was likely caused by the exact placement of the op/branch pair across a 32-byte boundary, which happened by pure accident. Compiling from source is unlikely to reproduce the same circumstances unless you use the same compiler with the same setup as was used by the original poster. (See the sketch after this list for one way to take that placement out of the equation.)
  • Even using the same binary, regardless of which of the effects is responsible, the weird effect would only happen on particular processors.
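
One way to take code placement out of the picture is to write the countdown loop in inline assembly and align its top to 32 bytes, so the fused sub/jne pair can never straddle or end on a 32-byte boundary. This is only a sketch of an experiment, assuming GCC extended inline asm (which the MinGW-w64 GCC from the question supports), not something from the original post; if the roughly 2x difference disappears with this version, that points strongly at a placement effect rather than at the surrounding C code:

#include <stdio.h>
#include <time.h>
#include <stdint.h>

int main(void) {
    int64_t n = 10000000000LL;
    clock_t start = clock();
    /* Hand-written countdown loop. ".p2align 5" pads so the loop top lands on
       a 32-byte boundary, keeping the 6-byte fused sub/jne pair well clear of
       the next boundary no matter where the compiler places this function. */
    __asm__ volatile(
        ".p2align 5\n\t"
        "1:\n\t"
        "subq $1, %0\n\t"
        "jne 1b"
        : "+r"(n)
        :
        : "cc");
    printf("%ld\n", (long)(clock() - start));
    return 0;
}

Compile it exactly like the originals (gcc -Wall -Wextra -std=c17 -O1) and run it in place of both versions. Newer toolchains can achieve much the same thing for ordinary compiled loops with the assembler option -Wa,-mbranches-within-32B-boundaries (added in binutils 2.34, I believe, so likely newer than the MinGW-w64 8.1.0 bundle used in the question).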
harold
  • Fun fact: even without the JCC erratum, tiny loops of more than 1 uop that get split across 2 cache lines can run at half speed on Skylake-family CPUs (which have their loop buffer disabled by [a microcode update in SKL](https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#.C2.B5OP-Fusion_.26_LSD), and in initial microcode on later CPUs until Ice Lake). Because the uop cache can only fetch from 1 line per cycle, it has to alternate between lines to fetch the whole loop body. (Oh, you already mentioned that effect.) – Peter Cordes Jun 08 '21 at 00:38
  • I think even in this sub/jcc loop (which can macro-fuse into a single uop), it could actually fail to macro-fuse and end up as 2 uops in 2 separate cache lines even without the JCC erratum. IIRC, macro-fusion doesn't work when split perfectly across an I-cache line boundary (64 bytes). But on CPUs with a working LSD, those 2 uops can still be "unrolled" and issued in groups of 4. ([On Haswell and later, or at least in groups of 2 on SnB](https://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of)) – Peter Cordes Jun 08 '21 at 00:42
  • Note that similar effects apply to AMD Ryzen processors: instruction decoding is slower when a small loop is split across two cache lines, resulting in an overall slow loop (due to the instruction cache). – Jérôme Richard Jun 08 '21 at 12:14