4

I have been playing around with the Godbolt Compiler and typed this code:

constexpr int func(int x)
{
    return x > 3 ? x * 2 : (x < -4 ? x - 4 : x / 2);
}

int main(int argc)
{
    return func(argc);
}

The code is somewhat straight forward. The important part here is the final division by 2 inside func(int x). Since x is an integer, basically any compiler would simplify this to a shift to avoid division instructions.

The assembly from x86-64 gcc 12.2 -O3 (for Linux, thus System V ABI) looks like this:

main:
        cmp     edi, 3
        jle     .L2
        lea     eax, [rdi+rdi]
        ret
.L2:
        cmp     edi, -4
        jge     .L4
        lea     eax, [rdi-4]
        ret
.L4:
        mov     eax, edi
        mov     ecx, 2
        cdq
        idiv    ecx
        ret

You can see the final idiv ecx command which is not a shift but an actual division by 2. I also tested clang and clang does actually reduce this to a shift.

main:                                   # @main
        mov     eax, edi
        cmp     edi, 4
        jl      .LBB0_2
        add     eax, eax
        ret
.LBB0_2:
        cmp     eax, -5
        jg      .LBB0_4
        add     eax, -4
        ret
.LBB0_4:
        mov     ecx, eax
        shr     cl, 7
        add     cl, al
        sar     cl
        movsx   eax, cl
        ret

Could this be due to inlining maybe? I am very curious about whats going on here.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Finn Eggers
  • 857
  • 8
  • 21
  • 5
    it gets weirder, https://godbolt.org/z/Efs18TYnM, appears to matter if it's main or not. – Sopel Sep 22 '22 at 20:52
  • 2
    A shift doesn't do the job when the input is negative (this can be fixed, but the fix comes at a price that may make the compiler decide it isn't worth it). Why do you think `constexpr` is related? This evaluation cannot be done at compile-time because the input is a variable. – Ben Voigt Sep 22 '22 at 21:00
  • 1
    @Sopel: The compiler *should* know that the `argc` argument of `main()` cannot be negative and eliminate the `< -4` branch entirely – Ben Voigt Sep 22 '22 at 21:01
  • x is a `int`, which is represented using two's complement. shifting a negative number may result in incorrect results. – ichramm Sep 22 '22 at 21:02
  • 1
    @Finn also please don't require readers to figure out what architecture this is based on the presence of certain specific instructions and register names. Just state what architecture the assembler code is for, up front. – Ben Voigt Sep 22 '22 at 21:03
  • @BenVoigt I am sorry, this is for x86-x64. You said once that the compiler should know its positive. why are we talking about negative numbers then? – Finn Eggers Sep 22 '22 at 21:09
  • Well you wrote code that compared the value to `-4`, and that comparison is definitely being done, so the compiler clearly is not taking advantage of the fact `argc` can't be negative. – Ben Voigt Sep 22 '22 at 21:24
  • 1
    @BenVoigt "The compiler should know that the argc argument of main() cannot be negative" I doubt this optimization would worse time implementing it. – Slava Sep 22 '22 at 21:42
  • @Slava: I agree, that kind of peephole optimization has very poor cost-benefit analysis. I was just offering it as a reason that `main()` vs "not-`main`" could make a difference. – Ben Voigt Sep 22 '22 at 21:44
  • Why do you think that division is slower than shift here? Did you measure? – Slava Sep 22 '22 at 21:54
  • "Since x is an integer, basically any compiler would simplify this to a shift to avoid division instructions." Looks like gcc writers have different opinion, why do you think you are correct, not them? – Slava Sep 22 '22 at 21:57
  • @Slava: Division *is* slower than shift. Division may or may not be slower than "shift and adjust the result in case the input is negative", especially if the latter is longer machine code, less likely to fit in instruction cache, or messes up the alignment of other instructions. – Ben Voigt Sep 22 '22 at 22:26
  • @BenVoigt right, if we change argument to `unsigned` then gcc will generate shift. In case of signed int it may generate shift may not. So question should be changed to either "why compiler does not detect that `argc` is always positive (neither gcc nor clang do IMHO) or if shift is faster in case of signed integer". In current form it is ambiguous and and unasnwerable as it is based on statement that is questionable. – Slava Sep 22 '22 at 22:29
  • @BenVoigt: Sopel's comment is the key to the answer: GCC optimizes `main` less than other functions. This is a somewhat-known fact. Division is definitely slower than a fixup for `sar`, it's just a size/speed tradeoff. (One where GCC's heuristics are maybe a bit too eager to use division.) On some CPUs maybe not tremendously slower, especially for small dividends, but on other CPUs, especially for 64-bit integers, it's a total disaster. (e.g. Intel before Ice Lake.) – Peter Cordes Sep 23 '22 at 01:18

1 Answers1

15

GCC treats main specially: implicit __attribute__((cold))

So main gets optimized less (or favouring size over speed), as it's usually only called once in most programs. __attribute__((cold)) isn't quite the same as -Os (optimize for size), but it's a step in that direction which sometimes gets the cost heuristics to pick a naive division instruction.

As GCC dev Marc Glisse commented, don't put your code in a function called main if you're benchmarking it or looking at how it optimizes. (There can be other special things besides cold, e.g. MinGW GCC puts in an extra call to an init function, and gcc -m32 adds code to align the stack by 16. All of which are noise you don't want for the code you're looking at. See also How to remove "noise" from GCC/clang assembly output?)

Another Q&A shows GCC putting main in the .text.startup section, along with other presumed "cold" functions. (This is good for TLB and paging locality; hopefully a whole page of init functions can get evicted after process startup. The idea is that code in main probably only runs once, with real work happening inside some function it calls. This may not be true if real work inlines into main, or for simple programs.)

This is a bad heuristic for toy programs with all their code in main, but it's what GCC does. Most real programs people run regularly aren't toys and have enough code in some other function that it doesn't inline into main. Although it would be nice if the heuristic was a little smarter and removed the cold if it turned out that the whole program or all the functions in a loop did optimize into main, since some real programs are pretty simple.

You can override the heuristic with a GNU C function attribute.

  • __attribute__((hot)) int main(){ ... optimizes the way you expect
    (Godbolt from Sopel's comment, with attribute added).
  • __attribute__((cold)) on a function not called main produces idiv.
  • __attribute__((optimize("O3"))) doesn't help.

int main(int x, char **y){ return x/2; } does still use shifts with gcc -O2, so main being cold doesn't always have that effect (unlike -Os).

But perhaps with your division already being conditional, GCC guesses that basic block doesn't even run every time so that's more reason to make it small instead of fast.


Insanely, GCC -Os for x86-64 (Godbolt) does use idiv for signed division a constant 2, not just for arbitrary constants (where GCC normally uses a multiplicative inverse even at -O0). It doesn't save much if any code size vs. an arithmetic right shift with a fixup to round towards zero (instead of -inf), and can be much slower, especially for 64-bit integers on Intel before Ice Lake. Same for AArch64, where it's 2 fixed-size instructions either way, with sdiv almost certainly being much slower.

sdiv does save some code size on AArch64 for higher powers of 2 (Godbolt), but still so much slower that it's probably not a good tradeoff for -Os. idiv doesn't save instructions on x86-64 (as cdq or cqo into RDX is required), although maybe a couple bytes of code size. So probably only good for -Oz where it would also use push 2 / pop rcx to get a small constant into a register in 3 bytes of x86-64 machine code instead of 5.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Thats a very good explanation and thanks for the insight. I didnt know that main is being optimised less. I guess there is a lot of stuff going on under the hood which seems very counterintuitive if one is not familiar with this. Ill accept this answer :) – Finn Eggers Sep 23 '22 at 06:00