
Suppose there is a recursive function foo(), similar to the Fibonacci sequence:

constexpr inline int foo(const int i)
{
    if(i == 0) 
        return 1;
    return foo(i - 1) + foo(i - 2);
}

int main(int argc, const char* argv[])
{
    const int ret = foo(argc);
    return ret;
}

Then I used GCC on https://gcc.godbolt.org/z/fETb6Px33 to compile it under different optimization levels, like -Os -std=c++2b -fno-exceptions. I wonder why the O2/O3 version is so much bigger than the O1/Os version, like the monster Godzilla. Apparently some critical optimization triggers the inflation of the binary size; maybe O3 tries to boost the execution speed at runtime at the cost of size.

O0

foo(int):
        push    rbp
        mov     rbp, rsp
        push    rbx
        sub     rsp, 24
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 0
        jne     .L2
        mov     eax, 1
        jmp     .L3
.L2:
        mov     eax, DWORD PTR [rbp-20]
        sub     eax, 1
        mov     edi, eax
        call    foo(int)
        mov     ebx, eax
        mov     eax, DWORD PTR [rbp-20]
        sub     eax, 2
        mov     edi, eax
        call    foo(int)
        add     eax, ebx
.L3:
        mov     rbx, QWORD PTR [rbp-8]
        leave
        ret
main:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        mov     DWORD PTR [rbp-20], edi
        mov     QWORD PTR [rbp-32], rsi
        mov     eax, DWORD PTR [rbp-20]
        mov     edi, eax
        call    foo(int)
        mov     DWORD PTR [rbp-4], eax
        mov     eax, DWORD PTR [rbp-4]
        leave
        ret

O3

foo(int):
        test    edi, edi
        je      .L44
        push    r15
        push    r14
        xor     r14d, r14d
        push    r13
        push    r12
        lea     r12d, [rdi-1]
        push    rbp
        mov     ebp, r12d
        push    rbx
        sub     rsp, 56
.L7:
        test    ebp, ebp
        je      .L45
.L3:
        lea     r13d, [rbp-1]
        xor     ebx, ebx
        mov     eax, r14d
        mov     r15d, ebp
.L10:

        lea     ebx, [r15-1]
 .... // much much more to follow

Os

foo(int):
        push    rbp
        xor     ebp, ebp
        push    rbx
        mov     ebx, edi
        push    rcx
.L3:
        test    ebx, ebx
        je      .L5
        lea     edi, [rbx-1]
        sub     ebx, 2
        call    foo(int)
        add     ebp, eax
        jmp     .L3
.L5:
        lea     eax, [rbp+1]
        pop     rdx
        pop     rbx
        pop     rbp
        ret
main:
        jmp     foo(int)
Peter Cordes
Phadyer
    *maybe the O3 tried to boost the executive speed in runtime at the cost of size.* That happens all the time. One main example of this is loop unrolling. It takes up more space, but you have fewer branches and increments. – NathanOliver Feb 28 '23 at 15:20
  • 1
    There are many optimizations which trade memory (executable size) for speed. For example, fewer branches in code lead to faster execution, but may require repeating code. – Marek R Feb 28 '23 at 15:21
  • 4
    what is your question? – 463035818_is_not_an_ai Feb 28 '23 at 15:23
  • 2
    Unrelated: Your condition `i == 0` should be `i <= 0` otherwise your code will recurse infinitely (because `foo(1)` will try `foo(0) + foo(-1)`, and so on) – Fareanor Feb 28 '23 at 15:24
  • BTW `inline` has a different meaning than you think. Nowadays it prevents errors about repeated definitions and does not have an impact on performance. – Marek R Feb 28 '23 at 15:24
  • related / duplicate https://stackoverflow.com/questions/19689014/gcc-difference-between-o3-and-os – 463035818_is_not_an_ai Feb 28 '23 at 15:24
  • 1
    `constexpr` functions are always implicitly `inline`. So `constexpr inline` is redundant. – user17732522 Feb 28 '23 at 15:26
  • @Fareanor I don't think it's unrelated. Not at all. I'd expect this to have an impact on optimizations (but compiler explorer is down :/) – 463035818_is_not_an_ai Feb 28 '23 at 15:27
  • @463035818_is_not_a_number You're probably right, I went too quickly into _code only analysis mode_ XD That being said, [compiler explorer is quite buggy](https://godbolt.org/z/TexdqhTnn) these days (around 80% of the available compilers disappeared from the drop-down menu if you open an execution-only window directly) – Fareanor Feb 28 '23 at 15:34

1 Answer


-O3 changes the settings / thresholds for code-size vs. guesstimated "speed" heuristics, making GCC willing to inline more aggressively than at -O2.

Using the Godbolt compiler explorer's tree-dump pane for your code with GCC12.2 -O3, it looks like the "inlining" GCC optimization pass is the one that really inflates the size of its intermediate representation (GIMPLE).

GCC optimizes in terms of GIMPLE, an SSA representation of your program logic. The tree-dump options print some status info of what that optimization pass decided to do, and a C representation of the current state of the program logic, with stuff like extra local variables to represent invented temporaries. There are lots of them, and lots of code, after the inlining pass, vs. before that there wasn't much more code than in the original source.

The inlining status info includes some cost-model evaluations (code-size vs. speed), and also includes things like this which confirm exactly what it's doing:

Performing recursive inlining on constexpr int foo(int)/0

GCC will try hard to turn recursion into iteration, but this function is doubly-recursive so it can only do that for one of the recursions. (It isn't smart enough to optimize it from O(Fib(n)) time to O(n) time by realizing that fib(i-2) last iteration is also fib(i-1) this iteration. Even telling GCC it's a pure function of its inputs, not reading any global state (__attribute__((const))) doesn't help. Changing the base-case to catch i<=1 didn't help. Your version will always cause infinite recursion as one call gets i==0 and the other gets i==-1, leading to eventual UB of INT_MIN-2 if you don't overflow the stack first, but apparently that wasn't what was stopping GCC from optimizing.)

So anyway, GCC is able to turn it into a function that calls itself in a loop, rather than doubly-recursive. And it recursively inlines into that loop, making the loop body large. But with fewer call/ret executions for the same n.


One relevant question is whether this ends up running any faster for large n (like 10k or so, Fib(n) will definitely overflow so prefer unsigned i). i.e. whether this optimization was potentially useful or if GCC just wasted a bunch of code for nothing.

Obviously if you wanted to compute Fib(n) fast, you wouldn't have written this source in the first place, but doubly-recursive functions are normal for stuff like traversing a binary tree. This code works fine to see how GCC handles a doubly-recursive function, though.

Peter Cordes
  • Quite so, I conducted this small experiment just to figure out how exactly the compiler deals with recursive function calls under different opt-levels; please just ignore the wrong limit condition ... – Phadyer Mar 05 '23 at 16:17