Remove needless assembler statements from g++ output

Question

I am investigating some problem with a local binary. I've noticed that g++ creates a lot of ASM output that seems unnecessary to me. Example with -O0:

Derived::Derived():
    pushq   %rbp
    movq    %rsp, %rbp
    subq    $16, %rsp          <--- just need 8 bytes for the movq to -8(%rbp), why -16?
    movq    %rdi, -8(%rbp)
    movq    -8(%rbp), %rax
    movq    %rax, %rdi         <--- now we have moved rdi onto itself.
    call    Base::Base()
    leaq    16+vtable for Derived(%rip), %rdx
    movq    -8(%rbp), %rax     <--- effectively %rdi, does not point into this area of the stack
    movq    %rdx, (%rax)       <--- thus this won't change -8(%rbp)
    movq    -8(%rbp), %rax     <--- so this statement is unnecessary
    movl    $4712, 12(%rax)
    nop
    leave
    ret

With options -O1 -fno-inline -fno-elide-constructors -fno-omit-frame-pointer:

Derived::Derived():
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %rbx
    subq    $8, %rsp       <--- reserve some stack space and never use it.
    movq    %rdi, %rbx
    call    Base::Base()
    leaq    16+vtable for Derived(%rip), %rax
    movq    %rax, (%rbx)
    movl    $4712, 12(%rbx)
    addq    $8, %rsp       <--- release unused stack space.
    popq    %rbx
    popq    %rbp
    ret

This code is the constructor of Derived. It calls the Base constructor, then overwrites the vtable pointer at offset 0 and stores a constant in an int member that Derived holds in addition to what Base contains.
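
The question does not show its C++ source, so the following is only a hypothetical reconstruction of a class pair that fits the description. The names `base_value`, `extra` and `offset_of_extra` are assumptions made for illustration. On x86-64 Linux with the Itanium C++ ABI one would expect `extra` to land at offset 12, which would match the `movl $4712, 12(%rax)` store, but the layout is ABI-dependent.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reconstruction; the question does not show its C++ source.
struct Base {
    Base() : base_value(0) {}
    virtual ~Base() {}        // any virtual function gives Base a vtable pointer
    int base_value;           // assumed member, placed after the vtable pointer
};

struct Derived : Base {
    // Runs Base::Base(), then installs Derived's vtable pointer, then stores 4712.
    Derived() : extra(4712) {}
    int extra;                // the int member Derived adds on top of Base
};

// Byte offset of `extra` from the start of a Derived object (ABI-dependent).
std::size_t offset_of_extra() {
    Derived d;
    assert(d.extra == 4712);
    const char* start = reinterpret_cast<const char*>(&d);
    const char* field = reinterpret_cast<const char*>(&d.extra);
    return static_cast<std::size_t>(field - start);
}
```

Compiling a file like this with `g++ -O0 -S` or `g++ -O1 -S` is one way to reproduce listings of the kind shown above.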

Question:

Can I translate my program with as few optimizations as possible and get rid of such stuff? Which options would I have to set? Or is there a reason the compiler cannot detect these cases with -O0 or -O1 and there is no way around them?
Why is the subq $8, %rsp statement generated at all? You cannot optimize in or out a statement that makes no sense to begin with. Why does the compiler generate it then? The register allocation algorithm should never, even with O0, generate code for something that is not there. So why is it done?

AFAIK, the compiler does this _specifically_ to improve your debugging experience. — Max Langhof, Sep 27 '19 at 09:40
How does a cyclic move without an effect improve the debugging experience. Please elaborate. — hochl, Sep 27 '19 at 09:55
The debugger can look into `-8(%rbp)` to see the value `%rdi` (some local variable?) had, even if `%rdi` is reused later (to hold some other local variable). At least if I'm interpreting the assembly correctly (I'm not used to this syntax). It's also trivial for a debugger to change this value at that point because it is read back again. — Max Langhof, Sep 27 '19 at 09:57
`%rdi` is effectively the pointer to the `Derived` object further up the stack. The function has no local variables (or should not have -- who knows what g++ does internally). — hochl, Sep 27 '19 at 10:24
To improve *C* debugging. If you're reading / debugging the asm, use at least `-Og`. And BTW, your function is a non-static class member function, so it has one implicit arg: `this` in `rdi`. Which g++ spills to the stack because of `-O0`. See [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](//stackoverflow.com/q/53366394) for why. — Peter Cordes, Sep 27 '19 at 10:49
I was deciding whether to answer or close it as a duplicate of something (e.g. about 16-byte stack alignment: that `sub $8, %rsp` is required by the ABI). I'll probably write an answer around the links since it would take a lot of comments to explain why other questions answer this. — Peter Cordes, Sep 27 '19 at 10:52
Please add some in-depth information because that stuff really freaks me out since a long time. Some compiler internals would be nice, in case you know them. Thanks! Funny side note: the question you linked is similar and received the same amount of downvotes. It's no duplicate tho because it's about clang (the problem domain is the same ofc). — hochl, Sep 27 '19 at 10:54
If your question had just been about the `-O0` output, I would have marked it as a duplicate and commented that g++ (and ICC and MSVC) all behave the same as clang as far as intentionally spilling all C/C++ variables between statements. And my answer there explains why. But your question also mixes in some specific claims about instructions you think aren't necessary in the `-O1` output, and asks for a minimum optimization option for non-garbage human-readable asm. — Peter Cordes, Sep 27 '19 at 11:24
Have you tried higher optimisation levels? You specifically enable little (`-O1`) or no (`-O0`) optimisation and then wonder why the code is poorly optimised. — fuz, Sep 27 '19 at 12:01

Paul Evans · Answer 1 · 2019-09-27T10:09:28.947

> is there a reason the compiler cannot detect these cases with -O0 or -O1

Exactly because you're telling the compiler not to. These are optimisation levels that need to be turned off or down for proper debugging. You're also trading off compilation time for run-time.

You're looking through the telescope the wrong way: check out the awesome optimisations that your compiler will do for you when you crank up optimisation.

edited Sep 27 '19 at 10:09

answered Sep 27 '19 at 09:42

so with `-O1` what could the reason for `subq $8, %rsp` possibly be? It's not used, so I would expect the compiler to never generate this statement even with no optimizations. You cannot optimize in or out a statement that has no reason in the first place. – hochl Sep 27 '19 at 09:43
@hochl And the experiment shows your assumption is wrong. What is your question? – YSC Sep 27 '19 at 09:46
i think i have laid out the question pretty clearly. The compiler generates code for something that does not happen. The register allocation algorithm should never, even with O0, generate code for something that is not there. – hochl Sep 27 '19 at 09:52
@hochl You're basically telling the compiler not to bother with no/low optimisation levels. So it doesn't. Turn the levels up and it will start to spot these things that you have and generate better/faster code. It's basically absurd to call the compiler out for not doing things that don't affect correctness at these optimisation levels. – Paul Evans Sep 27 '19 at 09:59
@hochl The compiler's job is cutting down abstractions specified on the abstract machine. But at low optimization levels it will leave more of the abstractions in there, which may simply expose themselves as suboptimal assembly (e.g. keeping "variables" around even after they are no longer needed) - no surprise, because you didn't _ask_ for optimal assembly. At low optimization levels it will also do certain things to make debugging symbols more meaningful and help debuggers (such as keeping around the values of variables that are still in scope but not used afterwards). – Max Langhof Sep 27 '19 at 10:00
Maybe my wording must be improved, I just don't know how to phrase it better. Maybe SO just isn't the place to ask a technical compiler construction question in the first place. For example, the `subq ...` of 8 bytes has no obvious use since this function *does not have any local variables whatsoever* (or at least it should not) besides the argument in `%rdi`. It basically just initializes the vtable pointer and pushes a value. So even with `-O0` there is *no evident reason this code would ever get produced, even for debugging reasons*. Please supply an example if you exactly can say why. – hochl Sep 27 '19 at 10:30
There are many reasons to introduce this for debugging purposes. OTOH it's a useful canary (albeit not in this context), I'm sure others with more debugging experience can provide a better answer to that. I don't think it's reasonable however to grumble too much about this after setting O0 or even O1. You're not asking for very aggressive optimisation at all, so you won't get it. You've found an instance where people will struggle to justify this extra stack space, but there will be instances where it's much easier to justify it. – Harrand Sep 27 '19 at 10:40
that's kind of the point -- even without optimization this should not be part of any output. How can the compiler create it at all? It's not that `-O0` means `please add some bogus code that does nothing for an entity that is not there`. I guess the question really needs deep knowledge of the internals of G++ to answer :( Btw, I tried with `clang++` and the situation is even worse. – hochl Sep 27 '19 at 10:45
It's not "please add some bogus code that does nothing for an entity that is not there" at play, it's "hey, you know those standard techniques you use which may look utterly useless in certain cases? I don't care whether you bother to remove them or not in the cases where it achieves nothing" This doesn't require deep knowledge of compiler implementations, it requires that you understand there are general practices performed by code generators to aid debugging. If you say you don't care if they're removed when unneeded, they won't be removed when unneeded. This may need to be moved to chat – Harrand Sep 27 '19 at 10:52
that one started in chat but the replies were not satisfying for me. – hochl Sep 27 '19 at 11:51
@Harrand and Paul: unfortunately this answer is pretty misleading. Some of those claimed "missed optimizations", notably aligning the stack by 16, are actually required. The only real boilerplate that is always present in debug builds is setting up RBP as a frame pointer (which the OP forcibly re-enabled at `-O1`). And the "always compile this way" for debugging stuff is just -O0 spill everything after every statement. So again, fails to explain the `sub $8,%rsp` that @hochl has been arguing about in comments. – Peter Cordes Sep 27 '19 at 12:25
@PeterCordes I never mention "missed optimizations". I'm saying that you have a scale that starts at fastest compilation time, slowest run-time execution, and best support for debugging that moves onto slowest compilation time, fastest run-time execution, and worst support for debugging (ignoring size of generated code). – Paul Evans Sep 27 '19 at 14:31
@PaulEvans: *I never mention "missed optimizations"* right, so you weren't answering the question. Which was about why the compiler inserted `sub $8, %rsp` even with some optimization enabled, i.e. created apparently sub-optimal code, aka a missed optimization. Your answer kind of implies everything the question was asking about would go away at `-O2` or `-O3`. Or else it leaves the question mostly unanswered. The part of the question that's worth answering is in those details. – Peter Cordes Sep 27 '19 at 16:12
It wasn't the clearest question, significant noise. But the `subq %rsp` was one of the two bullet points at the bottom. The OP also should have tried `-O3` and seen that it still didn't go away then. But confused people in over their heads typically miss a lot of stuff that's obvious (even to them in hindsight). I looked at the question again but I don't see any small edits that would help. A total rewrite isn't really warranted, and I don't want to do that anyway. – Peter Cordes Sep 27 '19 at 17:33

Peter Cordes · Accepted Answer · 2019-09-27T12:28:42.850

I don't see any obvious missed optimizations in your -O1 output. Except of course setting up RBP as a frame pointer, but you used -fno-omit-frame-pointer so clearly you know why GCC didn't optimize that away.

> The function has no local variables

Your function is a non-static class member function, so it has one implicit arg: this in rdi, which g++ spills to the stack because of -O0. Function args count as local variables.
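
As a rough illustration of the implicit argument, a member function can be compared with a free function that takes the object pointer explicitly. The type `Counter` and the functions `counter_add` and `run_both` are hypothetical names for this sketch. Under the x86-64 System V calling convention both forms receive the object pointer in rdi and the int in esi.

```cpp
#include <cassert>

// Hypothetical type used only for illustration.
struct Counter {
    int value = 0;
    // Member function: `this` is passed as a hidden first argument.
    void add(int n) { value += n; }
};

// Free-function equivalent with the object pointer spelled out.
void counter_add(Counter* self, int n) { self->value += n; }

// Calls both forms with the same input and returns the shared result.
int run_both() {
    Counter a, b;
    a.add(5);
    counter_add(&b, 5);
    assert(a.value == b.value);
    return a.value;
}
```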

> How does a cyclic move without an effect improve the debugging experience. Please elaborate.

To improve C/C++ debugging: debug-info formats can only describe a C variable's location relative to RSP or RBP, not which register it's currently in. It also lets you modify any variable with a debugger and continue, getting the expected results as if you'd done that in the C++ abstract machine. Every statement is compiled to a separate block of asm with no values alive in registers. (Fun fact: `register int foo` is the exception; that keyword does affect debug-mode code gen.)

The answer to "Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?" applies to G++ and other compilers as well.

> Which options would I have to set?

If you're reading / debugging the asm, use at least -Og to disable the debug-mode spill-everything-between-statements behaviour of -O0. Preferably use -O2 or -O3, unless you like seeing even more missed optimizations than you'd get with full optimization. But -Og or -O1 will do register allocation, make sane loops (with the conditional branch at the bottom), and apply various simple optimizations, although still not the standard peephole of xor-zeroing.

"How to remove "noise" from GCC/clang assembly output?" explains how to write functions that take args and return a value so you can write functions that don't optimize away.
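
A minimal sketch of that idea, using a hypothetical function named `weighted_sum`: because the inputs arrive as arguments and the result is returned, the compiler cannot constant-fold the work away, so the asm for the loop should stay visible even with optimization enabled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: inputs come from the caller and the result is
// returned, so the computation cannot be folded into a constant.
int weighted_sum(const int* data, std::size_t count, int factor) {
    assert(data != nullptr || count == 0);
    int sum = 0;
    for (std::size_t i = 0; i < count; ++i) {
        sum += data[i] * factor;
    }
    return sum;
}
```

By contrast, the same loop written inside `main` over compile-time constants would likely be reduced to a single immediate value at -O2, leaving nothing useful to read.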

Loading into RAX and then movq %rax, %rdi is just a side-effect of -O0. GCC spends so little time optimizing the GIMPLE and/or RTL internal representations of the program logic (before emitting x86 asm) that it doesn't even notice it could have loaded into RDI in the first place. Part of the point of -O0 is to compile quickly, as well as consistent debugging.

> Why is the subq $8, %rsp statement generated at all?

Because the ABI requires 16-byte stack alignment before a call instruction, and this function did an even number of 8-byte pushes. (call itself pushes a return address). It will go away at -O1 without -fno-omit-frame-pointer because you aren't forcing g++ to push/pop RBP as well as the call-preserved register it actually needs.
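
The alignment arithmetic can be sketched as a small helper. The function `padding_before_call` is a hypothetical name, and the model assumes the caller had RSP 16-byte aligned before its own call, so the 8-byte return address leaves RSP off by 8 at function entry.

```cpp
#include <cassert>

// Bytes of padding needed so RSP is 16-byte aligned before the next call.
// Assumes RSP was 16-byte aligned before the call into this function,
// so the pushed return address accounts for the first 8 bytes.
int padding_before_call(int pushes, int local_bytes) {
    assert(pushes >= 0 && local_bytes >= 0);
    const int used = 8 /* return address */ + 8 * pushes + local_bytes;
    return (16 - used % 16) % 16;
}
```

With two pushes (rbp and rbx) and no locals this gives 8, matching the `subq $8, %rsp` in the -O1 listing. With only rbx pushed it gives 0, which is consistent with the adjustment disappearing once the frame pointer is omitted. For the -O0 listing, one push plus 8 bytes for the spilled `this` also gives 8 bytes of padding, for 16 bytes in total, which fits the `subq $16, %rsp` there, although GCC's unoptimized frame layout is not guaranteed to be this tight in general.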

See also: "Why does System V / AMD64 ABI mandate a 16 byte stack alignment?"

Fun fact: clang will often use a dummy push %rcx/pop or something, depending on -mtune options, instead of an 8-byte sub.

If it were a leaf function, g++ would just use the red-zone below RSP for locals, even at -O0. See "Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?"

In un-optimized code it's not rare for G++ to allocate an extra 16 bytes it doesn't ever use. Even with optimization enabled, g++ sometimes rounds up its stack allocation size too far when aiming for a 16-byte boundary. This is a missed-optimization bug; see e.g. "Memory allocation and addressing in Assembly".
