
I was looking at some recursive function from here:

int get_steps_to_zero(int n)
{
    if (n == 0) {
        // Base case: we have reached zero
        return 0;
    } else if (n % 2 == 0) {
        // Recursive case 1: we can divide by 2
        return 1 + get_steps_to_zero(n / 2);
    } else {
        // Recursive case 2: we can subtract 1
        return 1 + get_steps_to_zero(n - 1);
    }
}

I looked at the disassembly to check whether gcc managed tail-call optimization/unrolling. It looks like it did, though with x86-64 gcc 12.2 -O3 I get a function like this, ending with two ret instructions:

get_steps_to_zero:
        xor     eax, eax
        test    edi, edi
        jne     .L5
        jmp     .L6
.L10:
        mov     edx, edi
        shr     edx, 31
        add     edi, edx
        sar     edi
        test    edi, edi
        je      .L9
.L5:
        add     eax, 1
        test    dil, 1
        je      .L10
        sub     edi, 1
        test    edi, edi
        jne     .L5
.L9:
        ret
.L6:
        ret

Godbolt example.
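
For reference, my reading of that assembly is that gcc has effectively turned the recursion into a loop. A hand-written C equivalent of what the optimized code appears to do (my own sketch, not compiler output) would be:

int get_steps_to_zero_loop(int n)
{
    int steps = 0;
    while (n != 0) {
        if (n % 2 == 0) {
            // same as recursive case 1: divide by 2
            n /= 2;
        } else {
            // same as recursive case 2: subtract 1
            n -= 1;
        }
        steps++;
    }
    return steps;
}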

What's the purpose of the multiple returns? Is it a bug?


EDIT

It seems this behavior first appeared in gcc 11.x. When compiling with gcc 10.x, the function instead ends like this:

.L1:
        mov     eax, r8d
        ret
.L6:
        xor     r8d, r8d
        mov     eax, r8d
        ret

That is: the result is moved into eax just before returning. The 11.x version instead zeroes eax at the beginning of the function and then modifies it in the function body, eliminating the need for the extra mov instruction.

Lundin
  • Looks like a missed optimisation to me. Another possible reason is that the two `ret` instructions correspond to different source lines. Keeping them separate might permit more precise debugging information. – fuz Dec 06 '22 at 10:25
  • @fuz If so then why not 3 `ret` corresponding to the 3 `return` in the original C source? – Lundin Dec 06 '22 at 10:27
  • I don't know, I have not seen your original code as you had not added it to your question. – fuz Dec 06 '22 at 10:30
  • @fuz Added. I also found out that this seems to be because of some optimization introduced in gcc 11. – Lundin Dec 06 '22 at 10:38
  • possible duplicate: [Why does GCC emit a repeated `ret`?](https://stackoverflow.com/q/62556736/995714) – phuclv Dec 06 '22 at 16:33
  • @phuclv Not so sure about that. That question is from 2020; gcc 11 was released in 2021, after which the behavior changed. – Lundin Dec 07 '22 at 07:18
  • I have posted an answer that addresses your question specifically, and the reason in the linked question is the same. Are you satisfied with my explanation? – amonakov Dec 08 '22 at 13:05
  • @amonakov I came to the same conclusion in the edit above where I compiled with an older gcc. One label would set eax to zero there too. I was mostly curious about why, since the code from my edit could as well just have set the L1 label at the second `mov eax, r8d` row. But apparently this is a known bug. – Lundin Dec 08 '22 at 13:58
  • Can you mark my answer as accepted then? – amonakov Dec 08 '22 at 14:24
  • @amonakov I believe the answer in the duplicate https://stackoverflow.com/a/62557809/584518 answers the question – Lundin Dec 08 '22 at 14:27

2 Answers


This is a manifestation of a pass-ordering problem. At some point in the optimization pipeline the two basic blocks ending in ret are not equivalent; then some pass makes them equivalent, but no following pass is capable of collapsing the two equivalent blocks into one.

On Compiler Explorer, you can see how the compiler's optimization pipeline works by inspecting snapshots of the internal representation between passes. For GCC, select "Add New > GCC Tree/RTL" in the compiler pane. Here's your example, with a snapshot immediately preceding the problematic transformation pre-selected in the new pane: https://godbolt.org/z/nTazM5zGG
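
If you prefer to reproduce this locally, gcc can write the same per-pass RTL dumps to files. Assuming the source is saved as get_steps_to_zero.c (file name is just an example), something along these lines should work; it produces one numbered dump file per RTL pass next to the object file, and the dump for the jump2 pass mentioned below should have a name ending in r.jump2:

gcc -O3 -c get_steps_to_zero.c -fdump-rtl-all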

Towards the end of the dump, you can see the two basic blocks:

   65: NOTE_INSN_BASIC_BLOCK 8
   77: use ax:SI
   66: simple_return

and

  43: NOTE_INSN_BASIC_BLOCK 9
    5: ax:SI=0
   38: use ax:SI
   74: NOTE_INSN_EPILOGUE_BEG
   75: simple_return

Basically, the second block differs in that it sets eax to zero before returning. If you look at the next pass (called "jump2"), you can see that it hoists the ax:SI=0 instruction out of basic blocks 9 and 3 into basic block 2, making BB 9 equivalent to BB 8.
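
For intuition, cross-jumping is the transformation that unifies identical instruction sequences shared by several basic blocks — both common tails, and (as here) common heads that can be hoisted into a shared predecessor. A contrived C sketch of the kind of duplication it targets follows; this is only an illustration, and for something this simple earlier GIMPLE passes may already do the merging:

extern void report_error(void);

int digit_value(int c)
{
    if (c < '0') {
        report_error();
        return -1;   /* this tail ... */
    }
    if (c > '9') {
        report_error();
        return -1;   /* ... and this identical tail are candidates for merging into one block */
    }
    return c - '0';
}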

If you disable this optimization with -fno-crossjumping, the difference will be carried to the end, making the resulting assembly less surprising.
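
To compare the two variants on this testcase (again assuming the example file name):

gcc -O3 -S get_steps_to_zero.c                    # two plain ret blocks, as in the question
gcc -O3 -S get_steps_to_zero.c -fno-crossjumping  # the blocks stay different: one return path should keep its own eax-zeroing before ret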

amonakov

Conclusion first: This is a deliberate optimization choice by GCC.

If you use GCC locally (gcc -O3 -S) instead of on Godbolt, you can see that there are alignment directives between the two ret instructions:

# top part omitted
.L9:
        ret
        .p2align 4,,10
        .p2align 3
.L6:
        ret
        .cfi_endproc

The object file, when disassembled, includes a NOP in that padding area:

   8:   75 13                   jne    1d <get_steps_to_zero+0x1d>
   a:   eb 24                   jmp    30 <get_steps_to_zero+0x30>
   c:   0f 1f 40 00             nopl   0x0(%rax)
<...>
  2b:   75 f0                   jne    1d <get_steps_to_zero+0x1d>
  2d:   c3                      ret
  2e:   66 90                   xchg   %ax,%ax
  30:   c3                      ret

The second ret instruction is aligned to a 16-byte boundary whereas the first one isn't. This allows the processor to load the instruction faster when used as a jump target from a distant source. Subsequent C return statements, however, are close enough to the first ret instruction such that they will not benefit from jumping to aligned targets.

This alignment is even more noticeable on my Zen 2 CPU with -mtune=native, with more padding bytes added:

  29:   75 f2                   jne    1d <get_steps_to_zero+0x1d>
  2b:   c3                      ret
  2c:   0f 1f 40 00             nopl   0x0(%rax)
  30:   c3                      ret
iBug
  • `ret` is 1 byte long, and unconditionally jumps. It's not useful to load later instruction bytes after it, and any code-fetch of any chunk that includes it will get the whole instruction. There's no benefit to having it be aligned. Also, if this reasoning were true, we'd expect the earlier `jmp .L6` to also target the "aligned" `ret`, with the unaligned one only reached via fall-through. – Peter Cordes Dec 06 '22 at 12:44
  • @PeterCordes `.L6` *is* the aligned `ret`. Assuming you meant the other way, it's because the `jmp .L9` instruction is close enough (<16 bytes) to the unaligned `ret` such that it will have been loaded by the time `jmp` enters the execution stage, so there's no need to load a more distant one only to serve as the jump target. – iBug Dec 06 '22 at 12:51
  • Well if you compile the godbolt example I posted to binary it inserts `xchg ax,ax` between the `ret` which iirc is Intelese for "nop". Also it ends with these two lines `cs nop WORD PTR [rax+rax*1+0x0]`, `nop DWORD PTR [rax+rax*1+0x0]`. – Lundin Dec 06 '22 at 12:56
  • But I don't understand why alignment would be an argument for having two `ret` anyway. If there was a need for such you would just use one `ret` and pad everything else with `nop`. – Lundin Dec 06 '22 at 12:58
  • Yes, I got that backwards, you're correct about what I meant to say. Interesting hypothesis about being within 16 bytes. I don't think GCC actually counts bytes, though; it just prints asm text without knowing the length in bytes. That's part of why [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) is only via an assembler option, not a compiler option. Also, jumping into a 16-byte chunk that's already in the pipeline doesn't actually help, AFAIK, except insofar as it means those bytes are definitely hot in I-cache. – Peter Cordes Dec 06 '22 at 12:59
  • Still, it's *possible* GCC devs had the same within-16-bytes idea and wrote code to do it this way on purpose, even though I don't think it's helpful on real CPUs. More likely I think it's just a missed optimization. Or left over from some CPUs where it maybe helped branch prediction to have different paths out of the function use a different `ret` instruction? Unlikely, since call/ret prediction is normally handled via special predictors. I'm pretty sure this is just a missed optimization, though. – Peter Cordes Dec 06 '22 at 13:02
  • If anyone's familiar enough with GCC internals, we could confirm this answer by finding the code inside GCC which checks for jump target range and invents a new aligned epilogue block if it's farther than 16 bytes. But it's hard to prove a negative this way; there's a *lot* of code to dig through to prove that no such code exists. – Peter Cordes Dec 06 '22 at 13:10
  • @PeterCordes Another (wild) guess is GCC not optimizing across basic blocks in the lowest-level IR. – iBug Dec 06 '22 at 13:14
  • @iBug: Yeah, that was my guess. That it doesn't check for them being the same after whatever step made them the same, only at some earlier point where they returned different source variables, perhaps in different registers. Perhaps only looking in the GIMPLE, or too early in RTL. This is a common missed optimization for GCC. I guess another way to rule out the within-16-byte hypothesis would be to find a case where different more-distant branches also target redundant tail blocks. – Peter Cordes Dec 06 '22 at 13:22
  • @PeterCordes Still doesn't explain gcc behavior up to version 10.x.x, see edit of the question. Or check this https://godbolt.org/z/48dPjv9Pf. Why didn't they just skip the part below L1 and move the L1 label to the final `mov eax, r8d`? It's the same strange code just a different flavour of it. I think part of the answer to the question lies in some changes in code generation/optimization in version 11.x.x. – Lundin Dec 06 '22 at 15:02