
This experiment was done using GCC 6.3. The two functions below differ only in the order in which they assign `i32` and `i16` in the struct. We assumed both functions should produce the same assembly; however, this is not the case: the "bad" function produces more instructions. Can anyone explain why this happens?

#include <inttypes.h>

union pack {
    struct {
        int32_t i32;
        int16_t i16;
    };
    void *ptr;
};
static_assert(sizeof(pack)==8, "what?");

void *bad(const int32_t i32, const int16_t i16) { 
    pack p;
    p.i32 = i32;
    p.i16 = i16;
    return p.ptr;
}

void *good(const int32_t i32, const int16_t i16) { 
    pack p;
    p.i16 = i16;
    p.i32 = i32;
    return p.ptr;
}

...

bad(int, short):
        movzx   eax, si
        sal     rax, 32
        mov     rsi, rax
        mov     eax, edi
        or      rax, rsi
        ret
good(int, short):
        movzx   eax, si
        mov     edi, edi
        sal     rax, 32
        or      rax, rdi
        ret

The compiler flags were -O3 -fno-rtti -std=c++14
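(For comparison, the same bit layout can also be built without union punning at all. This is a sketch, not code from the question: `pack_portable` is a hypothetical name, and it assumes 64-bit pointers and little-endian x86-64. `memcpy`-based type punning is well-defined in ISO C++, unlike union punning, which GNU C++ defines as an extension.)

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical portable equivalent: build the 64-bit value explicitly,
// then memcpy its object representation into a pointer object.
void *pack_portable(int32_t i32, int16_t i16) {
    uint64_t bits = (static_cast<uint64_t>(static_cast<uint16_t>(i16)) << 32)
                  | static_cast<uint32_t>(i32);
    void *p;
    static_assert(sizeof(p) == sizeof(bits), "64-bit pointers assumed");
    std::memcpy(&p, &bits, sizeof(p));   // well-defined type punning
    return p;
}
```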

Peter Cordes
chasep255
  • did you turn on optimizations? What compiler flags did you use? Is number of instructions your only indicator for "bad" vs "good"? – 463035818_is_not_an_ai Oct 15 '20 at 13:26
  • 3
    afaik both have ub, so what the compiler generates isnt that relevant actually – 463035818_is_not_an_ai Oct 15 '20 at 13:27
  • 2
    in C++ type punning via unions is not allowed (in C it is). See here: https://stackoverflow.com/questions/25664848/unions-and-type-punning – 463035818_is_not_an_ai Oct 15 '20 at 13:30
  • 3
    @idclev463035818: [GCC / `g++` defines the behaviour of union type-punning in C89 and C++](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html), not just ISO C99. So there's no UB here on the compiler the OP is using. Remember that implementations are free to define any behaviour that ISO C++ leaves undefined, such as `gcc -fwrapv` even defining signed integer overflow. – Peter Cordes Oct 15 '20 at 13:35
  • @PeterCordes ok, good to know. I'll leave the comments, others might stumble upon the same – 463035818_is_not_an_ai Oct 15 '20 at 13:36
  • Even if one interprets the code as if it were C, not C++, modulo the difference (in C) between types `pack` and `union pack`, the assignments to the 32-bit and 16-bit pieces are performed in a different order by the two functions, so I don't see on what basis one would assume identical assembly. – John Bollinger Oct 15 '20 at 13:39
  • @JohnBollinger: I thought the same thing at first, but note that it's a `struct` inside the union, so the assignments are in fact independent, to non-overlapping bytes in the union/struct object that's later read as a `void*`. Also note that this is GNU C++ so no UB, see my previous comment. – Peter Cordes Oct 15 '20 at 13:41
  • Yes, @PeterCordes. I'm not saying that reordering the assignments in the assembly *vs*. the C or C++ source would be wrong, but rather that a direct transliteration into assembly would not produce identical code for the two, so it's unclear why the OP assumes that the compiler definitively *should* produce the same (optimized) assembly. – John Bollinger Oct 15 '20 at 13:48
  • @JohnBollinger: Oh. Because `g++ -O3` should see that the final result is the same and optimize them to the same code. `-O3` is the opposite of a direct transliteration to asm. It's certainly obvious that `-O0` would produce different asm for a different statement order. – Peter Cordes Oct 15 '20 at 13:50
  • 1
    @PeterCordes, "should" is not the word that I think most appropriate here, and that is my point. The compiler *could* emit identical assembly for the two cases, and it is reasonable to hope that it would do, but if in fact it doesn't then what kind of answer does the OP want? "It could, but it just doesn't" or "it does produce identical assembly if you [...]" or even "here's the relevant code of GCC's optimizer"? What is the unsound assumption they are making that needs to be addressed? – John Bollinger Oct 15 '20 at 13:58
  • @JohnBollinger maybe that assumption is that "bad" is bad and "good" is good. At least that's what OP and their compiler seem to disagree about – 463035818_is_not_an_ai Oct 15 '20 at 14:10
  • @JohnBollinger: Ok, I see what you're getting at. But I think you're unnecessarily picking on the OP's choice of the word "assume". It's a fine question if we just take it as "why is there a difference?" And yes, as usual the right answer is "this is just a missed optimization, not significant and no way to predict which one would be better without checking the asm." If anyone wants to find the relevant GCC change that fixed it, that might be fun but probably require a lot of background knowledge of GCC internals to actually understand. – Peter Cordes Oct 15 '20 at 14:41

1 Answer


This is/was a missed optimization in GCC10.2 and earlier. It seems to already be fixed in current nightly builds of GCC, so no need to report a missed-optimization bug on GCC's bugzilla. (https://gcc.gnu.org/bugzilla/). It looks like it first appeared as a regression from GCC4.8 to GCC4.9. (Godbolt)

# GCC11-dev nightly build
# actually *better* than "good", avoiding a mov-elimination missed opt.
bad(int, short):
        movzx   esi, si          # mov-elimination never works for 16->32 movzx
        mov     eax, edi         # mov-elimination works between different regs
        sal     rsi, 32
        or      rax, rsi
        ret

Yes, you'd generally expect C++ that implements the same logic basically the same way to compile to the same asm, as long as optimization is enabled, or at least hope so¹. And generally you can hope that there are no pointless missed optimizations that waste instructions for no apparent reason (rather than simply picking a different implementation strategy), but unfortunately that's not always true either.

Writing different parts of the same object and then reading the whole object is tricky for compilers in general so it's not a shock to see different asm when you write different parts of the full object in a different order.

Note that there's nothing "smart" about the bad asm, it's just doing a redundant mov instruction. Having to take input in fixed registers and produce output in another specific hard register to satisfy the calling convention is something GCC's register allocator isn't amazing at: wasted mov missed optimizations like this are more common in tiny functions than when part of a larger function.

If you're really curious, you could dig into the GIMPLE and RTL internal representations that GCC transformed through to get here. (Godbolt has a GCC tree-dump pane to help with this.)

Footnote 1: Or at least hope that, but missed-optimization bugs do happen in real life. Report them when you spot them, in case it's something that GCC or LLVM devs can easily teach the optimizer to avoid. Compilers are complex pieces of machinery with multiple passes; often a corner case for one part of the optimizer just didn't use to happen until some other optimization pass changed to doing something else, exposing a poor end result for a case the author of that code wasn't thinking about when writing / tweaking it to improve some other case.


Note that there's no Undefined Behaviour here despite the complaints in comments: The GNU dialect of C and C++ defines the behaviour of union type-punning in C89 and C++, not just in C99 and later like ISO C does. Implementations are free to define the behaviour of anything that ISO C++ leaves undefined.

Well, technically there is a read of uninitialized data, because the upper 2 bytes of the void* object haven't been written yet in pack p. But fixing it with pack p = {.ptr=0}; doesn't help. (And doesn't change the asm; GCC happened to already zero the padding because that's convenient).


Also note, both versions in the question are less efficient than possible:

(The bad output from GCC4.8 or GCC11-trunk avoiding the wasted mov looks optimal for that choice of strategy.)

mov edi,edi defeats mov-elimination on both Intel and AMD, so that instruction has 1 cycle latency instead of 0, and costs a back-end µop. Picking a different register to zero-extend into would be cheaper. We could even pick RSI after reading SI, but any call-clobbered register would work.

hand_written:
    movzx  eax, si    # 16->32 can't be eliminated, only 8->32 and 32->32 mov
    shl    rax, 32
    mov    ecx, edi   # zero-extend into a different reg with 0 latency
    or     rax, rcx
    ret

Or if optimizing for code-size or throughput on Intel (low µop count, not low latency), shld is an option: 1 µop / 3c latency on Intel, but 6 µops on Zen (also 3c latency, though). (https://uops.info/ and https://agner.org/optimize/)

minimal_uops_worse_latency:  # also more uops on AMD.
    movzx  eax, si
    shl    rdi, 32              # int32 bits to the top of RDI
    shld   rax, rdi, 32         # shift the high 32 bits of RDI into RAX.
    ret
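(If the `shld` semantics aren't familiar: `shld dst, src, n` shifts `dst` left by `n` and fills the vacated low bits from the high bits of `src`. A sketch emulating that in C++, with hypothetical names, to check that the sequence above really produces the packed value:)

```cpp
#include <cstdint>

// shld dst, src, n: shift dst left by n, shifting in the high n bits of src.
// Valid here for 0 < n < 64 (shifting by 0 or 64 would be UB in C++).
uint64_t shld_emulated(uint64_t dst, uint64_t src, unsigned n) {
    return (dst << n) | (src >> (64 - n));
}

// Emulation of the minimal_uops_worse_latency sequence.
uint64_t pack_via_shld(int32_t i32, int16_t i16) {
    uint64_t rax = static_cast<uint16_t>(i16);                              // movzx eax, si
    uint64_t rdi = static_cast<uint64_t>(static_cast<uint32_t>(i32)) << 32; // shl rdi, 32
    return shld_emulated(rax, rdi, 32);                                     // shld rax, rdi, 32
}
```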

If your struct were ordered the other way, with the padding in the middle, you could do something involving mov ax, si to merge into RAX. That could be efficient on non-Intel, and on Haswell and later, which don't do partial-register renaming except for high-8 regs like AH.


Given the read-uninitialized UB, you could just compile it to literally anything, including ret or ud2. Or slightly less aggressive, you could compile it to just leave garbage for the padding part of the struct, the last 2 bytes.

high_garbage:
    shl    rsi, 32    # leaving high garbage = incoming high half of ESI
    mov    eax, edi   # zero-extend into RAX
    or     rax, rsi
    ret

Note that an unofficial extension to the x86-64 System V ABI (which clang actually depends on) is that narrow args are sign- or zero-extended to 32 bits. So instead of zeros, the high 2 bytes of the pointer would be copies of the sign bit. (Which would actually guarantee that it's a canonical 48-bit virtual address on x86-64!)
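(A sketch of what the high_garbage version computes under that unofficial extension, with a hypothetical name: if the int16_t arg arrives sign-extended to 32 bits in ESI, then shl rsi,32 leaves copies of its sign bit in the top 2 bytes of the result:)

```cpp
#include <cstdint>

// Emulation of high_garbage, assuming the incoming i16 was sign-extended
// to 32 bits by the caller (the unofficial extension clang relies on).
uint64_t high_garbage_emulated(int32_t i32, int16_t i16) {
    uint64_t rsi = static_cast<uint32_t>(static_cast<int32_t>(i16)); // low 32 bits of RSI
    rsi <<= 32;                                    // shl rsi, 32: sign copies reach the top 2 bytes
    uint64_t rax = static_cast<uint32_t>(i32);     // mov eax, edi: zero-extend into RAX
    return rax | rsi;                              // or rax, rsi
}
```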

Peter Cordes