This is/was a missed optimization in GCC10.2 and earlier. It seems to already be fixed in current nightly builds of GCC, so no need to report a missed-optimization bug on GCC's bugzilla. (https://gcc.gnu.org/bugzilla/). It looks like it first appeared as a regression from GCC4.8 to GCC4.9. (Godbolt)
# GCC11-dev nightly build
# actually *better* than "good", avoiding a mov-elimination missed opt.
bad(int, short):
movzx esi, si # mov-elimination never works for 16->32 movzx
mov eax, edi # mov-elimination works between different regs
sal rsi, 32
or rax, rsi
ret
Yes, you'd generally expect C++ that implements the same logic basically the same way to compile to the same asm, as long as optimization is enabled, or at least hope so1. And generally you can hope that there are no pointless missed optimizations that waste instructions for no apparent reason (rather than simply picking a different implementation strategy), but unfortunately that's not always true either.
Writing different parts of the same object and then reading the whole object is tricky for compilers in general so it's not a shock to see different asm when you write different parts of the full object in a different order.
Note that there's nothing "smart" about the bad
asm, it's just doing a redundant mov
instruction. Having to take input in fixed registers and produce output in another specific hard register to satisfy the calling convention is something GCC's register allocator isn't amazing at: wasted mov
missed optimizations like this are more common in tiny functions than when part of a larger function.
If you're really curious, you could dig into the GIMPLE and RTL internal representations that GCC transformed through to get here. (Godbolt has a GCC tree-dump pane to help with this.)
Footnote 1: Or at least hope that, but missed-optimization bugs do happen in real life. Report them when you spot them, in case it's something that GCC or LLVM devs can easily teach the optimizer to avoid. Compilers are complex pieces of machinery with multiple passes; often a corner case for one part of the optimizer just didn't used to happen until some other optimization pass changed to doing something else, exposing a poor end result for a case the author of that code wasn't thinking about when writing / tweaking it to improve some other case.
Note that there's no Undefined Behaviour here despite the complaints in comments: The GNU dialect of C and C++ defines the behaviour of union type-punning in C89 and C++, not just in C99 and later like ISO C does. Implementations are free to define the behaviour of anything that ISO C++ leaves undefined.
Well technically there is a read-uninitialized because the upper 2 bytes of the void*
object haven't been written yet in pack p
. But fixing it with pack p = {.ptr=0};
doesn't help. (And doesn't change the asm; GCC happened to already zero the padding because that's convenient).
Also note, both versions in the question are less efficient than possible:
(The bad
output from GCC4.8 or GCC11-trunk avoiding the wasted mov
looks optimal for that choice of strategy.)
mov edi,edi
defeats mov-elimination on both Intel and AMD, so that instruction has 1 cycle latency instead of 0, and costs a back-end µop. Picking a different register to zero-extend into would be cheaper. We could even pick RSI after reading SI, but any call-clobbered register would work.
hand_written:
movzx eax, si # 16->32 can't be eliminated, only 8->32 and 32->32 mov
shl rax, 32
mov ecx, edi # zero-extend into a different reg with 0 latency
or rax, rcx
ret
Or if optimizing for code-size or throughput on Intel (low µop count, not low latency), shld
is an option: 1 µop / 3c latency on Intel, but 6 µops on Zen (also 3c latency, though). (https://uops.info/ and https://agner.org/optimize/)
minimal_uops_worse_latency: # also more uops on AMD.
movzx eax, si
shl rdi, 32 # int32 bits to the top of RDI
shld rax, rdi, 32 # shift the high 32 bits of RDI into RAX.
ret
If your struct was ordered the other way, with the padding in the middle, you could do something involving mov ax, si
to merge into RAX. That could be efficient on non-Intel, and on Haswell and later which don't do partial-register renaming except for high-8 regs like AH.
Given the read-uninitialized UB, you could just compile it to literally anything, including ret
or ud2
. Or slightly less aggressive, you could compile it to just leave garbage for the padding part of the struct, the last 2 bytes.
high_garbage:
shl rsi, 32 # leaving high garbage = incoming high half of ESI
mov eax, edi # zero-extend into RAX
or rax, rsi
ret
Note that an unofficial extension to the x86-64 System V ABI (which clang actually depends on) is that narrow args are sign- or zero-extended to 32 bits. So instead of zeros, the high 2 bytes of the pointer would be copies of the sign bit. (Which would actually guarantee that it's a canonical 48-bit virtual address on x86-64!)