Near duplicate, I thought I'd already answered this but didn't find the Q&A right away: why does assigning to a short (16-bit) variable mov the value into a register and then store that, unlike other widths?
This question also has a separate missed optimization when looking at the two 8-bit assignments rather than one 16-bit assignment. Also, update: this GCC12 regression is already fixed in GCC trunk; sorry, I forgot to check that before suggesting that you report it upstream.
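For reference, here's a minimal sketch of the kind of source I'm assuming the question compiles; the struct layout and type are my guesses, reconstructed from the ldap.size / ldap.pad assignments quoted in the commented-out lines further down:

#include <stdint.h>

// Guessed layout: two adjacent 8-bit members that GCC can coalesce
// into a single 16-bit store (field names taken from the question).
struct ldap_msg {
    uint8_t size;
    uint8_t pad;
};
struct ldap_msg ldap;

void store_fields()
{
    ldap.size = 16;   // two byte stores...
    ldap.pad = 0;     // ...merged by GCC into one 16-bit store of 0x0010 = 16
}

That merged 16-bit store is presumably where GCC12's .rodata load comes from; a plain x = 16 on a uint16_t doesn't trigger it, as discussed below.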
It's avoiding Length-Changing Prefix (LCP) stalls for 16-bit immediates, but those don't exist for mov on Sandybridge and later, so it should stop doing that for tune=generic :P

You're correct, movw $16, ldap(%rip) would be optimal. That's what GCC uses when tuning for non-Intel uarches like -mtune=znver3. Or at least that's what older versions did, which didn't have the other missed optimization of loading from .rodata.
It's insane that it's loading a 16 from the .rodata section instead of using it as an immediate. The movzwl load with a RIP-relative addressing mode is already as large as mov $16, %eax, so you're correct about that. (.rodata is the section where GCC puts string literals, const variables whose address is taken or otherwise can't be optimized away, etc. Also floating-point constants; loading a constant from memory is normal for FP/SIMD since x86 lacks a mov-immediate to XMM registers, but it's rare even for 8-byte integer constants.)
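A quick illustration of that (my own example, not from the question): with GCC -O3, the string literal and the FP constant below typically land in .rodata, while the small integer constant is just an immediate in the instruction stream:

#include <stdint.h>

const char *greeting() { return "hello"; }  // string literal -> .rodata
double half(double x)  { return x * 0.5; }  // 0.5 loaded from .rodata, e.g. mulsd .LC0(%rip), %xmm0
uint64_t answer()      { return 42; }       // movl $42, %eax -- an immediate, no memory access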
GCC11 and earlier did mov $16, %eax / movw %ax, ldap(%rip) (https://godbolt.org/z/7qrafWhqd), so that's a GCC12 regression you should report on https://gcc.gnu.org/bugzilla
Loading from .rodata doesn't happen with x = 16 alone (https://godbolt.org/z/ffnjnxjWG). Presumably some interaction with coalescing two separate 8-bit stores into a 16-bit store trips up GCC.
uint16_t x __attribute__((aligned(4096)));
void store()
{
//ldap.size = 16;
//ldap.pad = 0;
x = 16;
}
# gcc12 -O3 -mtune=znver3
store:
movw $16, x(%rip)
ret
Or with the default tune=generic, GCC12 matches GCC11 code-gen.
store:
movl $16, %eax
movw %ax, x(%rip)
ret
This is optimal for Core 2 through Nehalem (Intel P6-family CPUs that support 64-bit mode, which is necessary for them to be running this code in the first place). Those are obsolete enough that it's maybe time for current GCC to stop spending extra code-size and instructions and just mov-immediate to memory, since mov imm16 opcodes specifically get special support in the pre-decoders (on Sandybridge and later) to avoid an LCP stall, where there would be one with add $1234, x(%rip). See https://agner.org/optimize/, specifically his microarch PDF. (add sign_extended_imm8 exists, mov unfortunately doesn't, so add $16, %ax wouldn't cause a problem, but $1234 would.)
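To make that concrete, here's a small example of my own (not from the question). The memory-destination forms of these two increments would be addw $16, counter(%rip) and addw $1234, counter(%rip); only the second needs a real 16-bit immediate after the 0x66 operand-size prefix, which is the length-changing-prefix case Agner Fog describes:

#include <stdint.h>

uint16_t counter;

// $16 fits add's sign-extended imm8 encoding, so even a direct
// memory-destination addw wouldn't have an LCP problem.
void bump_small(void) { counter += 16; }

// $1234 doesn't fit in a sign-extended imm8, so a 16-bit add needs a real
// imm16 after the 0x66 prefix -- the LCP-stall case on Intel pre-decoders.
void bump_big(void) { counter += 1234; }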
But since those old CPUs don't have a uop cache, an LCP stall in an inner loop could make things much slower in the worst case. So it's maybe worth making somewhat slower code for all modern CPUs in order to avoid that big pothole on the slowest CPUs.
Unfortunately GCC doesn't know that SnB fixed LCP stalls on mov: -O3 -march=haswell still does a 32-bit mov-immediate to a register first. So -march=native on modern Intel CPUs will still make slower code :/

-O3 -march=alderlake does use mov-imm16; perhaps they updated the tuning for it because it also has E-cores, which are Silvermont-family.