Micro-fusion + un-lamination gets the throughput benefits of micro-fusion throughout most of the front-end, only losing it at issue/rename. Without that benefit, more code could run into bottlenecks in those earlier parts of the pipeline, especially legacy decode where any multi-uop instruction has to decode in the one "complex" decoder, not any of the "simple" decoders. https://www.realworldtech.com/sandy-bridge/4/
Sandybridge-family simplified the uop format for the out-of-order parts of the back-end (ROB and RS; see footnote 1); fewer transistors for the same number of ROB entries saves power in a power-intensive part of the CPU. For a micro-fused entry, the ROB has to keep track of whether both uops have finished executing, and it is dealing with physical register numbers, since register renaming has already happened at issue/rename/allocate.
It makes sense to me that it would be worth it to decode `vaddps ymm0, ymm1, [rdi+rdx*4]` to a single uop in the decoders and uop cache, and un-laminate later, rather than not fuse in the first place.
In the decoders, only the one complex decoder can produce more than 1 uop, so any multi-uop instruction that didn't already happen to be first in its decode group ends that group early. Having a bunch of instructions with memory operands using indexed addressing modes could cripple legacy-decode throughput as every such instruction would decode by itself, needing the complex decoder.
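As a rough illustration of that decode-group effect, here is a toy model in Python. It assumes only what the paragraph above describes: one complex decoder in the first slot and simple decoders that accept only single-uop instructions. The function name `legacy_decode_cycles` and its parameters are made up for this sketch, and real hardware has further limits (total uops per cycle, pre-decode and fetch restrictions) that it ignores.

```python
def legacy_decode_cycles(uop_counts, width=4, complex_max_uops=4):
    """Toy count of legacy-decode cycles for a stream of instructions.

    uop_counts: uops per instruction as the decoders see them
                (1 for a micro-fused load+ALU, 2 if it were not fused).
    Assumed model: slot 0 is the complex decoder (1..complex_max_uops uops),
    the other width-1 slots are simple decoders (exactly 1 uop each).
    """
    cycles = 0
    i = 0
    while i < len(uop_counts):
        if not 1 <= uop_counts[i] <= complex_max_uops:
            raise ValueError("instruction outside the range this toy model covers")
        # The first instruction of a group always goes to the complex decoder.
        i += 1
        used = 1
        # Simple decoders take following instructions only while they are 1 uop.
        while i < len(uop_counts) and used < width and uop_counts[i] == 1:
            i += 1
            used += 1
        cycles += 1
    return cycles


fused = [1] * 8      # eight indexed load+ALU instructions, micro-fused in the decoders
unfused = [2] * 8    # the same eight if the decoders emitted 2 uops each
print(legacy_decode_cycles(fused), legacy_decode_cycles(unfused))  # prints: 2 8
```

In this simplified model, a run of instructions that each need the complex decoder drops to one instruction per cycle, which is the crippling effect described above.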
In the uop cache, saving space makes sense; 6 uops per "line" isn't very big, and an extra uop for multiple instructions could easily require an extra "line" for the same 32-byte block, reducing cache density and thus hit rate. Unlike a ROB entry, a uop-cache entry only needs to get fetched as part of a block; it doesn't need to be indexed so that a completion port can mark it as "done" and ready to retire.
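The density argument is just ceiling division. A minimal sketch, assuming only the 6-uops-per-line figure mentioned above; the helper name `uop_cache_lines` is hypothetical, and the real uop cache has other placement restrictions that this ignores.

```python
import math


def uop_cache_lines(total_uops, uops_per_line=6):
    """Lines needed to hold the uops of one 32-byte block (toy model)."""
    return math.ceil(total_uops / uops_per_line)


# Six instructions in a block, all stored as single (micro-fused) uops:
print(uop_cache_lines(6))      # prints: 1
# Same block if two of them were stored as 2 uops each instead:
print(uop_cache_lines(6 + 2))  # prints: 2
```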
Intel did change things in Haswell to allow keeping more instructions micro-fused: instructions with 2 operands, with a read+write destination, can keep an indexed addressing mode micro-fused, like `addps xmm0, [rdi + rdx*4]`. But not `vaddps xmm0, xmm0, [rdi+rdx*4]`, unfortunately. See Micro fusion and addressing modes
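The rule as stated here can be written down as a small predicate. This is only a sketch of the description in this answer for load+ALU instructions, not a full model: stores and other special cases are not covered, and the function name, parameters, and the two uarch labels are made up for illustration.

```python
def load_alu_stays_micro_fused(uarch, indexed, num_operands, dest_is_read_write):
    """Does a micro-fused load+ALU uop stay fused through issue/rename?

    Assumed rules, taken from the text above only:
      - non-indexed addressing modes stay micro-fused;
      - "sandybridge": indexed addressing modes un-laminate;
      - "haswell": indexed stays fused only with 2 operands and a
        read+write destination.
    """
    if not indexed:
        return True
    if uarch == "sandybridge":
        return False
    if uarch == "haswell":
        return num_operands == 2 and dest_is_read_write
    raise ValueError("uarch not covered by this sketch")


# addps xmm0, [rdi + rdx*4]: 2 operands, xmm0 is read and written
print(load_alu_stays_micro_fused("haswell", True, 2, True))      # prints: True
# vaddps xmm0, xmm0, [rdi+rdx*4]: 3 operands, write-only destination
print(load_alu_stays_micro_fused("haswell", True, 3, False))     # prints: False
# Either form un-laminates on Sandybridge
print(load_alu_stays_micro_fused("sandybridge", True, 2, True))  # prints: False
```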
So apparently they realized or decided it was worth spending a few more bits on ROB entries to reduce un-lamination in a lot of code. A lot of the time CPUs are running scalar code, with instructions like `add rdx, [rsi+rcx]` or `mov [rdi + rcx*4], eax` (stores are store-address + store-data uops on Intel CPUs, each writing part of a store-buffer entry), not AVX. Also, the Haswell uop format had to change to accommodate single-uop FMA with 3 inputs; before that, Intel uops could have at most 2 inputs. (It wasn't until Broadwell that they took advantage of this to make `adc` and `cmov` single-uop; perhaps they wanted disabling FMA via microcode to be an option in case a bug was discovered, so didn't want to hard-wire it into how some baseline x86 instructions were handled, which couldn't be disabled in a CPU that needs to run existing binaries.)
Or is it because that whether a given μop should be unlaminated is uncertain at decode stage, and needs to be dynamically determined according to the CPU usage status at runtime?
Maybe there's something to that idea; in pre-decode, instructions get steered to an appropriate decoder. Some opcodes always get steered to the complex decoder, limiting them to 1/clock throughput from legacy decode even if that instance of that opcode actually decodes to a single uop. (At least that's our best theory to explain Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?)
If the pre-decoders had to steer to the complex decoder based on indexed addressing mode, they might do something unfortunate like sending any instruction with a SIB byte to the complex decoder, including `add eax, [rsp+16]`.
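To see why "has a SIB byte" is a poor proxy for "indexed", here is a small sketch of the x86-64 encoding rule: a ModRM r/m field of 100b means a SIB byte follows, so a base register whose low 3 bits are 100b (rsp, r12) needs a SIB byte even with no index register. The helper names are made up for this example, and it only covers addressing modes that have a base register.

```python
# Register numbers as used in ModRM/SIB encoding (REX extends 8..15).
REG_NUM = {
    "rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
    "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7,
    "r8": 8, "r9": 9, "r10": 10, "r11": 11,
    "r12": 12, "r13": 13, "r14": 14, "r15": 15,
}


def needs_sib(base, index=None):
    """True if [base + index*scale + disp] must be encoded with a SIB byte."""
    # Any index register requires SIB; so does a base whose low 3 bits are 100b,
    # because r/m = 100b in ModRM is the escape code meaning "SIB follows".
    return index is not None or (REG_NUM[base] & 7) == 4


def is_indexed(index=None):
    """True only if the addressing mode really uses an index register."""
    return index is not None


# add eax, [rsp+16]: SIB byte present, but not an indexed addressing mode
print(needs_sib("rsp"), is_indexed())              # prints: True False
# add rdx, [rsi+rcx]: SIB byte present and indexed
print(needs_sib("rsi", "rcx"), is_indexed("rcx"))  # prints: True True
# add eax, [rdi+16]: no SIB byte needed
print(needs_sib("rdi"), is_indexed())              # prints: False False
```

A steering rule keyed on the mere presence of a SIB byte would treat the first case like the second, which is the unfortunate outcome suggested above.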
It probably also kept parts of the decoders more similar to Nehalem, always micro-fusing memory operands regardless of the addressing mode, if possible for that instruction.
Footnote 1: I don't remember where I read that fact about Intel simplifying the internal uop format in the back-end. It's not in https://www.realworldtech.com/sandy-bridge/ so maybe in https://agner.org/optimize/ (microarch guide)