Why is unlamination of μops necessary?

Question

In "MicroFusion in Intel CPUs." by Denis Bakhvalov, he said:

Unlamination for SandyBridge is described in Intel® 64 and IA-32 Architectures Optimization Reference Manual in chapter “2.3.2.4: Micro-op Queue and the Loop Stream Detector (LSD)”:

The micro-op queue provides post-decode functionality for certain instructions types. In particular, loads combined with computational operations and all stores, when used with indexed addressing, are represented as a single micro-op in the decoder or Decoded ICache. In the micro-op queue they are fragmented into two micro-ops through a process called un-lamination, one does the load and the other does the operation

And in a Hacker News thread, BeeOnRope noted:

When instructions are fused at decode, but are “unlaminated” before rename, it usually has similar performance to no fusion at all (but it does save space in the uop cache), since RAT is more likely to be a performance limitation.

In this case, why use unlamination instead of simply producing more μops when the instructions are decoded? Doesn't it seem unnecessary?

Or is it because whether a given μop should be unlaminated is uncertain at the decode stage, and needs to be determined dynamically according to the CPU's state at runtime?

score 5 · Accepted Answer · answered Jan 01 '23 at 16:52

Micro-fusion + un-lamination gets the throughput benefits of micro-fusion throughout most of the front-end, only losing it at issue/rename. Without that benefit, more code could run into bottlenecks in those earlier parts of the pipeline, especially legacy decode where any multi-uop instruction has to decode in the one "complex" decoder, not any of the "simple" decoders. https://www.realworldtech.com/sandy-bridge/4/

Sandybridge-family simplified the uop format for the out-of-order parts of the back-end (ROB and RS)¹; fewer transistors for the same number of ROB entries saves power in a power-intensive part of the CPU. The ROB has to keep track of whether both uops have finished executing, and is dealing with physical register numbers since register-rename has already happened on issue/rename/allocate.

It makes sense to me that it would be worth it to decode vaddps ymm0, ymm1, [rdi+rdx*4] to a single uop in the decoders and uop cache, and un-laminate later, rather than not fuse in the first place.

In the decoders, only the one complex decoder can produce more than 1 uop, so any multi-uop instruction that didn't already happen to be first in its decode group ends that group early. Having a bunch of instructions with memory operands using indexed addressing modes could cripple legacy-decode throughput as every such instruction would decode by itself, needing the complex decoder.
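
As a rough illustration of that decode-group effect, here is a toy Python model. It is only a sketch, not a cycle-accurate simulator: it assumes one complex decoder plus three simple decoders, and it ignores pre-decode limits, macro-fusion, any cap on total uops per clock, and the microcode sequencer. The name `decode_cycles` and its parameters are made up for this example.

```python
def decode_cycles(uop_counts, width=4):
    """Count decode cycles in a toy legacy-decode model.

    uop_counts: uops each instruction produces at decode, in program order.
    Slot 0 is the complex decoder (accepts any uop count in this toy model);
    the remaining width-1 slots are simple decoders (1-uop instructions only).
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        i += 1          # first instruction of the group takes the complex decoder
        slots = 1
        # Fill the simple decoders until a multi-uop instruction ends the group.
        while i < n and slots < width and uop_counts[i] == 1:
            i += 1
            slots += 1
    return cycles


# Eight instructions with indexed memory operands:
fused_at_decode = [1] * 8   # each micro-fused into a single uop in the decoders
not_fused = [2] * 8         # each decoding to two uops, so each needs the complex decoder

print(decode_cycles(fused_at_decode))  # 2 cycles in this model
print(decode_cycles(not_fused))        # 8 cycles in this model
```

In this model, micro-fusing at decode lets the eight instructions decode in two groups of four, while decoding them as two uops each forces one instruction per cycle.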

In the uop cache, saving space makes sense; 6 uops per "line" isn't very big, and an extra uop for multiple instructions could easily require an extra "line" for the same 32-byte block, reducing cache density and thus hit rate. Unlike the ROB, it only needs to get fetched as part of a block, not indexed to have a completion port mark it as "done" and ready to retire.
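
The density argument is simple arithmetic. A minimal sketch, assuming only the 6-uops-per-line figure mentioned above and ignoring the uop cache's other placement rules (`uop_cache_lines` is a hypothetical helper name):

```python
UOPS_PER_LINE = 6  # figure quoted in the answer above


def uop_cache_lines(uops_in_block):
    """Lines needed to hold the uops of one 32-byte block (ceiling division)."""
    return -(-uops_in_block // UOPS_PER_LINE)


# Hypothetical 32-byte block: 6 instructions, 3 of which have an indexed memory operand.
fused = 6          # every instruction is one (micro-fused) uop in the uop cache
unfused = 6 + 3    # the 3 indexed instructions would each cost an extra uop

print(uop_cache_lines(fused))    # 1
print(uop_cache_lines(unfused))  # 2
```

So in this example, storing the un-laminated form would double the number of lines used for the same block of code.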

Intel did change things in Haswell to allow keeping more instructions micro-fused: instructions with 2 operands, with a read+write destination, can keep an indexed addressing mode micro-fused, like addps xmm0, [rdi + rdx*4]. But not vaddps xmm0, xmm0, [rdi+rdx*4], unfortunately. See Micro fusion and addressing modes
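
To make that rule concrete, here is a small Python predicate that encodes only what the paragraph above says, for load+ALU instructions. It is a sketch under stated assumptions, not a full model: stores, instructions that can't micro-fuse at all, and other special cases are out of scope, and the function and argument names are invented for illustration.

```python
def stays_micro_fused(uarch, indexed, num_operands, dest_is_read_write):
    """Does a load+ALU instruction stay micro-fused through issue/rename?

    Encodes only the rule described above; assumes the instruction can
    micro-fuse in the decoders in the first place.
    uarch: "sandybridge" or "haswell".
    """
    if uarch not in ("sandybridge", "haswell"):
        raise ValueError("this sketch only models sandybridge and haswell")
    if not indexed:
        return True       # non-indexed addressing modes are not un-laminated
    if uarch == "sandybridge":
        return False      # indexed addressing modes un-laminate
    # Haswell: indexed stays fused only for 2-operand, read+write destination.
    return num_operands == 2 and dest_is_read_write


# addps xmm0, [rdi + rdx*4]: 2 operands, destination is read and written
print(stays_micro_fused("haswell", True, 2, True))        # True
# vaddps xmm0, xmm0, [rdi+rdx*4]: 3 operands, destination is write-only
print(stays_micro_fused("haswell", True, 3, False))       # False
# addps xmm0, [rdi + rdx*4] on Sandybridge
print(stays_micro_fused("sandybridge", True, 2, True))    # False
```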

So apparently they realized or decided it was worth spending a few more bits on ROB entries to reduce un-lamination in a lot of code. A lot of the time CPUs are running scalar code, with instructions like add rdx, [rsi+rcx] or mov [rdi + rcx*4], eax (stores are store-address + store-data uops on Intel CPUs, each writing part of a store-buffer entry), not AVX. Also, the Haswell uop format had to change to accommodate single-uop FMA with 3 inputs; before that Intel uops could have at most 2 inputs. (It wasn't until Broadwell that they took advantage of this to make adc and cmov single-uop; perhaps they wanted disabling FMA via microcode to be an option in case a bug was discovered, so didn't want to hard-wire it into how some baseline x86 instructions were handled, which couldn't be disabled in a CPU that needs to run existing binaries.)

Or is it because whether a given μop should be unlaminated is uncertain at the decode stage, and needs to be determined dynamically according to the CPU's state at runtime?

Maybe there's something to that idea; in pre-decode, instructions get steered to an appropriate decoder. Some opcodes always get steered to the complex decoder, limiting them to 1/clock throughput from legacy decode even if that instance of that opcode actually decoded to a single uop. (At least that's our best theory to explain Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?)

If the pre-decoders had to steer to the complex decoder based on indexed addressing mode, they might do something unfortunate like sending any instruction with a SIB byte to the complex decoder, including add eax, [rsp+16].

It probably also kept parts of the decoders more similar to Nehalem, always micro-fusing memory operands regardless of the addressing mode, if possible for that instruction.

Footnote 1: I don't remember where I read that fact about Intel simplifying the internal uop format in the back-end. It's not in https://www.realworldtech.com/sandy-bridge/ so maybe in https://agner.org/optimize/ (microarch guide)

I perhaps should not have said "usually [the 4 uop rename limit is more important]" - actually I don't really know. I guess that comes from my own experience when micro-optimizing: usually you can organize hits in the uop cache so decode limits don't matter much, and usually you can organize things so that the uop cache rules don't limit bandwidth (but this takes care), leaving the 4/cycle limit as the inviolable one. In other scenarios like compiled code or more lightly optimized stuff the other limits may very well be more important. — BeeOnRope, Jan 23 '23 at 02:40
@BeeOnRope: Yeah, for code that spends a lot of time in loops, the uop cache avoids legacy decode bottlenecks on CPUs as they exist now, with micro-fusion. But not micro-fusing uops that can't stay fused in the ROB would cost extra uop cache space and might possibly lead to some 32-byte blocks of some loops not fitting in the uop cache. (Many 2-byte or 3-byte instructions with memory operands could push the average uops per byte upward, and leave space unused at the ends of some lines.) It would of course also take more uop cache space in total, so more misses, making legacy decode a bit more important. — Peter Cordes, Jan 23 '23 at 04:29
