After looking at uop caches for a while, I am under the impression that each micro-op is 8 bytes. My question is: are all micro-ops the same size, even fused-domain micro-ops?
1 Answer
This detail is not documented by x86 chip vendors. However, uops need to be simple enough to be decoded within a fraction of a cycle. This is in contrast to x86 instructions, where decoding an instruction takes at least one cycle (although multiple instructions can be decoded in the same cycle). Making uops the same size with a fairly uniform format greatly helps achieve this, so I think fused-domain and unfused-domain uops are most probably all the same size on most x86 processors.

That said, the size may depend on the structure holding the uop. In Intel processors, uops in the uop cache can be of different sizes depending on whether a uop has an immediate and/or a displacement operand. On the other hand, the IDQ can accommodate a fixed number of uops regardless of what the uops are, which suggests that each uop in the IDQ occupies the same amount of space.

The size of a fused-domain uop might be different from that of an unfused-domain uop. But for micro-fusion to be of any use, the size of a fused-domain uop must be strictly smaller than twice the size of an unfused-domain uop. I also think we can reasonably say that a fused-domain uop is at least as large as an unfused-domain uop.
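
The size argument above can be written as a bound. This is only a restatement of that reasoning, with $s_u$ and $s_f$ as assumed symbols for the storage taken by one unfused-domain uop and one fused-domain uop in a given structure:

$$
s_u \le s_f < 2\,s_u
$$

The upper bound is what makes micro-fusion worthwhile (a fused uop must cost less than the two uops it replaces). The lower bound is only a plausibility assumption, not a documented fact.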

-
We know that in the uop cache, uops don't purely go into fixed-size slots. Agner Fog found that uops with large immediates / displacements can "borrow" space from other uops in the same cache line if there's room (https://agner.org/optimize/ in the Sandybridge section of the microarch PDF). But in the fused-domain IDQ, uops *are* fixed-width, because the size of the loop buffer doesn't depend on what kind of uops they are. And BTW, `cmp [rdi+1000000000], 0x12345` is a single uop with imm32 and disp32 (8 bytes), but it's slow to fetch from the uop cache on SKL with multiple back-to-back. – Peter Cordes Oct 22 '18 at 04:06
-
And with unlamination of (some) micro-fused uops with indexed addressing modes happening only at issue/rename (or when adding to the IDQ?), it's clear that not all parts of the fused domain use the same representation / format. ([Micro fusion and addressing modes](https://stackoverflow.com/q/26046634)). In the ROB, uops need a physical register index, rather than (or as well as?) an architectural register index. – Peter Cordes Oct 22 '18 at 04:07
-
@Peter Cordes I read somewhere that each line in the uop cache has metadata stating how many uops the line holds, plus instruction length information, so it's almost certain that they aren't all the same length. You seem to know a lot about this topic; take a look at a related question I asked: https://superuser.com/questions/1368480/how-is-the-micro-op-cache-tagged – Lewis Kelsey Oct 22 '18 at 10:48
-
@LewisKelsey: The instruction-length info is for recording the length of the *x86* instructions, not the length of the uops. There are fixed-size slots, but with some scheme for borrowing storage from other uops in the same cache way for instructions with lots of immediate data. Remember that 1 x86 instruction can decode to multiple uops, like `haddps` is 3. – Peter Cordes Oct 22 '18 at 18:03
-
It also depends on exactly what the OP is asking. In a _logical_ sense uops probably have very different sizes, since some uops need a lot more information (address offsets, immediate values, up to 4 arguments, etc.) that other uops don't have (imagine a `nop`), so these fields are "unused" in the uops that are logically smaller. _Physically_ the uop cache is a fixed structure: the physical hardware can't change, so any given "slot" looks the same over time. It is _possible_ that different slots have different lengths, but I don't think this is true either: most evidence ... – BeeOnRope Oct 22 '18 at 20:03
-
... seems to suggest that all slots are basically identical in capability, so I expect them to be the same physical size. So even a logically "small" uop has the same fields _in hardware_ as a large one, but they are just unused, zero, whatever. Then on top of that, there is a sharing scheme, which @PeterCordes mentions above, that allows some uops to borrow space from nearby uops that don't need it: in this way _logically_ larger uops can in some sense be _physically_ larger as well, although if sharing is successful, the total slot size is the same. So in large part it is a matter of semantics. – BeeOnRope Oct 22 '18 at 20:06
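
The slot-plus-borrowing idea described in these comments can be sketched as a toy model. Everything concrete here is an assumption made for illustration, not a documented hardware format: the slot count per line (`SLOTS_PER_LINE`), the per-slot data field width (`DATA_BITS_PER_SLOT`), the rule that unlendable data costs a whole extra slot, and the hypothetical helper names `slots_needed` and `fits_in_line`.

```python
import math

# Illustrative assumptions only; the real layout is not publicly documented.
SLOTS_PER_LINE = 6       # assumed number of fixed-size uop slots per cache line
DATA_BITS_PER_SLOT = 32  # assumed immediate/displacement field in each slot


def slots_needed(data_bits_per_uop):
    """Return how many slots a group of uops occupies in this toy model.

    data_bits_per_uop: immediate + displacement bits each uop has to store.
    """
    # Number of data fields each uop needs (0 for a uop with no imm/disp).
    fields = [math.ceil(bits / DATA_BITS_PER_SLOT) for bits in data_bits_per_uop]
    # Every uop occupies one slot, and that slot comes with one data field.
    extra = sum(max(0, f - 1) for f in fields)  # fields that must be borrowed
    spare = sum(1 for f in fields if f == 0)    # fields other uops can lend
    # Extra fields that no neighbour can lend cost a whole additional slot.
    return len(fields) + max(0, extra - spare)


def fits_in_line(data_bits_per_uop):
    """True if the uops fit in one cache line under the assumptions above."""
    return slots_needed(data_bits_per_uop) <= SLOTS_PER_LINE
```

Under these assumptions, three uops carrying imm32 + disp32 (64 bits each) fit in one line when mixed with three uops that carry no data, since each large uop borrows a neighbour's unused field. Six such uops back to back have nothing to borrow from and no longer fit, which matches the general shape of the behaviour described above without claiming to reproduce the actual mechanism.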
-
@LewisKelsey See also BeeOnRope's comment. – Hadi Brais Oct 22 '18 at 20:13
-
Coming back to this, I have seen values such as 72 bits and 118 bits cited for uops on P6, for instance on Wikipedia. There might be 2 or 3 separate sizes (fused / unfused, immediate / no immediate, indirect branch target, etc.). The larger size might not fit in one uop cache slot, in which case it would take two. – Lewis Kelsey Mar 03 '21 at 22:25