Question about micro-op fusion and ROB entry occupation, and the 'Micro-op fusion' section in Agner's doc

Question

1- In chapter 8 ('Intel Core 2 and Nehalem pipeline') of his micro-architecture doc, Agner Fog said:

8.4 Micro-op fusion
The fused μop is treated as two μops by the scheduler and submitted to two different execution units, but it is treated as one μop in all other stages in the pipeline and uses only one entry in the reorder buffer.

Meanwhile, the RISC-V edition of 'Computer Organization and Design' by Patterson and Hennessy says the following in the subsection 'The Intel Core i7 920' (which is the Bloomfield architecture) of '4.11 Real Stuff: The ARM Cortex-A53 and Intel Core i7 Pipelines':

Microfusion in the fourth step combines micro-operation pairs such as load/ALU operation and ALU operation/store and issues them to a single reservation station (where they can still issue independently), thus increasing the usage of the buffer

Do they refer to the same buffer? If not, which buffer does the book refer to?

2- Section 8.4 of Agner's doc says the following:

Instructions that have both a rip-relative address and immediate data cannot use μop fusion

Why does fusion fail when there is an immediate?

2.1- Agner gives a reason for the failure with an immediate in the macro-op fusion case:

There is not enough space for storing both an immediate operand, the address of a memory operand, and the address of a branch target in the same ROB entry. My guess is that this is the reason why we can't have macro-op fusion with both a memory operand and an immediate operand. This may also be the reason why macro-op fusion doesn't work in 64-bit mode on Core2: 64-bit branch addresses take more space in the ROB entry.

Does this reason also explain the micro-op fusion restriction for instructions that have both a memory operand and an immediate?

The reservation station is the scheduler, not the reorder buffer (e.g. https://www.realworldtech.com/merom/5/). RISC-V microarchitectures are apparently different from Intel's. 2.1 - micro-fusion is possible with other addressing modes. Perhaps the uop stores the full 48-bit absolute address, if the front-end handles the RIP-relative addressing mode to produce an absolute address during decode? Or something where the physical bits are used like a union, and some signalling mechanism can't encode both micro-fused and RIP-relative. — Peter Cordes, Jun 03 '23 at 06:30
1- Sorry for misspelling the reference author's name. 2- I have updated the question to add a description of the COD context. The excerpt from COD is also about an Intel CPU, so why does COD say 'increasing the usage'? 3- Maybe the architectural detail is too complex. After all, Agner says 'My guess ...' in his microarchitecture doc. — zg c, Jun 03 '23 at 07:52
Maybe Agner was wrong in his reverse-engineered conclusion about micro-fused uops taking 2 scheduler entries in Core 2 and Nehalem (P6-family). It's true in Sandybridge-family, but SnB simplified the uop format. I seem to recall some previous discussion with someone about P6-family keeping uops fused even in the scheduler, which would naturally take more space in the very power-intensive scheduler. — Peter Cordes, Jun 03 '23 at 12:59
Agner's conclusion is *uses only one entry in the reorder buffer*, which is stated for Core 2 and also applies in his section 9.5 'Intel Sandy Bridge and Ivy Bridge pipeline -> Micro-op fusion', where he says *The processor uses μop fusion in the same way as previous processors*. 1- Do you mean Core 2 takes 2 entries but SnB (Sandybridge) decreases that to 1? If so, I understand the question. Thanks. — zg c, Jun 06 '23 at 11:43
2- Additional [info](https://en.wikichip.org/wiki/intel/microarchitectures/sandy_bridge_(client)#Renaming_.26_Allocation) I recently found says 'two fused µOPs only occupy a single entry in the ROB' on Sandy Bridge. — zg c, Jun 06 '23 at 11:58
No, I mean the opposite, that perhaps Core2 takes 1 **scheduler** (RS) entry for a micro-fused uop. I'm pretty sure SnB takes 2 scheduler entries for the halves of a micro-fused uop, since they simplified the uop format, making it more compact (but also unable to keep indexed addressing modes micro-fused, except sometimes in Haswell and later: [Why unlamination of μops necessary?](https://stackoverflow.com/q/74975717)). They both only take one **ROB** entry for a micro-fused uop; that's a well-known fact we can see more directly from performance counters for retire slots. — Peter Cordes, Jun 06 '23 at 13:24
I don't remember for sure whether Core2/Nehalem can keep both halves of a micro-fused uop in the same RS (aka scheduler) entry. If so, then Agner is wrong. But if not then Agner's right and the book is wrong. This is what they disagree about, and agree about everything else. Taking fewer entries for the same number of uops makes the RS *effectively* bigger, able to hold uops from more instructions. — Peter Cordes, Jun 06 '23 at 13:27
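The capacity argument in the comment above can be put in numbers. This is only a minimal Python sketch, not a model of any real scheduler: the 36-entry RS size and the function name `instructions_in_rs` are assumptions chosen for illustration, and every instruction is assumed to be a micro-fused load+ALU pair.

```python
def instructions_in_rs(rs_entries: int, entries_per_fused_uop: int) -> int:
    """How many micro-fused load+ALU instructions fit in the RS at once.

    rs_entries: scheduler size (hypothetical value supplied by the caller).
    entries_per_fused_uop: 1 if both halves share one RS entry,
                           2 if each half takes its own entry.
    """
    if entries_per_fused_uop not in (1, 2):
        raise ValueError("a micro-fused uop has exactly two halves")
    return rs_entries // entries_per_fused_uop


RS_ENTRIES = 36  # assumed size, for illustration only

shared = instructions_in_rs(RS_ENTRIES, 1)    # halves share one entry
separate = instructions_in_rs(RS_ENTRIES, 2)  # halves take two entries
print(shared, separate)  # 36 18
```

Under these assumptions the two readings differ by a factor of 2 in how many instructions the scheduler can hold, which is the size of the effect a benchmark would have to detect.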
1- I understand now, and I will try using 'performance counters for retire slots' (which I didn't know about before), similar to what this [easyperf blog](https://easyperf.net/blog/2018/12/29/Understanding-IDQ_UOPS_NOT_DELIVERED) does, using `event=0xc2,umask=0x2` as in the [kernel code](https://lore.kernel.org/lkml/1462489447-31832-4-git-send-email-andi@firstfloor.org/). — zg c, Jun 07 '23 at 06:03
2- To summarize my question based on your comments: 2.1- The 'buffer' in my first question is the scheduler (i.e. reservation station). Both use one ROB entry, while the scheduler entry count needs to be tested because the two docs conflict. So how do I test it? Does that still use performance counters for something? 2.2- The reason why an immediate is not allowed with μop fusion is not clear; maybe it is a storage-space problem or an encoding problem? Is my conclusion right? — zg c, Jun 07 '23 at 06:08
See [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/q/51986046) for the kind of experiment that can measure effective scheduler size. If micro-fused loads (e.g. of data not stored recently but not in L1d cache) reduce the out-of-order window size, they're taking separate entries. But it's not easy to construct a good benchmark because independent loads executing 2/clock can get out of the RS and make room for new imul uops very fast. So perhaps some kind of data dependency for the load address. — Peter Cordes, Jun 07 '23 at 12:54
I guess perhaps `imul rax, [rdi+rax]` where RAX=0 and RDI points to some data? Or `imul rax, [rax]` where `[rax]` points to `1`. Then the load is the bottleneck, so it runs at a different speed than `imul rax, rdi`, but the cutoff point should be the same in number of `imul` operations where the slope changes as the RS can't fully overlap both dep chains all the time. Or should be a factor of 2 lower if micro-fusion doesn't apply to the RS, so it's a big effect we're looking for. — Peter Cordes, Jun 07 '23 at 12:54
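As a rough illustration of the loop shape suggested in the two comments above, here is a Python sketch that only generates NASM-syntax text for one loop iteration. The function name `loop_body`, the `.loop` label, the use of ECX as the loop counter and RDX for the second chain are all assumptions made for this sketch; the caller is assumed to set RAX=0 and point RDI at readable memory. Assembling and timing the result is left out.

```python
def loop_body(chain_len: int, fused_load: bool) -> str:
    """Return NASM-syntax text for one loop iteration with two dep chains.

    Chain A: chain_len imul instructions on RAX. With fused_load=True each
    one has a memory source operand (a micro-fusion candidate) whose address
    depends on the previous result.
    Chain B: chain_len register-only imul instructions on RDX.
    """
    if chain_len < 1:
        raise ValueError("chain_len must be positive")
    a_insn = "imul rax, [rdi+rax]" if fused_load else "imul rax, rdi"
    lines = [".loop:"]
    lines += [f"    {a_insn}"] * chain_len       # chain A
    lines += ["    imul rdx, rdx"] * chain_len   # chain B, independent of A
    lines += ["    dec ecx", "    jnz .loop"]
    return "\n".join(lines)


print(loop_body(2, fused_load=True))
```

Sweeping `chain_len` for both variants and plotting cycles per iteration would show where the slope changes. If the change happens at about half the chain length in the memory-operand variant, that would suggest each micro-fused uop takes two RS entries.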
Re: measuring retire slots, yes, but you don't need to look up event numbers. `perf stat -e uops_retired.retire_slots` on Skylake at least. Modern `perf` has human-readable names for uarch-specific event numbers, you don't even need a wrapper like `ocperf.py`. Use `perf list` to see the names of events. (`uops_issued.any` is the corresponding front-end event which also counts in the fused domain, but after un-lamination if it happens. uops_issued can be higher than uops_retired in case of mis-speculation, but otherwise AFAIK they're always the same.) — Peter Cordes, Jun 07 '23 at 13:00
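To interpret such counts, one option is to divide each uop count by the instruction count. The sketch below is Python arithmetic only. The counter values are hypothetical placeholders rather than measurements, and `uops_per_instruction` is a name invented for this example.

```python
def uops_per_instruction(uop_count: int, instructions: int) -> float:
    """Average uops per instruction for one counter reading."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return uop_count / instructions


# Hypothetical readings for a run made only of load+ALU instructions.
instructions = 1_000_000
fused_domain = 1_000_000     # e.g. a retire-slots style count
unfused_domain = 2_000_000   # e.g. a count of executed uops

print(uops_per_instruction(fused_domain, instructions))    # 1.0
print(uops_per_instruction(unfused_domain, instructions))  # 2.0
```

A fused-domain ratio near 1 alongside an unfused-domain ratio near 2 would be consistent with each micro-fused uop occupying a single ROB entry.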
Thanks. I will try measuring when I have some time. Replying to the comments above: 1. What does '2/clock' mean? Which two things? 2. On my Ryzen 4800H CPU, I installed perf with `yay -S perf` on Arch Linux, but `perf list` has no `uops_issued`, while in this [link](https://relate.cs.illinois.edu/course/cs598apk-f18/f/demos/upload/perf/Using%20Performance%20Counters.html) `uops_issued` exists. I do have `uops_retired \n [Micro-ops Retired]` in `perf list -v`. — zg c, Jun 14 '23 at 02:12
(1) "2/clock" means a throughput of 2 loads per clock cycle, "2 per clock". (2) `uops_issued.any` is an event on my Intel Skylake CPU. I'm not surprised that AMD CPUs have a different selection of events, or maybe different names for microarchitecturally similar things. Your question was asking about Core 2 / Nehalem, so I was answering based on how to measure things on Intel CPUs. AMD has different scheduling queues for integer ALUs vs. load ports in Zen 2 for example (https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram), not a unified scheduler. — Peter Cordes, Jun 14 '23 at 02:25
Thanks. 1. The book is based on Intel, while for now I only have an AMD Zen 2 laptop. 2. Back to the original questions: going by this [comment](https://stackoverflow.com/questions/76394605/question-about-micro-op-fusion-related-with-rob-entry-occupation-and-micro-op-f#comment134753365_76394605), there may not be one fixed answer to 'immediate failure reason with macro-op'. Is that true? — zg c, Jun 14 '23 at 03:14
2. I don't know exactly why Intel can't micro-fuse a uop that has an immediate and a rip-relative operand. Maybe something about the uop format they use. Separate from that, some SIMD instructions with an immediate can't ever micro-fuse, even with simple addressing modes like `[rdi]`. For example, `vextractf128 [mem], ymm, imm8` on Intel is 2 separate uops for the front-end, for port 2/3 (store-address) and port 4 (store-data). Maybe the store-data uop has to encode the high-half extraction since there's no separate shuffle uop, but `movhps` is a pure store that can micro-fuse — Peter Cordes, Jun 14 '23 at 03:58
I understand the question now, although the second part may be difficult to answer definitively. Thanks. — zg c, Jun 15 '23 at 08:52
