Dispatching uops from a Frontend to an appropriate execution port

Question

I'm trying to understand the execution flow of uops after they leave the front end and before they are dispatched to an appropriate execution port.

I currently have the following mental model of it:

Front-End (Fetch, Decode, Micro/Macro-Fuse)
   |
   |
Renamer (Detects dependency chains,
   |      Eliminates WAR, WAW hazards,
   |      Allocates resources like SB or LB,
   |      Binds the uop to some execution port)
   |
   |
   |______ROB (Holds uop until it is fully completed)
   |
   |
   |
  RS aka Scheduler (Wait until all source operands are ready)
   |
 Execution ports

What I do not understand is how the Intel Optimization Manual, section 2.5.3.1, describes it:

The Renamer is the bridge between the in-order part in Figure 2-10, and the dataflow world of the Scheduler. It moves up to four micro-ops every cycle from the micro-op queue to the out-of-order engine. Although the renamer can send up to 4 micro-ops (unfused, micro-fused, or macro-fused) per cycle, this is equivalent to the issue port can dispatch six micro-ops per cycle. In this process, the out-of-order core carries out the following steps:

Renames architectural sources and destinations of the micro-ops to micro-architectural sources and destinations.

Allocates resources to the micro-ops. For example, load or store buffers.

Binds the micro-op to an appropriate dispatch port.

But the Scheduler (aka RS) is described as follows at 2.5.3.2:

The scheduler controls the dispatch of micro-ops onto their execution ports. In order to do this, it must identify which micro-ops are ready and where its sources come from: a register file entry, or a bypass directly from an execution unit. Depending on the availability of dispatch ports and writeback buses, and the priority of ready micro-ops, the scheduler selects which micro-ops are dispatched every cycle.

QUESTION: Is the port for a uop selected by the Renamer, with the RS (aka Scheduler) then dispatching the uop to that pre-selected port as soon as all of its operands are ready?

I measured the port distribution for a memory copy routine based on vmovdqu and got an almost uniform distribution:

13 493 383 038      uops_dispatched_port.port_2
13 494 860 751      uops_dispatched_port.port_3

It is not clear how the Renamer alone can achieve this. It does not know when all operands of a uop will become ready, so it is hard to see how it can choose a port in a way that yields a uniform uop distribution.

You may have seen [How are x86 uops scheduled, exactly?](https://stackoverflow.com/q/40681331/555045), did that not answer this question? — harold, Feb 14 '20 at 12:07
The renamer can achieve a uniform distribution by simply alternating between the two ports. The problem is that uniform is not always optimal :) Some uops may have to wait even if the other port is free. — Margaret Bloom, Feb 14 '20 at 12:08
@MargaretBloom Simply alternating between two ports does not seem to be a way to achieve a uniform distribution. After a uop enters the RS it has to wait until its operands are ready, which is unpredictable. I can imagine the following scenario: 2 load uops are bound by the renamer to the same port (say p3), then their operands become ready at the same time, which results in one uop being scheduled 1c later even though the other port (p2) might be free. — St.Antario, Feb 14 '20 at 12:17
@harold Yes, I read that question before, but at first I did not find the answer to my much more trivial question there. The [patent](https://patents.google.com/patent/US5689674) mentioned in the answer seems to be a good lead for further research. The problem is that the first sentence of the abstract is _A method and apparatus for binding instructions to dispatch ports in a reservation station includes a counter mechanism and a port identifier_, which suggests that the binding is done in the RS, not the Renamer. — St.Antario, Feb 14 '20 at 12:20
@MargaretBloom: The TL;DR is that the uop allocator counts issued and completed uops so it knows (by dead reckoning) how many uops are currently in the RS (or were a cycle or two ago), and thus can bias itself towards the possible port with fewer in-flight uops left to execute. Or simply always pick the port with the shortest queue. Something like that; it's been a while since I looked in that level of detail at the answer to the duplicate that harold linked. — Peter Cordes, Feb 14 '20 at 12:21
@St.Antario Exactly, but you are asking about a *uniform distribution of the uops* between the ports (you counted the uops in your test), not about the optimal distribution of the load. The former is not optimal, as you have highlighted. Long story short, AFAIR, the uops are bound at issue time with a heuristic which is not always optimal, but I've never looked into it thoroughly. — Margaret Bloom, Feb 14 '20 at 13:27
@PeterCordes I've always pictured the renamer as doing pretty much what you said: keeping a counter of in-flight uops per port. I've always found it more interesting to know why the uops can't be assigned to ports by the scheduler. — Margaret Bloom, Feb 14 '20 at 13:31
@MargaretBloom: I assume that would significantly increase the complexity of the logic that picks up to 1 uop for each port per cycle from the whole RS. With them pre-scheduled, you probably just match the first uop with that port number (out of the ones where an inputs-ready match is true). Otherwise, dispatch of a hypothetical p15 uop depends on which other uops are ready that cycle: p0156, p1, p5? The current alloc/rename stage also isn't on a latency-critical path. (And fun fact: Skylake's RS isn't strictly unified anymore. BeeOnRope found or created some info about that; I forget where.) — Peter Cordes, Feb 14 '20 at 13:48
@MargaretBloom: Unless you have *good* last-minute dispatching that can avoid having port 5 grab the first p0156 uop it sees, even if there are lots of p5 uops in flight, it's better to do it during allocation. Presumably good-enough last-minute uop allocation would be prohibitively expensive in power, or at least not worth the power cost. Instead of getting scheduled once, every uop would have to get considered multiple times. (You should post your question as a real question so I can post these comments as an answer; it's a good question, one that I used to wonder about myself.) — Peter Cordes, Feb 14 '20 at 13:54
@PeterCordes Thanks for your useful insight. Yeah, that would make a good question. I think I'll ask it later, as soon as I have 15 min free. If you have time, you can go ahead and ask it yourself; just let me know so we don't step on each other's toes :) — Margaret Bloom, Feb 14 '20 at 13:57
@MargaretBloom: if I start writing one, I'll leave a comment here. Otherwise will keep an eye out for your version of it. — Peter Cordes, Feb 14 '20 at 14:09
@MargaretBloom _I've always found more interesting to know why can't the uops be assigned to ports by the scheduler._ This is exactly what I was confused about. Before carefully reading the Intel docs, my mental model was that the Scheduler chooses the port to assign uops to. — St.Antario, Feb 14 '20 at 15:01
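
To make the heuristic discussed in the comments concrete, here is a minimal Python sketch of the idea: the allocator binds each uop to the allowed port with the fewest in-flight uops, and the scheduler later dispatches, per port, the oldest ready uop already bound to that port. This is a toy model under stated assumptions, not Intel's actual algorithm. The function name `simulate`, the `(allowed_ports, ready_cycle)` input format, the lowest-port tie-break, and the unlimited RS size are all hypothetical choices made for illustration.

```python
from collections import Counter

ISSUE_WIDTH = 4  # up to four uops per cycle, per the manual text quoted in the question


def simulate(uops):
    """Toy model of allocation-time port binding (an assumption, not Intel's logic).

    `uops` is a list of (allowed_ports, ready_cycle) tuples in program order,
    where ready_cycle is the cycle at which the uop's operands become ready.
    Returns a Counter mapping port number to the number of uops dispatched.
    """
    in_flight = Counter()   # per port: uops bound but not yet dispatched
    rs = []                 # RS entries (port, ready_cycle), oldest first; no size limit
    dispatched = Counter()
    next_uop = 0
    cycle = 0
    while next_uop < len(uops) or rs:
        # Scheduler: each port takes the oldest ready uop that is bound to it.
        busy_ports = set()
        remaining = []
        for port, ready_cycle in rs:
            if port not in busy_ports and ready_cycle <= cycle:
                busy_ports.add(port)
                dispatched[port] += 1
                in_flight[port] -= 1
            else:
                remaining.append((port, ready_cycle))
        rs = remaining

        # Allocator: bind each new uop to the allowed port with the fewest
        # in-flight uops (ties go to the lowest port number).
        for allowed_ports, ready_cycle in uops[next_uop:next_uop + ISSUE_WIDTH]:
            port = min(sorted(allowed_ports), key=lambda p: in_flight[p])
            in_flight[port] += 1
            rs.append((port, ready_cycle))
        next_uop += ISSUE_WIDTH
        cycle += 1
    return dispatched
```

For a stream of load uops that can all run on p2 or p3 and are ready immediately, e.g. `simulate([((2, 3), 0)] * 1000)`, this model binds the uops to the two ports alternately and dispatches 500 to each. Note that the binding decision never looks at `ready_cycle`, which is the point raised in the comments: the allocator balances counts, not actual readiness, so a uop can still wait on its bound port while the other port is idle.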
