On modern Intel<sup>1</sup> x86, are load uops freed from the RS (Reservation Station) at the point they dispatch<sup>2</sup>, or when they complete<sup>3</sup>, or somewhere in-between<sup>4</sup>?

<sup>1</sup> I am also interested in AMD Zen and sequels, so feel free to include that too, but for the purposes of making the question manageable I limit it to Intel. Also, AMD seems to have a somewhat different load pipeline from Intel which may make investigating this on AMD a separate task.

<sup>2</sup> Dispatch here means leave the RS for execution.

<sup>3</sup> Complete here means when the load data returns and is ready to satisfy dependent uops.

<sup>4</sup> Or even somewhere outside of the range of time defined by these two events, which seems unlikely but possible.

Peter Cordes
BeeOnRope
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/206639/discussion-on-question-by-beeonrope-are-load-ops-deallocated-from-the-rs-when-th). – Bhargav Rao Jan 25 '20 at 21:00
  • @PeterCordes and BeeOnRope a few questions about the chat: 1) re: L1/L2 cache line splits taking 2x + 1 cycles. Could it be a memory ordering thing? I.e., the CPU needs to make sure the two loads are consistent? 2) re: "So apparently the core spams the uops in case the load arrived in time for that cycle?" was this ever confirmed? BeeOnRope somewhat refuted it because it doesn't scale with L3 / RAM access but just want to confirm. Re: " instructions dependent on the load, that will dispatch 0 or 1 cycles after the load, are subject to replay" Would this scale for say... – Noah Jun 02 '21 at 06:24
  • `movl (rax), edx; leal (rdx), ecx; leal (rdx), edi; leal (rdx), esi`... On same ICL with 4 ports for `lea` would all 3 of the `lea` above be replayable? What if it's more uops than `RAT` bandwidth? 4) If the uops are not replayed in a loop is there an idea for when they will get redispatched? Is it only if there is no contention for the port (hopefully) or can it actually add extra bottlenecks? 5) Will replay always be on the same port the instruction was dispatched to? – Noah Jun 02 '21 at 06:29
  • @Noah: Presumably yes, if all the LEAs got scheduled to different ports when they first dispatched. (Also, please don't use mutant hybrids of Intel and AMD syntax. If you don't like `%` on reg names, use Intel syntax, and if you're trying to save space, don't use redundant operand-size suffixes like `leal`.) Obviously only 1 uop can be dispatched to a single port in a single cycle, and I don't think the RAT would dispatch another uop in the cycle *after* a load failed to arrive. And the oldest-ready-first scheduling rule still applies, this is just a new way to count as optimistically-ready – Peter Cordes Jun 02 '21 at 06:30
  • Is the RAT even involved in replays? I don't think the uop has to be renamed again, so I assumed it would be something downstream of that. I did a fair amount of investigation into replays but couldn't come up with a hard and fast rule. Almost always uops that could dispatch as soon as the load came back (e.g., all the `lea` in your example) would replay, but also uops that would dispatch a cycle later due to port conflicts and dependencies would often replay, and sometimes more than that. I couldn't come up with an exact bright line "horizon" in cycles from the load result where stuff \ – BeeOnRope Jun 03 '21 at 16:57
  • would replay: if I picked a specific number I found counter-examples on both sides. I can't remember if the same test repeated also showed variability or non-integer number of replays (averaged over many iterations), either. It is possible there is something involved in replay that operates at half frequency, or a structure where only a part of the structure is scanned each cycle, leading to variable replay behavior. – BeeOnRope Jun 03 '21 at 16:59
  • @BeeOnRope I think replay can also occur on `L1` hits if there is a dependency on the memory value that's not satisfied yet. [Came up when discussing an LLVM peephole.](https://reviews.llvm.org/D140087). – Noah Dec 20 '22 at 19:51
  • @Noah - interesting thread. In your test it is always the same location that is being RMW'd, right? This causes a lot of pressure on store-forwarding which I have also observed to cause a high replay count (or at least high uop count like in that thread which I assume is related to replay). BTW, although `btr` with memory destination (RMW) is pretty terrible and generally best avoided, just loading into a register and then doing the `btr` on the register and storing it back would probably be a big win here relative to the shift-and-RMW approach. – BeeOnRope Dec 22 '22 at 17:26
  • BTR (and related bit test ops) with RMW is a complex microcoded thing because of the semantics of the addressing which I think is unique to the bit ops: the bit index is essentially combined with the address to allow it to access arbitrary bytes, not just the 2/4/8 pointed to by the address argument. – BeeOnRope Dec 22 '22 at 17:36
  • @BeeOnRope re:**"In your test it is always the same location that is being RMW'd, right?"** Yes. Is it SF in general causing lots of replay? Or only if there is a more than 5c (L1 latency) dependency on the memory value? The latter would make sense, as the same mechanism causing replay for L1 misses would essentially apply. re:**"just loading into a register and then doing the btr on the register and storing it back would probably be a big win here relative to the shift-and-RMW approach."** Good point, I'll add that (still trying to get `btr` on memory for `atomic_{and/or/xor}` that gets \ – Noah Dec 23 '22 at 20:57
  • @BeeOnRope `cmpxchg` loop codegen. – Noah Dec 23 '22 at 20:58
  • @Noah wrote: _Is it SF in general causing lots of replay?_ I am not sure, though I think we can say that store forwarding falls into the type of scenario that causes replay: things with variable latency. When store forwarding occurs, the latency is different than L1 hit, almost always, and in the case of a STLF which resolves later than the earliest it can (i.e., where the store is not ready when the load first probes the queue) the latency is pretty much unknown (unlike say the L1 miss case where the CPU seems to "guess" that the latency will be that of an L2 hit). – BeeOnRope Dec 26 '22 at 17:39
  • So it is not really surprising to see replays there if we consider that replays are the primary way of handling variable-latency events. I saw them even in just the basic scenario (no need for more than 5c of latency as you asked), where stores and loads to a single location are interleaved. Keep in mind that the minimum STLF latency is 3 cycles, which is _less_ than the minimum L1 latency of 4 cycles. I guess the CPU uses the STLF predictor to also hint to the scheduler that the load will take 3 cycles rather than 4 or 5. – BeeOnRope Dec 26 '22 at 17:43

2 Answers

The following experiments suggest that the uops are deallocated at some point before the load completes. While this is not a complete answer to your question, it might provide some interesting insights.

On Skylake, there is a 33-entry reservation station for loads (see https://stackoverflow.com/a/58575898/10461973). This should also be the case for the Coffee Lake i7-8700K, which is used for the following experiments.

We assume that R14 contains a valid memory address.

clflush [R14]
clflush [R14+512]
mfence

# start measuring cycles

mov RAX, [R14]
mov RAX, [R14]
...
mov RAX, [R14]

mov RBX, [R14+512]

# stop measuring cycles

mov RAX, [R14] is unrolled 35 times. A load from memory takes at least about 280 cycles on this system. If the load uops stayed in the 33-entry reservation station until completion, the last load could only start after more than 280 cycles and would need another ~280 cycles, for a total of at least about 560 cycles. However, the total measured time for this experiment is only about 340 cycles. This indicates that the load uops leave the RS at some time before completion.
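
The arithmetic behind that argument can be written out as a short Python sketch. It is only a back-of-the-envelope model: the constants are the figures quoted above, and the two function names are made up for illustration.

```python
# Rough model of the independent-load experiment.
# The constants are taken from the answer text; the model is a simplification.
RS_LOAD_ENTRIES = 33   # load RS capacity cited for Skylake
MISS_LATENCY = 280     # approximate cycles for a load from memory
N_SAME_LINE = 35       # unrolled loads of [R14]

def lower_bound_if_held_until_completion():
    # Hypothesis A: RS entries are held until the load data returns.
    # With 35 loads ahead of it and only 33 entries, the final load
    # (to [R14+512]) cannot enter the RS until the first miss returns,
    # and it then needs a full miss of its own.
    assert N_SAME_LINE > RS_LOAD_ENTRIES
    return MISS_LATENCY + MISS_LATENCY

def estimate_if_freed_around_dispatch():
    # Hypothesis B: entries are freed around dispatch, so both misses can
    # be in flight together and the total is roughly one miss latency
    # (plus front-end and measurement overhead, which is not modeled).
    return MISS_LATENCY

print(lower_bound_if_held_until_completion())  # 560
print(estimate_if_freed_around_dispatch())     # 280
```

The measured ~340 cycles sits much closer to the second estimate than to the first.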

In contrast, the following experiment shows a case where most uops are forced to stay in the reservation station until the first load completes:

mov RAX, R14
mov [RAX], RAX
clflush [R14]
clflush [R14+512]
mfence

# start measuring cycles

mov RAX, [RAX]
mov RAX, [RAX]
...
mov RAX, [RAX]

mov RBX, [R14+512]

# stop measuring cycles

The first 35 loads now have dependencies on each other. The measured time for this experiment is about 600 cycles.

The experiments were performed with all but one core disabled, and with the CPU governor set to performance (cpupower frequency-set --governor performance).

Here are the nanoBench commands I used:

./nanoBench.sh -unroll 1 -basic -asm_init "clflush [R14]; clflush [R14+512]; mfence" -asm "mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RBX, [R14+512]"

./nanoBench.sh -unroll 1 -basic -asm_init "mov RAX, R14; mov [RAX], RAX; clflush [R14]; clflush [R14+512]; mfence" -asm "mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RBX, [R14+512]"
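
Since the two commands differ only in the repeated instruction and the init sequence, they can be generated instead of typed out. The helper below is a hypothetical convenience sketch (the name `nanobench_command` is not part of nanoBench); it only uses the options that already appear in the commands above.

```python
def nanobench_command(init, body, repeats, tail):
    """Build a nanoBench invocation that repeats `body` `repeats` times, then runs `tail`."""
    asm = "; ".join([body] * repeats + [tail])
    return f'./nanoBench.sh -unroll 1 -basic -asm_init "{init}" -asm "{asm}"'

# Independent loads: every load reads the same flushed line.
independent = nanobench_command(
    "clflush [R14]; clflush [R14+512]; mfence",
    "mov RAX, [R14]", 35, "mov RBX, [R14+512]")

# Dependent loads: each load's address comes from the previous load.
dependent = nanobench_command(
    "mov RAX, R14; mov [RAX], RAX; clflush [R14]; clflush [R14+512]; mfence",
    "mov RAX, [RAX]", 35, "mov RBX, [R14+512]")

print(independent)
print(dependent)
```

Changing `repeats` makes it easy to try unroll counts just below and above the RS size.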

Andreas Abel
  • Thanks Andreas. I am not ignoring this, I just haven't had time to absorb it completely yet. – BeeOnRope Feb 04 '20 at 23:13
  • @BeeOnRope have you absorbed it? Any ideas on why the independent loads free up earlier? – Noah Jun 02 '21 at 05:45
  • @Noah - yes. This and the other answer seem fairly convincing. As for "why" you'd want to do this: well it frees up the RS entries sooner and potentially allows other uops (probably load uops) to start w/o needing to wait for RS entries held by long-running cache misses. I just didn't think it worked like that, probably because of a misunderstanding of how replay worked. – BeeOnRope Jun 03 '21 at 07:46

Just came across this question. Here is my attempt at an answer.

Short Answer: I'm still a bit uncertain about some parts but based on some measurements using various performance counters along with performance monitoring interrupts, it "looks like" the load uop gets removed from RS during the same cycle it is dispatched to load ports or at least very shortly afterwards.

Details: A while ago I tried writing a kernel module which mimics the ideas here. The blog post linked describes the idea really well so I won't explain it in detail here. The main idea is to trigger a performance monitoring interrupt after a set number of cycles have elapsed, freeze all counter values (currently tracked), store them and reset/repeat. Doing this for 1, 2, ... n cycles gives us some picture of what is going on micro-architecturally at the cycle granularity. How accurate of a picture is a different story... The source for the kernel module I used for measuring can be found here.

Long Answer: I profiled the code below using the kernel module mentioned above on an i7-1065G7 (Ice Lake) and tracked 11 different performance counters. Prior to the mov instruction being profiled, clflush was called on the address stored in r8. This was done so that the load would take long enough to make it easy to tell whether the uop was removed from the RS before, after or during execution (otherwise the load completes in about 4 cycles). In total I measured up to 600 cycles, with most of the events of interest for this question happening within 65 cycles. To account for noise I ran 1024 trials for each cycle and stored the counter value that occurred most often. Luckily, for each cycle in the chart below and each counter, at most a single trial deviated, with the remaining 1023 trials giving the same counter value.
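
The noise-filtering step (keep the value that occurred most often across trials) is just a mode over the samples. Here is a minimal Python sketch of that step; the function name and the sample data are hypothetical and this is not code from the kernel module.

```python
from collections import Counter

def most_common_value(samples):
    """Return the counter value seen most often across repeated trials."""
    value, _count = Counter(samples).most_common(1)[0]
    return value

# Hypothetical data: 1023 trials agree and a single trial is perturbed.
trials = [46] * 1023 + [47]
print(most_common_value(trials))  # 46
```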

 563:   0f 30                   wrmsr  
 565:   4d 8b 00                mov    (%r8),%r8
 568:   0f ae f0                mfence 
 56b:   0f ae e8                lfence

The counters tracked are listed below. Descriptions are summarized from Intel SDM.

  INST_RETIRED_ANY_P:          To track when wrmsr retired
  RS_EVENTS_EMPTY_CYCLES:      Count of cycles RS is empty
  UOPS_DISPATCHED_PORT_PORT_0: # uops dispatched to port 0
  UOPS_DISPATCHED_PORT_PORT_1: # uops dispatched to port 1 
  UOPS_DISPATCHED_PORT_2_3:    # uops dispatched to port 2,3 (load addr ports)
  UOPS_DISPATCHED_PORT_4_9:    # uops dispatched to port 4,9 (store data ports)
  UOPS_DISPATCHED_PORT_PORT_5: # uops dispatched to port 5
  UOPS_DISPATCHED_PORT_PORT_6: # uops dispatched to port 6
  UOPS_DISPATCHED_PORT_7_8:    # uops dispatched to port 7,8 (store addr ports)
  UOPS_EXECUTED_THREAD:        # uops executed
  UOPS_ISSUED_ANY:             # uops sent to RS from RAT

The table below lists each counter value at each cycle. Based on the table, one uop is sent to the RS at cycle 47 and occupies the RS for cycles 51-54. This is presumably the load uop. At cycle 54, RS_EVENTS_EMPTY_CYCLES and UOPS_DISPATCHED_PORT_2_3 increment, which means (at least as I interpret it) that the load uop has been dispatched and freed from the RS.

What I'm not sure about is that at cycle 52 three more uops are issued to the RS. They seem to arrive and occupy the RS for cycles 55-58, but only two uops are dispatched to the execution ports before the RS is emptied. Regardless, by cycle 59 the RS is empty (the count increases each cycle from then on). The load completes and the mov retires about 500 cycles later.

+-------+--------------+-----------------+--------+--------+----------+----------+--------+--------+----------+---------------+-------------------+------------------------+
| Cycle | Inst Retired | Cycles RS Empty | Port 0 | Port 1 | Port 2,3 | Port 4,9 | Port 5 | Port 6 | Port 7,8 | uops executed | uops issued to RS |        Comments        |
+-------+--------------+-----------------+--------+--------+----------+----------+--------+--------+----------+---------------+-------------------+------------------------+
|     1 |            0 |               3 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 0 |                        |
|     2 |            0 |               4 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 0 |                        |
|     3 |            0 |               5 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 0 |                        |
|     4 |            0 |               6 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 | 2 uops issued          |
|     5 |            0 |               7 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|     6 |            0 |               8 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|     7 |            0 |               9 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|     8 |            0 |              10 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|     9 |            0 |              11 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|    10 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|    11 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|    12 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|    13 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|    14 |            0 |              13 |      0 |      0 |        0 |        0 |      0 |      1 |        0 |             3 |                 2 |                        |
|    15 |            0 |              14 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             3 |                 2 | 2 uops dispatched      |
|    16 |            0 |              15 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             4 |                 2 |                        |
|    17 |            0 |              16 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 | 2 uops executed        |
|    18 |            0 |              17 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
|    19 |            0 |              18 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
|    20 |            0 |              19 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
|    21 |            0 |              20 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
|    22 |            0 |              21 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
|    23 |            0 |              22 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 5 |                        |
|    24 |            0 |              23 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 | 4 uops issued          |
|    25 |            0 |              24 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    26 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    27 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    28 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    29 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    30 |            0 |              25 |      0 |      1 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    31 |            0 |              26 |      0 |      1 |        0 |        0 |      0 |      3 |        0 |             5 |                 6 |                        |
|    32 |            0 |              27 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             6 |                 6 |                        |
|    33 |            0 |              28 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             7 |                 6 |                        |
|    34 |            0 |              29 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 | 3 uops executed        |
|    35 |            0 |              30 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    36 |            1 |              31 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 | wrmsr retired          |
|    37 |            1 |              32 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    38 |            1 |              33 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    39 |            1 |              34 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    40 |            1 |              35 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    41 |            1 |              36 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    42 |            1 |              37 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    43 |            1 |              38 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    44 |            1 |              39 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    45 |            1 |              40 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    46 |            1 |              41 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    47 |            1 |              42 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    48 |            1 |              43 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 | 1 uop issued           |
|    49 |            1 |              44 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 |                        |
|    50 |            1 |              45 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 |                        |
|    51 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 |                        |
|    52 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                10 | 3 uops issued          |
|    53 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                10 |                        |
|    54 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                10 | port 2,3 load addr     |
|    55 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             8 |                10 |                        |
|    56 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             8 |                10 | executing load         |
|    57 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             9 |                10 |                        |
|    58 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             9 |                10 | port 4,9 store data    |
|    59 |            1 |              48 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |             9 |                10 | port 7,8 store address |
|    60 |            1 |              49 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |             9 |                10 |                        |
|    61 |            1 |              50 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 | 2 uops executed        |
|    62 |            1 |              51 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
|    63 |            1 |              52 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
|    64 |            1 |              53 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
|    65 |            1 |              54 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
+-------+--------------+-----------------+--------+--------+----------+----------+--------+--------+----------+---------------+-------------------+------------------------+

So based on the table it looks like the load uop is removed from the RS either at the same time as it is dispatched to the load port or a couple of cycles later. I did some sanity checking of the values in the chart, and for the most part the counter values make sense. There are two things I haven't figured out: 4 uops are issued to the RS (cycle 24) but only 3 get executed (cycle 35). Similarly, 3 uops are issued at cycle 52, but only 2 are executed (cycle 61).
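
Because the counters in the table are cumulative, the interesting events show up as row-to-row differences. The Python sketch below copies rows 47-59 of three columns and prints where each one changes. Treat it as a reading aid only: whether a change first visible in row N happened in cycle N or cycle N-1 depends on when the counters are sampled, which I am assuming rather than asserting.

```python
# Rows 47-59 of the table: cumulative counter values per sampled cycle.
cycles   = list(range(47, 60))
issued   = [6, 7, 7, 7, 7, 10, 10, 10, 10, 10, 10, 10, 10]     # uops issued to RS
port_2_3 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]             # load port dispatches
rs_empty = [42, 43, 44, 45, 46, 46, 46, 46, 47, 47, 47, 47, 48]  # cycles RS empty

def deltas(values):
    """Map each row (from the second one on) to its change versus the previous row."""
    return {c: cur - prev for c, prev, cur in zip(cycles[1:], values, values[1:])}

# Rows where new uops were issued, and how many.
issue_events = {c: d for c, d in deltas(issued).items() if d}
# Rows where a uop was dispatched to a load port.
load_dispatch = {c: d for c, d in deltas(port_2_3).items() if d}
# Rows where the "RS empty" count did not advance, i.e. the RS held something.
rs_occupied = [c for c, d in deltas(rs_empty).items() if d == 0]

print(issue_events)   # {48: 1, 52: 3}
print(load_dispatch)  # {55: 1}
print(rs_occupied)    # [52, 53, 54, 56, 57, 58]
```

Read this way, the single issued uop keeps the RS occupied for three sampled rows and the RS-empty count resumes in the same row where the port 2,3 dispatch appears.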

Thanks

bsghost
  • This is brilliant. – BeeOnRope Jun 03 '21 at 07:46
  • "a couple cycles later" would be consistent with how we think optimistic dispatch works, for uops whose input comes from a load. The RS dispatches in the cycle when the load result will be on the bypass-forwarding bus, *if* the load hit in L2 cache (after it already missed in L1d). If the data doesn't arrive then, that uop will have to get replayed again later, when the load eventually does complete. – Peter Cordes Jun 03 '21 at 07:58
  • (The cache-miss load itself doesn't need replaying; it already left the RS and the load buffer is tracking it. Just uops that dispatched in anticipation of it completing, so you know within a cycle or two whether uops got their data and were successfully dispatched, or whether they didn't and you need to keep them in the RS to dispatch again later.) – Peter Cordes Jun 03 '21 at 07:58