12

In the case that a load overlaps two earlier stores (and the load is not fully contained in the oldest store), can modern Intel or AMD x86 implementations forward from both stores to satisfy the load?

For example, consider the following sequence:

mov [rdx + 0], eax
mov [rdx + 2], eax
mov ax, [rdx + 1]

The final 2-byte load takes its second byte from the immediately preceding store, but its first byte from the store before that. Can this load be store-forwarded, or does it need to wait until both prior stores commit to L1?

Note that by store-forwarding here I'm including any mechanism that can satisfy the reads from stores still in the store buffer, rather than waiting for them to commit to L1, even if it is slower than the best-case "forwards from a single store" path.
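
For contrast, here's a sketch of the case I'd expect to take the well-known fast forwarding path, where the load is fully contained in the single most recent store (the registers and offsets are just illustrative):

mov [rdx + 0], eax    ; 4-byte store covering bytes 0-3
mov bx, [rdx + 1]     ; 2-byte load of bytes 1-2, fully contained in the store above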

BeeOnRope
  • Warning: Your use of 16-bit operands probably causes you to take a length-changing prefix penalty on decode, IIRC. – Iwillnotexist Idonotexist Sep 10 '17 at 01:55
    @IwillnotexistIdonotexist: The operand-size prefix is only length-changing for instructions with a 16-bit immediate (which would have been a 32-bit immediate without the prefix). So `add cx, 127` (`66 opcode modrm imm8`) is fine, `add cx, 128` (`66 opcode modrm imm16`) is not. Also note that recent Intel CPUs don't LCP-stall on `mov`-immediate, only with other ALU instructions. (And also that LCP stalls only hurt decode, not the uop cache). – Peter Cordes Sep 10 '17 at 20:16
  • @PeterCordes Ah! So I definitely _don't_ recall correctly :-) It used to be a bigger thing on Core 2, and I still have a Penryn machine. – Iwillnotexist Idonotexist Sep 10 '17 at 20:27
  • FWIW, I went with a 16-bit load just so it would be fully contained in _both_ prior stores, whereas a 32-bit load might introduce yet another complication (perhaps not?) because it isn't fully contained in either store (but it is contained in their combination). – BeeOnRope Sep 11 '17 at 19:46

2 Answers

20

No.

At least, not on Haswell, Broadwell or Skylake processors. On other Intel processors, the restrictions are either similar (Sandy Bridge, Ivy Bridge) or even tighter (Nehalem, Westmere, Pentium Pro/II/III/4). On AMD, similar limitations apply.

From Agner Fog's excellent optimization manuals:

Haswell/Broadwell

The microarchitecture of Intel and AMD CPUs

§ 10.12 Store forwarding stalls

The processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding works in the following cases:

  • When a write of 64 bits or less is followed by a read of the same size and the same address, regardless of alignment.
  • When a write of 128 or 256 bits is followed by a read of the same size and the same address, fully aligned.
  • When a write of 64 bits or less is followed by a read of a smaller size which is fully contained in the write address range, regardless of alignment.
  • When an aligned write of any size is followed by two reads of the two halves, or four reads of the four quarters, etc. with their natural alignment within the write address range.
  • When an aligned write of 128 bits or 256 bits is followed by a read of 64 bits or less that does not cross an 8 bytes boundary.

A delay of 2 clocks occur if the memory block crosses a 64-bytes cache line boundary. This can be avoided if all data have their natural alignment.

Store forwarding fails in the following cases:

  • When a write of any size is followed by a read of a larger size
  • When a write of any size is followed by a partially overlapping read
  • When a write of 128 bits is followed by a smaller read crossing the boundary between the two 64-bit halves
  • When a write of 256 bits is followed by a 128 bit read crossing the boundary between the two 128-bit halves
  • When a write of 256 bits is followed by a read of 64 bits or less crossing any boundary between the four 64-bit quarters

A failed store forwarding takes 10 clock cycles more than a successful store forwarding. The penalty is much higher - approximately 50 clock cycles - after a write of 128 or 256 bits which is not aligned by at least 16.

Emphasis added
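
To make that concrete, here is a sketch (registers and offsets chosen only for illustration) of one sequence that falls under the forwarding rules above and one that falls under the failure rules, mirroring the pattern in the question:

; should forward: smaller load fully contained in the most recent store
mov [rdi], rax          ; 8-byte store covering bytes 0-7
mov ecx, [rdi + 2]      ; 4-byte load of bytes 2-5, fully inside the store, any alignment

; should fail: the load partially overlaps the most recent store (~10 cycles extra)
mov [rdi], eax          ; 4-byte store covering bytes 0-3
mov [rdi + 4], ebx      ; 4-byte store covering bytes 4-7
mov ecx, [rdi + 2]      ; 4-byte load of bytes 2-5 straddles both stores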

Skylake

The microarchitecture of Intel and AMD CPUs

§ 11.12 Store forwarding stalls

The Skylake processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding is one clock cycle faster than on previous processors. A memory write followed by a read from the same address takes 4 clock cycles in the best case for operands of 32 or 64 bits, and 5 clock cycles for other operand sizes.

Store forwarding has a penalty of up to 3 clock cycles extra when an operand of 128 or 256 bits is misaligned.

A store forwarding usually takes 4 - 5 clock cycles extra when an operand of any size crosses a cache line boundary, i.e. an address divisible by 64 bytes.

A write followed by a smaller read from the same address has little or no penalty.

A write of 64 bits or less followed by a smaller read has a penalty of 1 - 3 clocks when the read is offset but fully contained in the address range covered by the write.

An aligned write of 128 or 256 bits followed by a read of one or both of the two halves or the four quarters, etc., has little or no penalty. A partial read that does not fit into the halves or quarters can take 11 clock cycles extra.

A read that is bigger than the write, or a read that covers both written and unwritten bytes, takes approximately 11 clock cycles extra.

Emphasis added
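
As an illustration of the halves/quarters rule, assuming rdi is 16-byte aligned, a sketch like the following would hit the cheap case for the two half-loads and the ~11-cycle case for the load that straddles the halves:

movaps [rdi], xmm0      ; aligned 16-byte store (rdi assumed 16-byte aligned)
mov rax, [rdi]          ; low 8-byte half: little or no penalty
mov rdx, [rdi + 8]      ; high 8-byte half: little or no penalty
mov rcx, [rdi + 4]      ; 8-byte load crossing the two halves: ~11 clock cycles extra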

In General:

A common point across microarchitectures that Agner Fog's document points out is that store forwarding is more likely to happen if the write was aligned and the reads are halves or quarters of the written value.

A Test

A test with the following tight loop:

mov [rsp-16], eax
mov [rsp-12], ebx
mov ecx, [rsp-15]

shows that the ld_blocks.store_forward PMU counter does indeed increment. This event is documented as follows:

ld_blocks.store_forward [This event counts how many times the load operation got the true Block-on-Store blocking code preventing store forwarding. This includes cases when:

  • preceding store conflicts with the load (incomplete overlap)
  • store forwarding is impossible due to u-arch limitations
  • preceding lock RMW operations are not forwarded
  • store has the no-forward bit set (uncacheable/page-split/masked stores)
  • all-blocking stores are used (mostly, fences and port I/O)]

This indicates that store forwarding does indeed fail when a read only partially overlaps the most recent earlier store (even if it is fully covered once earlier stores are also considered).
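
If you want to reproduce this, a loop along the following lines (a sketch; the iteration count, registers and stack offsets are arbitrary) can be run under perf stat -e ld_blocks.store_forward, and the counter should increment roughly once per iteration:

mov ebp, 100000000      ; arbitrary iteration count
top:
mov [rsp - 16], eax     ; 4-byte store covering bytes rsp-16 .. rsp-13
mov [rsp - 12], ebx     ; 4-byte store covering bytes rsp-12 .. rsp-9
mov ecx, [rsp - 15]     ; 4-byte load of rsp-15 .. rsp-12: partially overlaps both stores
dec ebp
jnz top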

BeeOnRope
Iwillnotexist Idonotexist
    Brilliant answer. – Margaret Bloom Sep 10 '17 at 11:25
  • One issue is that Agner seems to arbitrarily draw a line at which store forwarding "fails" - but in fact there seem to be several different latency values for different types of misalignments. For example, who is to say the 10 or 11 clock case is a store-forwarding _failure_ versus just a really long stall associated with a more complicated, but still successful, forwarding scenario? Or perhaps he was able to use PMU counters to actually determine true failure/success? I don't find any such counters on Skylake though... – BeeOnRope Sep 11 '17 at 19:42
    @BeeOnRope A fair question, but one that's possible to answer by looking at the counter. I wrote a tight loop of `mov [rsp-16], eax; mov [rsp-12], ebx; mov ecx, [rsp-15]` and the `ld_blocks.store_forward` counter increments. So Intel, at least, considers a search through the store-buffer to be a failure of store-forwarding, yet it's absolutely certain that the last two entries in the store-buffer will be enough to compute the load value. – Iwillnotexist Idonotexist Sep 11 '17 at 20:10
  • Huh, cool - and intel does use the word "fail" in their documentation of the event, so I guess that more or less settles it! – BeeOnRope Sep 11 '17 at 20:17
  • @IwillnotexistIdonotexist - I added your test to the answer, and included the doc for that event from `pmu-tools` (which gets it from 01.org event listings). What CPU did you test it on? – BeeOnRope Sep 11 '17 at 20:21
    @BeeOnRope Danke! Was about to edit that in myself but you beat me to it! EDIT: Haswell i7-4700MQ – Iwillnotexist Idonotexist Sep 11 '17 at 20:21
    It's not so much "halves or quarters" of the written value, it's crossing 8-byte boundaries relative to the written value. Notice that a 64-bit store can forward to any fully overlapping 16-bit load. And this is just for cases where the store-forwarding is near-maximum efficiency. The worst-case mentioned is only 11 cycles, not the store-queue flush that would be required to commit to L1D (see discussion on my answer; that's what Bee was really trying to ask about.) – Peter Cordes Sep 11 '17 at 21:30
13

Related: What are the costs of failed store-to-load forwarding on x86? has some more detail on multiple SF stalls not being handled in parallel, but successful SF can happen while an SF stall is in flight.


In-order Atom may be able to do this store-forwarding without stalling at all.

Agner Fog doesn't mention this case specifically for Atom, but unlike all other CPUs, it can store-forward with 1c latency from a store to a wider or differently-aligned load. The only exception Agner found was on cache-line boundaries, where Atom is horrible (16 cycle penalty for a CL-split load or store, even when store-forwarding isn't involved).


Can this load be store-forwarded, or does it need to wait until both prior stores commit to L1?

There's a terminology issue here. Many people will interpret "Can this load be store-forwarded" as asking if it can happen with as low latency as when all the requirements are met for fast-path store-forwarding, as listed in @IWill's answer. (Where all the loaded data comes from the most recent store to overlap any of the load, and other relative/absolute alignment rules are met).

I thought at first that you were missing the third possibility, of slower but still (nearly?) fixed latency forwarding without waiting for commit to L1D, e.g. with a mechanism that scrapes the whole store buffer (and maybe loads from L1D) in cases that Agner Fog and Intel's optimization manual call "store forwarding failure".

But now I see this wording was intentional, and you really do want to ask whether or not the third option exists.

You might want to edit some of this into your question. In summary, the three likely options for Intel x86 CPUs are:

  1. Intel/Agner definition of store-forwarding success, where all the data comes from only one recent store with low and (nearly) fixed latency.

  2. Extra (but limited) latency to scan the whole store buffer and assemble the correct bytes (according to program-order), and (if necessary or always?) load from L1D to provide data for any bytes that weren't recently stored.

    This is the option we aren't sure exists.

    It also has to wait for all data from store-data uops that don't have their inputs ready yet, since it has to respect program order. There may be some information published about speculative execution with unknown store-address (e.g. guessing that they don't overlap), but I forget.

  3. Wait for all overlapping stores to commit to L1D, then load from L1D.

    Some real x86 CPUs might fall back to this in some cases, but they might always use option 2 without introducing a StoreLoad barrier. (Remember that x86 stores have to commit in program order, and loads have to happen in program order. This would effectively drain the store buffer to this point, like mfence, although later loads to other addresses could still speculatively store-forward or just take data from L1D.)


Evidence for the middle option:

The locking scheme proposed in Can x86 reorder a narrow store with a wider load that fully contains it? would work if store-forwarding failure required a flush to L1D. Since it doesn't work on real hardware without mfence, that's strong evidence that real x86 CPUs are merging data from the store buffer with data from L1D. So option 2 exists and is used in this case.
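
Roughly, the scheme from that question looks like the following sketch (flags is a hypothetical 2-byte shared location; each thread sets its own byte and then reads both bytes with one wider load):

; Thread 1
mov byte [flags], 1         ; set my flag (low byte of the 2-byte flags word)
movzx eax, word [flags]     ; read both flags with a single 2-byte load

; Thread 2
mov byte [flags + 1], 1     ; set my flag (high byte)
movzx eax, word [flags]     ; read both flags with a single 2-byte load

If the wide load had to wait for the overlapping store to commit to L1D, at least one thread would be guaranteed to see the other's flag already set. On real hardware without mfence, each thread's load can take its own byte from its store buffer and the other byte from L1D (still 0), so both threads can observe only their own flag set and both can enter the critical section.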

See also Linus Torvalds' explanation that x86 really does allow this kind of reordering, in response to someone else who proposed the same locking idea as that SO question.

I haven't tested whether store-forwarding failure/stall penalties are variable, but if not, that strongly implies that it falls back to checking the whole store buffer when the best-case forwarding doesn't work.

Hopefully someone will answer What are the costs of failed store-to-load forwarding on x86?, which asks exactly that. I will if I get around to it.

Agner Fog only ever mentions a single number for store-forwarding penalties, and doesn't say it's bigger if cache-miss stores are in flight ahead of the stores that failed to forward. (This would cause a big delay, because stores have to commit to L1D in order because of x86's strongly-ordered memory model.) He also doesn't say anything about the penalty being different when the data comes from 1 store + L1D vs. from parts of two or more stores, so I'd guess that it works in this case, too.


I suspect that "failed" store-forwarding is common enough that it's worth the transistors to handle it faster than just flushing the store queue and reloading from L1D.

For example, gcc doesn't specifically try to avoid store-forwarding stalls, and some of its idioms cause them (e.g. __m128i v = _mm_set_epi64x(a, b); in 32-bit code stores/reloads to the stack, which is already the wrong strategy on most CPUs for most cases, hence that bug report). It's not good, but the results aren't usually catastrophic, AFAIK.
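
For example, in 32-bit code __m128i v = _mm_set_epi64x(a, b); may compile to something roughly like the following (a sketch of the pattern, not exact gcc output; the register assignments are illustrative and the stack slot is assumed 16-byte aligned), where the 16-byte reload partially overlaps each of the four 4-byte stores, so it can't take the fast forwarding path:

mov [esp],      eax     ; low 32 bits of b
mov [esp + 4],  edx     ; high 32 bits of b
mov [esp + 8],  ecx     ; low 32 bits of a
mov [esp + 12], ebx     ; high 32 bits of a
movdqa xmm0, [esp]      ; 16-byte reload spans all four stores -> store-forwarding stall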

Dada
Peter Cordes
  • Just to be clear, what is the third option in "Can this load be store-forwarded, or does it need to wait until both prior stores commit to L1?" Note that IMO store-forwarding means that the load is satisfied from the store buffer, but that's not restricted to a single buffer. So I consider the case where the load is satisfied from multiple prior buffered stores a case of store-forwarding (yes, it may be much slower). Now, this might not be the right definition, but it's implicit in the question title. – BeeOnRope Sep 11 '17 at 19:36
  • @BeeOnRope: oh hmm, yeah there's a terminology problem. @ Iwill's "No" answer is correct if we mean "store-forwarded with the most efficient mechanism", and people often say "store forwarding failure" to mean that not happening. But now that I re-read your question, I see that's not what you were asking. – Peter Cordes Sep 11 '17 at 20:40
  • Yes, it's mostly a question of terminology, but for my question I am drawing the line at needing to commit to L1 or not. The difference between that and some kind of slower-but-still-comes-from-the-store-buffer approach may be huge if the stores miss to RAM and then subsequent loads hit them (in an overlapping fashion). – BeeOnRope Sep 11 '17 at 20:47
  • @BeeOnRope: Yes, exactly. Good question. Working on an update; I have some evidence that it doesn't have to commit to L1D. – Peter Cordes Sep 11 '17 at 20:48
  • @BeeOnRope: updated my answer. Reordering narrow stores with wider loads is nearly bulletproof evidence that store-forwarding *sometimes* still happens, and Agner talks about the latency / penalty for that being the same as the penalty for multiple overlapping stores. – Peter Cordes Sep 11 '17 at 21:23
  • @PeterCordes a bit out of scope, but how does a store-forward work in the following case (at&t): ```test reg0, reg0; jz NEXT; movl reg1, (reg2); NEXT:; movl (reg2), reg3```. How would that not cause a stall? The load meets the criteria for store-forwarding but it will have to be rolled back once the store is. Assuming ```reg0``` is 0. Do the loads inherently have to stall until the store is graduated? – Noah Feb 18 '21 at 07:19
    @Noah: You mean if the branch mispredicts? Rollback to a previous snapshot of RAT/ROB state doesn't even try to keep instructions from the wrong path, even if they were also on the correct path (with different preceding instructions). But yes, stores to an unknown address (e.g. use a cmov or load result as the store address) are a problem for memory disambiguation; (https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake) modern Intel CPUs dynamically predict whether a load insn reloads a prev store; can cause mem_order pipeline nukes in single-threaded code. – Peter Cordes Feb 18 '21 at 07:32