
It's not too difficult to figure out the semantics of present CPUs by experimentation; but what I want to know is: what semantic guarantees are provided by the architectural memory model? I've read volume 3, chapter 8 of the Intel SDM, and I believe it is ambiguous. Am I missing something?


The manual nominally says:

  1. A read which succeeds a write in program order and is not from the same location may be ordered before the write.
  2. A read which succeeds a write in program order and is from the same location happens after it.
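
Rule #1 is what permits the classic store-buffer reordering; a sketch of that litmus test (the manual has an example to this effect, §8.2.3.4 if I remember the numbering correctly), with _x and _y both initially 0:

Core 1

mov     byte [_x], 0x01
mov     r1, byte [_y]

Core 2

mov     byte [_y], 0x01
mov     r2, byte [_x]

The outcome r1 == 0 and r2 == 0 is allowed: each load is from a different location than the store which precedes it, so it may be ordered before that store.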

It also says:

In a multiple-processor system, the following ordering principles apply:

  • Memory ordering obeys causality

I'll come back to this.
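
To make the causality principle concrete: the manual illustrates it with a transitive-visibility example (§8.2.3.6, if I remember the numbering correctly), roughly as follows, with _x and _y initially 0:

Core 1

mov     byte [_x], 0x01

Core 2

mov     r1, byte [_x]
mov     byte [_y], 0x01

Core 3

mov     r2, byte [_y]
mov     r3, byte [_x]

The outcome r1 == 0x01, r2 == 0x01, r3 == 0x00 is forbidden: if core 2 observed core 1's store before making its own store, then core 3 must not observe core 2's store without also observing core 1's.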

Just one sentence is later devoted to store forwarding:

The processor-ordering model described in this section is virtually identical to that used by the Pentium and Intel486 processors. The only enhancements in the Pentium 4, Intel Xeon, and P6 family processors are:

  • Store-buffer forwarding, when a read passes a write to the same memory location

There's also an example (§8.2.3.5); examples are not generally taken to be normative, but we'll take it at face value for now. From what I understand, the effect of the example is basically to amend #2 above to:

  2. If a read is from the same location as a write which it succeeds in program order, the read must happen 'after' the write, in the sense that it must observe either that write or a later write made by another core to the same location; however, in the first case, the read may happen before the write is made visible to other processors.

The example is of a situation where this makes a difference; my recollection of it is sketched below. Another example is given here, which was also the motivation for this question.
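
For reference, my recollection of the §8.2.3.5 example ('intra-processor forwarding is allowed') is roughly this, with _x and _y initially 0:

Core 1

mov     byte [_x], 0x01
mov     r1, byte [_x]
mov     r2, byte [_y]

Core 2

mov     byte [_y], 0x01
mov     r3, byte [_y]
mov     r4, byte [_x]

The outcome r2 == 0 and r4 == 0 (with r1 == 0x01 and r3 == 0x01) is allowed: each core may satisfy the reload of its own store by forwarding, before that store becomes visible to the other core.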

However, this notion of 'from the same location' takes no account of a read which partially overlaps a previous write. What are the semantics then? Here are two concrete questions:

First question

Suppose that, to set up our initial state, we have mov word [_x], 0. Then:

Core 1

mov     byte [_x], 0x01         ; byte store to the low byte of _x
mov     r1, word [_x]           ; word load covering _x and _x+1: only its low byte is covered by the store

Core 2

mov             eax, 0x0000     ; comparand: for a word-sized cmpxchg, the compare is against ax
mov             r2, 0x0100      ; value to store if the compare succeeds
lock cmpxchg    word [_x], r2   ; if [_x] == ax, store r2 into [_x]; otherwise load [_x] into ax

Is it true that exactly one of the following must be true?

  1. The write happens before the CAS; the CAS fails; and the read sees 0x0001.

  2. The write happens after the CAS; the CAS succeeds; and the read sees 0x0101.

And there is no third possibility?

  3. The write happens after the CAS; the CAS succeeds; but the read still sees 0x0001 because it happened before the write became visible to other cores.

Second question

Much more straightforward, this time: suppose that core 2 isn't writing anywhere near _x. (It might be writing elsewhere, though; otherwise the question is meaningless.) We just have:

Core 1

mov     byte [_x], 0x01
mov     r1, word [_x]

Must this read wait until after the write becomes visible to other cores?


These two questions are variants (but distinct variants) of the same question: if a read partially overlaps a write in the store queue, can you forward the part that's in the store queue and serve the rest as a regular read? Or do you have to stall the read until the store has completed? The manual only talks about a read 'to the same location as' a write; but in this case, the read is both to the same location as the write and to a different location.


You could apply the following reasoning to this case:

  • A word read is notionally a pair of byte-reads, executed atomically.
  • If we break the read up into two reads, each one can, independently, be ordered before the write is made visible to other cores; one does not alias it, and one matches it exactly.
  • Having moved both of the byte reads to the other side of the write, we can recombine them into a single atomic read.
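
Spelled out in assembly (r1_lo and r1_hi here are hypothetical byte-sized halves of r1, purely for illustration), the decomposed version of core 1 would look something like:

mov     byte [_x], 0x01         ; the write
mov     r1_lo, byte [_x]        ; same location: may be satisfied by forwarding before the write is globally visible
mov     r1_hi, byte [_x + 1]    ; different location: rule #1 allows it to be ordered before the write

Each byte read, taken on its own, may complete before the store becomes visible to other cores; the question is whether recombining them into one atomic word read preserves that freedom.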

But I believe this is unsound reasoning. You can't just take an atomic operation, break it up into its pieces, apply nonatomic reasoning to the constituents, and then recombine them into an atom.

Furthermore, if this reasoning were correct, then it would seem that the answer to question 1 is no. But doesn't that violate causality?


I think the following:

If the answer to question 1 is no, then causality is violated, which contradicts the manual's stated ordering principles.

If the answer to question 1 is yes, but the answer to question 2 is no, then a load can be served partly from the store queue and partly from cache. However, this can only be done speculatively, and must back out if another core claims the cache line exclusively between the time when the store is forwarded and the time when it is completed.

If the answer to questions 1 and 2 are both yes, then a load can be served partly from the store queue and partly from cache; but this can only be done speculatively if the core issuing the store already holds the cache line in question exclusively, and bets that no one else asks for it before the store completes.


Practically, there is very little difference between the case when only the answer to question 1 is yes, and the case when the answers to questions 1 and 2 are both yes. And no processor I know of can forward a store to a partially-overlapping load in any event. So it would seem that the answers to both questions should be yes in all cases, but this is not made explicit anywhere.

Moonchild
  • Related: [Globally Invisible load instructions](https://stackoverflow.com/q/50609934) , also [Can x86 reorder a narrow store with a wider load that fully contains it?](https://stackoverflow.com/a/35910141) is about a locking scheme that depended on that not being possible, and what Intel means by "Loads Are not Reordered with Older Stores to the Same Location" - that's highly misleading, and doesn't combine with other ordering rules to make store/reload like `mfence`. Instead, it just means the load sees the store on this core, which is obviously required for sanity. – Peter Cordes Jan 31 '23 at 05:53
  • Also [Can a speculatively executed CPU branch contain opcodes that access RAM?](https://stackoverflow.com/q/64141366) re: store buffers, and many links to related stuff / [What are the costs of failed store-to-load forwarding on x86?](https://stackoverflow.com/q/46135369) (along with links in my previous comment, we think that failed store-forwarding still doesn't drain the store buffer, instead it just takes longer to scan the store buffer and assemble a load result). So that cute trick relying on that in https://outerproduct.net/boring/2023-01-27_trans-locks.html isn't safe. – Peter Cordes Jan 31 '23 at 05:56
  • So what you're saying is that x86 processors will in fact indiscriminately forward stores to loads that partially overlap them, violating causality in cases such as my first question (ie, they don't do it speculatively, like I suggested later)? – Moonchild Jan 31 '23 at 06:01
  • In your examples, `r1` is a weird name for a memory location. In memory-order litmus tests, normally `r` stands for register; registers are thread-private so they're what you load into or store from. x86 register names include r8-r15, but not r0-r7 (at least not normally; some assemblers maybe have an option to accept them as aliases for RAX..RDI). So it's very confusing how you use them as both labels, like `lock cmpxchg word [r1], r3` they must both be registers, but also you're using EAX in another instruction? So you're using both named and numbered regs for the "legacy" 8? – Peter Cordes Jan 31 '23 at 06:03
  • I'll have a look at this question in detail tomorrow. But for now, my understanding of x86's memory model is that it's program order plus a store buffer with store-forwarding, potentially assembling a load result from cache and the store buffer before the SB entry commits. It's been this way since 486, which maybe naturally worked this way. Fun fact: old Atom CPUs (pre-Silvermont, in-order execution) had no store-forwarding stalls, even for partial overlap. – Peter Cordes Jan 31 '23 at 06:06
  • Good point; I amended the example to not use 'register' names in memory addresses. I agree referring to eax is a bit confusing--cmpxchg is hardwired to use it, but I thought it would be easier to explain in terms of cmpxchg than xchg. Feel free to edit if you think it would be clearer if written another way! – Moonchild Jan 31 '23 at 06:10
  • I'd just use x86 register names the whole time like ecx, edx etc, and talk about those in the text. Or if you want `r` names, use r8d, r9d, etc. – Peter Cordes Jan 31 '23 at 06:58

0 Answers