It's not too difficult to figure out the semantics of present CPUs by experimentation; but what I want to know is, what semantic guarantees are provided by the architectural memory model? I've read Volume 3, Chapter 8 of the Intel SDM, and I believe it is ambiguous. Am I missing something?
The manual nominally says:
- A read which succeeds a write in program order and is not from the same location may be ordered before the write. (A concrete consequence of this rule is sketched just after this list.)
- A read which succeeds a write in program order and is from the same location happens after it.
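For concreteness: rule #1 is what permits the classic store-buffer litmus test, in which each core stores to one location and then loads the other, to return zero for both loads. Here is a rough harness one might use to look for that outcome on current hardware; the structure, the names, and the use of GCC-style inline assembly and pthreads are mine, not anything from the manual, and whether the reordering actually shows up only tells us about the CPU at hand, not about the architectural guarantee.

```c
// Rough harness for rule #1 (my own construction, nothing from the manual):
// each thread stores 1 to one location and then loads the other location.
// Because each load is from a different location than the preceding store,
// x86 permits the load to be ordered before the store becomes globally
// visible, so r1 == 0 && r2 == 0 is an allowed outcome.  Assumes GCC or
// Clang on x86-64, with pthreads.
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int x, y;
static int r1, r2;
static atomic_int go;

static void *t0(void *arg) {
    (void)arg;
    while (!atomic_load(&go)) ;              /* crude start barrier */
    int r;
    __asm__ volatile("movl $1, %[x]\n\t"     /* mov dword [_x], 1   */
                     "movl %[y], %[r]"       /* mov r1, dword [_y]  */
                     : [r] "=r"(r), [x] "+m"(x)
                     : [y] "m"(y)
                     : "memory");
    r1 = r;
    return NULL;
}

static void *t1(void *arg) {
    (void)arg;
    while (!atomic_load(&go)) ;
    int r;
    __asm__ volatile("movl $1, %[y]\n\t"     /* mov dword [_y], 1   */
                     "movl %[x], %[r]"       /* mov r2, dword [_x]  */
                     : [r] "=r"(r), [y] "+m"(y)
                     : [x] "m"(x)
                     : "memory");
    r2 = r;
    return NULL;
}

int main(void) {
    /* The window is tiny and depends on both threads actually running at
     * the same time, so it can take many iterations to see anything. */
    for (int i = 0; i < 1000000; i++) {
        x = y = 0;
        atomic_store(&go, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        atomic_store(&go, 1);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            printf("store-load reordering observed on iteration %d\n", i);
    }
    return 0;
}
```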
The manual also says:
In a multiple-processor system, the following ordering principles apply:
- Memory ordering obeys causality
I'll come back to this.
Just one sentence is later devoted to store forwarding:
The processor-ordering model described in this section is virtually identical to that used by the Pentium and Intel486 processors. The only enhancements in the Pentium 4, Intel Xeon, and P6 family processors are:
- Store-buffer forwarding, when a read passes a write to the same memory location
There's also an example (§8.2.3.5); examples are not generally taken to be normative, but we'll take it at face value for now. From what I understand, the effect of the example is basically to amend rule #2 above to:
- If a read is from the same location as a write that precedes it in program order, the read must happen 'after' the write in the sense that it must observe either that write, or a later write to the same location made by another core; however, in the first case, the read may complete before the write is made visible to other processors.
The example is of a situation where this makes a difference; I've elided it for brevity. Another example is given here, which was also the motivation for this question.
However, this notion of 'from the same location' does not account for a read which partially overlaps a previous write. What are the semantics then? Here are two concrete questions:
First question
Suppose that, to set up our initial state, we have mov word [_x], 0. Then:
Core 1
mov byte [_x], 0x01
mov r1, word [_x]
Core 2
mov eax, 0x0000
mov r2, 0x0100
lock cmpxchg word [_x], r2
Is it true that exactly one of the following must be true?
- The write happens before the CAS; the CAS fails; and the read sees 0x0001.
- The write happens after the CAS; the CAS succeeds; and the read sees 0x0101.
And there is no third possibility?
- The write happens after the CAS; the CAS succeeds; but the read still sees 0x0001, because it happened before the write became visible to other cores.
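For reference, here is roughly how I would probe this particular race on current hardware. The harness structure, the names, and the use of GCC-style inline assembly and pthreads are mine, not anything from the manual; and not observing the third outcome on one machine would, of course, say nothing about the architectural guarantee.

```c
// Rough harness (my own construction) for the race above.  Thread "core1"
// does the byte store to _x followed by the overlapping word load; thread
// "core2" does lock cmpxchg on the word at _x, expecting 0x0000 and
// installing 0x0100.  It reports the disputed third outcome: the CAS
// succeeded, yet the word load still returned 0x0001.  Assumes GCC or Clang
// on x86-64, with pthreads.
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Alignas(64) uint16_t x;   /* the word at _x, kept within one line  */
static uint16_t r1;               /* core 1's load result                  */
static int cas_ok;                /* did core 2's CAS succeed?             */
static atomic_int go;

static void *core1(void *arg) {
    (void)arg;
    while (!atomic_load(&go)) ;                   /* crude start barrier */
    uint16_t r;
    __asm__ volatile("movb $0x01, (%[x])\n\t"     /* mov byte [_x], 0x01 */
                     "movw (%[x]), %[r]"          /* mov r1, word [_x]   */
                     : [r] "=r"(r)
                     : [x] "r"(&x)
                     : "memory");
    r1 = r;
    return NULL;
}

static void *core2(void *arg) {
    (void)arg;
    while (!atomic_load(&go)) ;
    uint16_t expected = 0x0000;                   /* comparand in ax     */
    uint8_t ok;
    __asm__ volatile("lock cmpxchgw %[new], (%[x])\n\t" /* lock cmpxchg word [_x], r2 */
                     "sete %[ok]"
                     : "+a"(expected), [ok] "=r"(ok)
                     : [new] "r"((uint16_t)0x0100), [x] "r"(&x)
                     : "memory");
    cas_ok = ok;
    return NULL;
}

int main(void) {
    for (int i = 0; i < 1000000; i++) {
        x = 0x0000; r1 = 0; cas_ok = 0;
        atomic_store(&go, 0);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, core1, NULL);
        pthread_create(&t2, NULL, core2, NULL);
        atomic_store(&go, 1);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (cas_ok && r1 == 0x0001)
            printf("third possibility observed on iteration %d\n", i);
    }
    return 0;
}
```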
Second question
Much more straightforward, this time: suppose that core 2 isn't writing anywhere near _x. (It might be writing elsewhere, though; otherwise the issue is moot.) We just have:
Core 1
mov byte [_x], 0x01
mov r1, word [_x]
Must this read wait until after the write becomes visible to other cores?
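Again for reference, here is roughly how I would probe this empirically; the same caveats apply (GCC-style inline assembly, pthreads, names and structure my own, and the result only describes the CPU it runs on). Each thread performs the byte-store-then-overlapping-word-load pair on its own word and then loads the other thread's word. Since x86 does not reorder loads with other loads, seeing both cross-loads return 0 in the same run would mean that at least one of the overlapping word loads completed before its core's byte store became globally visible.

```c
// Rough harness for the second question (my own construction).  If the
// overlapping word load of [_x] had to wait for the byte store to [_x] to
// become globally visible (and likewise for _y on the other thread), then
// the outcome r2 == 0 && r4 == 0 would be impossible; observing it shows
// that, on this machine, the overlapping load completed early.  Assumes GCC
// or Clang on x86-64, with pthreads.
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Alignas(64) uint16_t x;   /* the word at _x                         */
static _Alignas(64) uint16_t y;   /* the word at _y, in a different line    */
static uint16_t r1, r2, r3, r4;
static atomic_int go;

static void *core1(void *arg) {
    (void)arg;
    while (!atomic_load(&go)) ;                   /* crude start barrier */
    uint16_t a, b;
    __asm__ volatile("movb $0x01, (%[x])\n\t"     /* mov byte [_x], 0x01 */
                     "movw (%[x]), %[a]\n\t"      /* mov r1, word [_x]   */
                     "movw (%[y]), %[b]"          /* mov r2, word [_y]   */
                     : [a] "=&r"(a), [b] "=r"(b)
                     : [x] "r"(&x), [y] "r"(&y)
                     : "memory");
    r1 = a; r2 = b;
    return NULL;
}

static void *core2(void *arg) {
    (void)arg;
    while (!atomic_load(&go)) ;
    uint16_t a, b;
    __asm__ volatile("movb $0x01, (%[y])\n\t"     /* mov byte [_y], 0x01 */
                     "movw (%[y]), %[a]\n\t"      /* mov r3, word [_y]   */
                     "movw (%[x]), %[b]"          /* mov r4, word [_x]   */
                     : [a] "=&r"(a), [b] "=r"(b)
                     : [x] "r"(&x), [y] "r"(&y)
                     : "memory");
    r3 = a; r4 = b;
    return NULL;
}

int main(void) {
    for (int i = 0; i < 1000000; i++) {
        x = 0; y = 0; r1 = r2 = r3 = r4 = 0;
        atomic_store(&go, 0);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, core1, NULL);
        pthread_create(&t2, NULL, core2, NULL);
        atomic_store(&go, 1);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (r2 == 0 && r4 == 0)
            printf("both cross-loads returned 0 on iteration %d\n", i);
    }
    return 0;
}
```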
These two questions are variants (but distinct variants) of the same question: if a read partially overlaps a write sitting in the store queue, can you forward the part that's in the store queue and serve the rest as a regular read? Or must the read stall until the write has completed? The manual only talks about a read 'to the same location as' a write; but in this case, the read is both to the same location as the write and to a different location.
You could apply the following reasoning to this case:
- A word read is notionally a pair of byte-reads, executed atomically.
- If we break the read up into two reads, each one can, independently, be ordered before the write is made visible to other cores; one does not alias it, and one matches it exactly.
- Having moved both of the byte reads to the other side of the write, we can recombine them into a single atomic read.
But I believe this is unsound reasoning. You can't just take an atomic operation, break it up into its pieces, apply nonatomic reasoning to the constituents, and then recombine them into an atom.
Furthermore, if this reasoning were correct, then the third possibility above would be allowed, and the answer to question 1 would be no. But doesn't that violate causality?
I think the following:
- If the answer to question 1 is no, then causality is violated, contradicting the manual's own ordering principle.
- If the answer to question 1 is yes, but the answer to question 2 is no, then a load can be served partly from the store queue and partly from cache. However, this can only be done speculatively, and it must be backed out if another core claims the cache line exclusively between the time the store is forwarded and the time it completes.
- If the answers to questions 1 and 2 are both yes, then a load can still be served partly from the store queue and partly from cache; but this can only be done speculatively if the core issuing the store already holds the cache line in question exclusively, and bets that no one else asks for it before the store completes.
Practically, there is very little difference between the case where only the answer to question 1 is yes and the case where the answers to both questions are yes. And no processor I know of forwards a store to a partially overlapping load in any event. So it would seem that the answer to both questions should be yes, but this is not made explicit anywhere.