  • The point of MESI is to retain a notion of a shared memory system.
  • However, with store buffers, things are complicated:
  • Memory is coherent downstream of the point where the data hits the MESI-implementing caches.
  • However, upstream of that, each core may disagree on what is in memory location X, depending on what is in each core's local store buffer.
  • As such, it seems, from the viewpoint of each core, that the state of memory is different - it is not coherent.
  • So, why do we bother "partially" enforcing coherency with MESI?

Edit: A substantial edit was made, after some further narrowing of what was really confusing me. I have tried to keep the general notion of the question the same, to preserve the relevance of the great answers received.

Peter Cordes
Kay
  • @PaulA.Clayton. Right, and I understand that; but how does that answer the question? I haven't mentioned consistency - just coherency: fetching a stale cache line. – Kay Apr 15 '18 at 17:55
  • @Kay, the example you gave has nothing to do with "fetching a stale cache line", it preserves cache coherence and conforms with the TSO memory ordering model (which is indeed weaker than sequential consistency, but far from being "weak"). – Leeor Apr 15 '18 at 20:28
  • @PaulA.Clayton I think I confused matters by linking an example that *does* have to do with consistency. Sorry about that. – Kay Apr 15 '18 at 20:59
  • @Leeor I think I confused matters by linking an example that *does* have to do with consistency. Sorry about that. – Kay Apr 15 '18 at 21:02
  • Your question is still not clear after the edit. You should expand the second and third points of the question. – Hadi Brais Apr 15 '18 at 21:16
  • 1
    The caches in mainstream Intel CPUs _are_ fully coherent (for the usual memory types and normal reads and writes), so the basis for your question (that they aren't) is probably invalid. Yes, the x86 model memory itself allows some limited re-ordering (loads passing earlier stores), but that's most a result of in-core reodering due to the store buffer and other out-of-order effects, and not due to incoherent caches. There is no "eventual coherency" of the type you describe. If a core does not respond "immediately" to a snoop, the snooping core must simply wait. – BeeOnRope Apr 16 '18 at 00:15

2 Answers


The point of MESI on x86 is the same as on pretty much any multiple core/CPU system: to enforce cache coherency. There is no "partial coherency" used for the cache coherency part of the equation on x86: the caches are fully coherent. The possible re-orderings, then, are a result of both the coherent caching system and its interaction with core-local components such as the load/store subsystem (especially store buffers) and other out-of-order machinery.

The result of that interaction is the architected strong memory model that x86 provides, with only limited re-ordering. Without coherent caches, you couldn't reasonably implement this model at all, or almost any model that was anything other than completely weak1.

Your question seems to embed the assumption that the only possible states are "coherent" and "everything else". Also, there is some mixing of the ideas of cache coherency (which mostly deals with the caches specifically, and is mostly a hidden detail), and the memory consistency model, which is architecturally defined and will be implemented by each architecture2. Wikipedia explains that one difference between cache coherency and memory consistency is that the rules for the former apply only to one location at a time, whereas consistency rules apply across locations. In practice, the more important distinction is that the memory consistency model is the only architecturally documented one.

Briefly, Intel (and AMD likewise) define a specific memory consistency model, x86-TSO3 - which is relatively strong as far as memory models go, but is still weaker than sequential consistency. The primary behaviors weakened compared to sequential consistency are:

  • That later loads can pass earlier stores.
  • That stores can be seen in a different order from the total store order, but only by cores that performed one of the stores.

In order to implement this memory model, various parts must play by the rules to achieve it. On all recent x86, this means ordered load and store buffers, which avoid the disallowed re-orderings. The use of a store buffer results in the two re-orderings mentioned above: without allowing those, the implementation would be very restricted and probably much slower. In practice it also means fully coherent data caches, since many of the guarantees (e.g., no load-load reordering) would be very difficult to implement without that.

To wrap it all up:

  • Memory consistency is different than cache coherency: the former is what is documented and forms part of the programming model.
  • In practice, x86 implementations have fully coherent caches, which helps them implement their x86-TSO memory model, which is fairly strong but weaker than sequential consistency.
  • Finally, perhaps the answer you were looking for, in different words: a memory model weaker than sequential consistency is still very useful since you can program against it, and in the case you need sequential consistency for some particular operation(s) you insert the right memory barriers4.
  • If you program against a language-supplied memory model, such as Java's or C++11's, you don't need to worry about the hardware specifics, but rather the language memory model, and the compiler inserts the barriers required to match the language memory model semantics to the hardware one. The stronger the hardware model, the fewer the barriers required.

1 If your memory model was completely weak, i.e., not really placing any restrictions on cross-core reordering, I suppose you could implement it directly on a non-cache coherent system in a cheap way for normal operations, but then memory barriers potentially become very expensive since they would need to flush a potentially large part of the local private cache.

2 Various chips may implement it differently internally, and in particular some chips may implement stronger semantics than the model (i.e., some allowed re-orderings can never be observed), but absent bugs none will implement a weaker one.

3 This is the name given to it in that paper, which I use because Intel themselves don't give it a name, and the paper provides a more formal definition than the one Intel gives, which is a less formal model presented as a series of litmus tests.

4 In practice on x86 you usually use locked instructions (using the lock prefix) rather than separate barriers, although standalone barriers also exist. Here I'll just use the term barriers to refer to both standalone barriers and the barrier semantics embedded into locked instructions.

BeeOnRope
  • Thanks for the great response. I've restructured the question, to hopefully pinpoint precisely what has been confusing me. As you note, it's the interaction with the store buffers. – Kay Apr 19 '18 at 11:18
  • I updated my answer, but your question is still a bit confused. There is no "partial enforcement of consistency with MESI" - the caches are _fully consistent_. The overall system, which is more than just the caches, however, follows a particular _memory model_ which is weaker than full _sequential consistency_. I think your question is more about the memory model supported by the hardware, and not really about MESI or caching, which are internal details of how particular chips are implemented. The guarantees are in terms of memory consistency, not cache coherence. – BeeOnRope Apr 20 '18 at 01:08

Re: your edit, which seems to be a new question: right, store-forwarding "violates" coherency. A core can see its own stores earlier than any other core can see them. The store buffer is not coherent.

The x86 memory ordering rules require that loads become globally visible in program order, but allows a core to load data from its own stores before they become globally visible. It doesn't have to pretend it waited and check for memory-order mis-speculation, like it does in other cases of doing loads earlier than the memory model says it should.

Also related: Can x86 reorder a narrow store with a wider load that fully contains it? is a specific example of the store buffer + store-forwarding violating the usual memory-ordering rules. See this collection of mailing list posts by Linus Torvalds explaining the effect of store-forwarding on memory ordering (and how it means that the proposed locking scheme doesn't work).


Without coherency at all, how would you atomically increment a shared counter, or implement other atomic read-modify-write operations that are essential for implementing locks, or for use directly in lockless code? (See Can num++ be atomic for 'int num'?).

lock add [shared_counter], 1 in multiple threads at the same time never loses any counts on actual x86, because the lock prefix makes a core keep exclusive ownership of the cache line from the load until the store commits to L1d (and thus becomes globally visible).

A system without coherent caches would let each thread increment its own copy of a shared counter, and then the final value in memory would come from whichever thread last flushed that line.

Allowing different caches to hold conflicting data for the same line long-term even while other loads/stores happened, and across memory barriers, would allow all sorts of weirdness.

It would also violate the assumption that a pure store becomes visible to other cores promptly. If you didn't have coherency at all, then cores could keep using their cached copy of a shared variable. So if you wanted readers to notice updates, you'd have to clflush before every read of the shared variable, making the common case expensive (when nobody has modified the data since you last checked).

MESI is like a push notification system, instead of forcing every reader to re-validate their cache on every read.

MESI (or coherency in general) allows things like RCU (Read-Copy-Update) to have zero overhead for readers (compared to single threaded) in the case where the shared data structure hasn't been modified. See https://lwn.net/Articles/262464/, and https://en.wikipedia.org/wiki/Read-copy-update. The basic idea is that instead of locking a data structure, a writer copies the whole thing, modifies the copy, and then updates a shared pointer to point to the new version. So readers are always entirely wait-free; they just dereference an (atomic) pointer, and data stays hot in their L1d caches.


Hardware-supported coherency is extremely valuable, and almost every shared-memory SMP architecture uses it. Even ISAs with much weaker memory-ordering rules than x86, like PowerPC, use MESI.

Peter Cordes
  • Manycore processors and GPGPUs are shared memory architectures that typically don't implement cache coherence or offer a limited form of coherence. – Hadi Brais Apr 17 '18 at 01:56
  • @HadiBrais: so incrementing a shared atomic on a GPGPU would need an expensive explicit barrier to ensure coherence, like @ Bee's answer suggests? I overlooked that possibility when writing this. (And when you say manycore, you're excluding Xeon Phi, right? I think it implements the x86 shared memory model.) – Peter Cordes Apr 17 '18 at 02:04
  • The Xeon Phi processor is coherent within itself and with other processors. The Xeon Phi coprocessor is only coherent within itself, not with other processors or coprocessors. This is in contrast to multi-socket multicore processors. – Hadi Brais Apr 17 '18 at 02:10
  • GPGPUs have coherent L2, but not L1. – Hadi Brais Apr 17 '18 at 02:11
  • Intel SCC is not hardware-coherent at all. – Hadi Brais Apr 17 '18 at 02:11
  • 1
    @PeterCordes - these incoherent architectures would typically have explicit communication instructions that could be used to synchronize specific regions of memory, so something much richer than memory barriers. Of course, you aren't generally targeting these with standard C++ atomics, so the "problem" with implementation of C++ or CPU-like atomic operations and fences doens't really arise. – BeeOnRope Apr 17 '18 at 02:58
  • @BeeOnRope: Ok, that makes sense. And yeah, not running normal fine-grained shared-memory multi-threaded code is what lets those architectures use a different coherence model that wouldn't be appropriate for normal x86 or PowerPC that do run workloads like that. – Peter Cordes Apr 17 '18 at 03:01
  • These comments are a little over my head :P. Glad to see such in-depth knowledge though. Hopefully not to minimize the great answer Peter gave, but I have restructured the question to pinpoint my area of confusion; namely, "coherence" of each core's "view of memory" when store buffers are involved. – Kay Apr 19 '18 at 11:22
  • So, for example, given your initial response, I would ask also, with store buffers, how would atomics work? They must cause the store buffer to be drained first, then lock the cache line(s)? – Kay Apr 19 '18 at 11:24
  • @Kay: If you're hoping @ BeeOnRope will see your comments, you need to use an @ to notify him. (But you can only notify one user per comment. This comment will ping you, not Bee.) – Peter Cordes Apr 19 '18 at 11:40
  • @Kay: You mean read-modify-write atomics? Like `xchg` or `lock add`? On x86, atomic ops are full memory barriers anyway, so yes they drain the store buffer as well as holding the cache line locked. On a weakly-ordered ISA (like most RISC), that wouldn't be necessary; LL/SC could complete successfully while there are still outstanding cache-miss stores in the store buffer. But yeah, x86's memory model wouldn't allow that. – Peter Cordes Apr 19 '18 at 11:43
  • 1
    @PeterCordes Re: Your edit: "a core is explicitly allowed to see its own stores before they become globally visible". I *think* so; see Intel SDM Vol3: 8.2.3.5. "The memory-ordering model allows concurrent stores by two processors to be seen in different orders by those two processors; specifically, each processor may perceive its own store occurring before that of the other". – Kay Apr 19 '18 at 12:56
  • @PeterCordes My guess is this is because of what you stated, as hinted by Intel: "In practice, the reordering in this example can arise as a result of store-buffer forwarding. While a store is temporarily held in a processor's store buffer, it can satisfy the processor's own loads but is not visible to (and cannot satisfy) loads by other processors." – Kay Apr 19 '18 at 12:58
  • @Kay: Yup, thanks for checking. That's what I thought. – Peter Cordes Apr 19 '18 at 13:01
  • @Kay: I remembered a previous SO question that's relevant; added a link to my answer. – Peter Cordes Apr 19 '18 at 13:07
  • @PeterCordes Hey Peter. Thanks for the great question and comments. Being forced to accept only 1 answer doesn't do your effort justice. – Kay Apr 21 '18 at 14:52