
Let me explain my understanding and ask you to either confirm its correctness or correct me:

  1. There's a MESI protocol which allows for efficient cache coherence (https://en.wikipedia.org/wiki/MESI_protocol). It's the state-of-the-art mechanism.
  2. For several cores of a single processor, MESI operates via L3 cache which is shared among cores of a processor.
  3. For several processors (with no shared L3), MESI operates via Main Memory.
  4. When global variables are read and written by several threads, the volatile type specifier is used to prevent unwanted optimizations as well as to prevent caching in registers (not in the L1-L3 caches). Thus, if a value is not in a register but in a cache or in main memory, MESI would do its work to make threads see the correct values of the globals.
Sergey
  • Why are you making such an unholy mix of low-level hardware mechanisms and language-specific constructs? Submit two questions for two very different topics. And no, using 'volatile' is not correct there. – SergeyA Feb 19 '16 at 17:56
  • Re: C/C++ `volatile` - yes, it compiles similarly in practice to `std::atomic` with `memory_order_relaxed`. See [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) for details, and why you should use std::atomic instead. (And the fact that cache-coherency is what lets `std::atomic` work without flush-to-shared-mem instructions.) You're 100% correct that "caching in registers" is the only kind of problematic caching; it's a common misconception that CPU caches can be stale. (Memory reordering is still a thing, of course.) – Peter Cordes Mar 11 '22 at 05:19

1 Answer

For several cores of a single processor, MESI operates via L3 cache which is shared among cores of a processor.

MESI operates at all cache levels. In some processor designs, the L3 cache serves as an efficient "switchboard" between cores. For example, if the L3 cache is inclusive and holds everything in any CPU's L1 or L2 caches, then just knowing that something isn't in the L3 cache is enough to know it's not in any other core's cache. This can reduce the amount of snooping needed. These are sophisticated optimizations though.

For several processors (with no shared L3), MESI operates via Main Memory.

I'm not sure what you're trying to say here, but it doesn't seem to correspond to anything true. MESI operates between caches. Memory isn't a cache and so has no need to participate in the MESI protocol.

You could mean that for CPUs without an L3 cache, the inter-cache MESI traffic between L2s occurs on the same CPU bus as the one that connects to main memory. This used to be true for some multi-chip CPU designs before CPUs had on-chip memory controllers. But today, most laptop/desktop multi-core CPUs have on-die memory controllers, so the bus that connects to memory only connects to memory, and there's no MESI traffic on it. If data is in one core's L2 cache and has to get to another core's L2 cache, it doesn't go through main memory. (Think about the topology of the cores and the memory controller; that would be insane.)

When using global variables, which are read and written by several threads, volatile type specifier is used to prevent unwanted optimizations as well as to prevent caching in registers (not in L1-3 caches).

I know of no language where this is true. It's certainly not true in C/C++, where volatile is for things like signals, not multithreading (at least on platforms with well-defined multithreading APIs). And it's not true for languages like Java, where volatile has specific language semantics that have nothing to do with registers.

Thus, if value is not in a register but in cache or main memory, MESI would do its work to make threads see correct values of globals.

This could be true at the hardware/assembler level, which is where registers exist. But in practice it's not, because while MESI makes the memory caches coherent, modern CPUs have other optimizations that create the same kinds of problems. For example, a CPU might prefetch a read or delay a write out of order. So you need things like memory barriers in addition to MESI. This, of course, gets very platform-specific.

You can think of MESI as an optimization. You still have to do whatever the platform requires in order for inter-thread memory visibility to work correctly. But MESI tremendously reduces what that work is.

Without MESI, for example, you might have a design where the only way for data to get from one core to another is through a write to main memory followed by waiting for the write to complete followed by a read from main memory. That would be a total disaster. First, you'd wind up having to flush things to main memory just in case another thread needed it. And second, all this traffic would choke out the regular memory traffic. Yuck.

David Schwartz
  • behold Visual Studio :) Btw, why are you saying volatile is for signals? It has nothing to do with them. – SergeyA Feb 19 '16 at 18:01
  • 1
    @SergeyA Actually, that's what `volatile` is for in C and C++. It's used to let the compiler know that a signal handler might modify a variable's value. See, for example, [this page](http://en.cppreference.com/w/c/program/sig_atomic_t). It's sometimes abused/misused for multi-threading with varied results. – David Schwartz Feb 19 '16 at 18:04
  • 2
    No, `volatile` is not for that. `volatile` is to instruct the compiler to treat variable access like file access - all reads must be real reads, all writes must be real writes (otherwise `x = 5; x = 6;` would be optimized by any compiler). Signals might (or might not) be one of the applicable scenarios, but there are plenty of others as well. – SergeyA Feb 19 '16 at 18:10
  • @SergeyA That may be true on some platforms. But there's no platform-independent notion of a "real write". So it's not at all clear what can and can't be optimized away. Those semantics are only useful on some platforms where you have platform-specific knowledge that they happen to do what you want. They have nothing to do with the semantics in the standard which are too vague. If you think otherwise, explain what a "real write" is without referring to anything platform specific. – David Schwartz Feb 19 '16 at 18:11
  • Also, I showed a link to a portable use of `volatile`. Why would you say that `volatile` is "not for that"?! – David Schwartz Feb 19 '16 at 18:13
  • 1
    It is true on ALL platforms. volatile is a compiler-level construct which prevents certain optimizations (namely, skipping reads and writes). The compiler **must** issue a memory read (or write) instruction exactly as many times as there are volatile variable accesses. That's it. – SergeyA Feb 19 '16 at 18:13
  • That there are "memory reading (or writing) instructions" is a platform-specific thing. Nothing in the standard requires C or C++ code to be compiled into instructions. Also, that would be an incoherent requirement because any optimization the compiler could make could also be made by the CPU. So that would have no dependable consequences anyway. – David Schwartz Feb 19 '16 at 18:15
  • Because volatile was certainly not invented for that. It was invented for a very different thing. And yes, it would be used here as well (in C, though). In modern C++, even in this scenario a different technique should be used. – SergeyA Feb 19 '16 at 18:15
  • @SergeyA The `volatile` keyword was invented for signal handlers and as a standard way to get platform-specific behavior needed in things like hardware drivers that are always written with intimate platform-specific knowledge of what `volatile` happens to do on that platform. – David Schwartz Feb 19 '16 at 18:16
  • 2
    The standard requires compilers to treat volatile variable access as a file access. Usually it means write/read instructions, but it could be something else. Doesn't matter. And obviously, the CPU optimization things would be in the hands of someone who used a volatile variable. For example, it might be mapped to device memory (actually, the most prominent usage of volatile) - and in this case the CPU will also know not to optimize it. – SergeyA Feb 19 '16 at 18:17
  • Yes, exactly. It has platform-specific semantics that you can use on particular platforms where you know they happen to do what you happen to want. I agree. They're largely obsolete anyway. – David Schwartz Feb 19 '16 at 18:18
  • No, it does not have platform-specific semantics. It has very well defined semantics - which would be dependent on platform. – SergeyA Feb 19 '16 at 18:18
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/103981/discussion-between-david-schwartz-and-sergeya). – David Schwartz Feb 19 '16 at 18:18
  • Chat is closed for me :). – SergeyA Feb 19 '16 at 18:19
  • Just for the record, `volatile` *does* require the compiler to re-read a value from memory (via the cache hierarchy), not reuse a value in a register, **when compiling for a register machine**. That part of what you quoted is accurate for real implementations. Of course ISO C has nothing to say about it, but before C11 it was the only way to do multi-threading. And [it does still work *in practice* (because we compile for machines with coherent caches)](https://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi/58535118#58535118), even though it's not recommended. – Peter Cordes Mar 11 '22 at 05:10
  • I think it's useful to acknowledge and point out the set of assumptions / circumstances under which that's correct, rather than just being obtuse and calling it a nonsense claim. It's a [mesi] question, not a C question, so that part of the question is pretty random and different from the actual HW / CPU-architecture parts, though. MESI is what lets `volatile` work somewhat like `std::atomic` with `memory_order_relaxed`. (And lets that work without any special flush-to-shared-L3 instructions, ensuring that threads *eventually* see correct values.) – Peter Cordes Mar 11 '22 at 05:13
  • @PeterCordes I think any opinion of how `volatile` and MESI must interact is just one person's opinion. The standard is simply not specific enough to allow anyone to say that it *requires* a compiler re-read through the cache hierarchy. For one thing, why is using the cache hierarchy allowed? Nothing in the standard weighs one way or the other, it's just one way makes more sense for the hardware. But you have no guarantee compilers won't do things that didn't make sense to you. I've been bitten many times by newer compilers doing things that made no sense (or weren't practical) on older ones. – David Schwartz Mar 11 '22 at 09:32
  • If you use `volatile` the same way the Linux kernel does (for atomics), you can be pretty sure GCC and clang won't break your code. Nor will most other compilers. Obviously I don't recommend it in practice, but it's unhelpful to claim there's nothing that can be usefully said about it. Clearly they're talking about mainstream compilers that de-facto supported `volatile` for multi-threading before C++11 existed, and still maintain whatever semantics that had, both for old codebases and for the Linux kernel. (https://lwn.net/Articles/793253/) – Peter Cordes Mar 11 '22 at 09:45
  • ISO C and C++ *do* guarantee that a volatile access is a visible side-effect, and does have to actually happen in the target machine, whatever that means. (I forget the exact language, but you're right that it doesn't mention registers or cache. But when the target *is* a mainstream CPU ISA with coherent cache, the de-facto understanding is that it means a load or store instruction in the asm, without cache-flush instructions. Because that makes sense, and MMIO regions are normally uncacheable) – Peter Cordes Mar 11 '22 at 09:46
  • That sounds like a "this is what happens" rather than a "this is what the standard says must happen" to me. It's definitely not something you can rely on. Compilers do it because it works well on their platforms, not because the standard requires it. – David Schwartz Mar 12 '22 at 06:39