
In languages like C, unsynchronized reads and writes to the same memory location from different threads are undefined behavior. But in the CPU, cache coherence says that if one core writes to a memory location and later another core reads it, the other core has to read the written value.

Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away? Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?

Dan
  • Memory barriers and cache coherency are different things – Support Ukraine Oct 11 '21 at 12:14
  • Undefined behaviour is not about the kernel, but about how the compiler will interpret your code and generate the binary – Ôrel Oct 11 '21 at 12:15
  • `if the next layer up` Well, C is not necessarily "next layer up", and undefined behavior in C _only_ means that there is no _requirement_ on the behavior of the program from the C standard - there may be requirements from other standards, and specific C programs can depend on specific hardware & compiler behaviors. – KamilCuk Oct 11 '21 at 12:22
  • https://softwareengineering.stackexchange.com/ might be a better fit for this Q. – hyde Oct 11 '21 at 12:26
  • Memory barriers enforce ordering, not coherence. A barrier instruction only has a direct effect on the core that issues it; all other cores are oblivious to it, except that they see updates in a certain order. Cache coherence, in contrast, is a sort of broadcast mechanism ("I need X") which permits cores to lazily update RAM, presumably to maintain fast access and avoid redundant updates. – mevets Oct 11 '21 at 12:35
  • Suppose CPU A sets byte 0 of the cache line and CPU B sets byte 15 at more or less the same time. There's no way to resolve this without cache coherence. Doing two operations will always have a race. – stark Oct 11 '21 at 12:38
  • Ok, I guess I had the wrong idea of what a memory barrier is, so I edited the question. – Dan Oct 11 '21 at 12:53
  • @stark That's a good point, the language does say that you can do writes at granularity as fine as a single byte without disturbing adjacent memory locations – Dan Oct 11 '21 at 13:07
  • Right, so in fact, standard C more or less *requires* that you have coherent caches. For a machine with incoherent caches to conform, it would have to have 1-byte cache lines. Moreover, every acquire barrier would have to invalidate the core's entire cache, and every release barrier would have to flush the whole thing. That would be prohibitively expensive. – Nate Eldredge Oct 11 '21 at 13:22
  • Cache coherence is more than just the same memory location. Cache coherence is what lets you write one memory location on one core and the *next* memory location on a different core! – user253751 Oct 11 '21 at 13:32
  • @stark That's a great example. The OP's question also bothered me for quite some time and I was about to post the same question, but reading your answer makes it unnecessary. – Weipeng Nov 24 '22 at 20:37

3 Answers


The acquire and release semantics required for C++11 std::mutex (and equivalents in other languages, and earlier stuff like pthread_mutex) would be very expensive to implement if you didn't have coherent cache. You'd have to write back every dirty line every time you released a lock, and evict every clean line every time you acquired a lock, if you couldn't count on the hardware to make your stores visible, and to make your loads not take stale data from a private cache.

But with cache coherency, acquire and release are just a matter of ordering this core's accesses to its own private cache which is part of the same coherency domain as the L1d caches of other cores. So they're local operations and pretty cheap, not even needing to drain the store buffer. The cost of a mutex is just in the atomic RMW operation it needs to do, and of course in cache misses if the last core to own the mutex wasn't this one.
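
To make that concrete, here is a minimal spinlock sketch in C++ (illustrative only, not production code and not from the answer itself): the only atomic RMW is in lock(), and unlock() is just a plain release store; no explicit flush or invalidate appears anywhere, because coherence makes the store visible once it commits to this core's cache.

```cpp
#include <atomic>

// Minimal spinlock sketch: the only atomic RMW is in lock(); unlock() is a
// plain store with release ordering. No flush/invalidate instructions are
// needed; coherence makes the store visible once it commits to L1d.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Atomic RMW with acquire ordering: accesses inside the critical
        // section can't be reordered before taking the lock.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin until the previous owner's release store becomes visible
        }
    }
    void unlock() {
        // Release store: everything written in the critical section is
        // visible to the next thread whose acquire RMW observes this store.
        locked.store(false, std::memory_order_release);
    }
};
```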

C11 and C++11 added `<stdatomic.h>` and `std::atomic` respectively, which make it well-defined to access shared `_Atomic int` variables, so it's not true that higher-level languages don't expose this. It would hypothetically be possible to implement on a machine that required explicit flushes/invalidates to make stores visible to other cores, but that would be very slow. The language model assumes coherent caches, not providing explicit flushes of ranges but instead having release operations that make every older store visible to other threads that do an acquire load that syncs-with the release store in this thread. (See When to use volatile with multi threading? for some discussion, although that answer is mainly debunking the misconception that caches could have stale data, from people mixed up by the fact that the compiler can "cache" non-atomic non-volatile values in registers.)
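
As a rough sketch of that release/acquire model (the names `payload` and `ready` are just illustrative, not from the standard or the question):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                       // ordinary non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                   // plain store
    ready.store(true, std::memory_order_release);   // publish it
}

void consumer() {
    // The acquire load syncs-with the release store once it observes true,
    // making the earlier write to payload visible: no explicit flush needed.
    while (!ready.load(std::memory_order_acquire)) { }
    assert(payload == 42);
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```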

In fact, some of the guarantees on C++ atomic are actually described by the standard as exposing HW coherence guarantees to software, like "write-read coherence" and so on, ending with the note:

http://eel.is/c++draft/intro.races#19

[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note ]
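
Here is a small, hypothetical illustration of those per-object coherence rules: even with relaxed ordering, two loads of the same atomic object in one thread can't observe its modification order going backwards.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};   // hypothetical shared object

void writer() {
    x.store(1, std::memory_order_relaxed);
    x.store(2, std::memory_order_relaxed);   // 2 follows 1 in x's modification order
}

void reader() {
    int a = x.load(std::memory_order_relaxed);
    int b = x.load(std::memory_order_relaxed);
    // Read-read coherence: the second load can't see an earlier value in
    // x's modification order than the first load did.
    if (a == 2) assert(b == 2);
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}
```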

(Long before C11 and C++11, SMP kernels and some user-space multithreaded programs were hand-rolling atomic operations, using the same hardware support that C11 and C++11 finally exposed in a portable way.)


Also, as pointed out in comments, coherent cache is essential for writes to different parts of the same line by different cores not to step on each other.

ISO C11 guarantees that a `char arr[16]` can have `arr[0]` written by one thread while another writes `arr[1]`. If those are both in the same cache line, and two conflicting dirty copies of the line exist, only one can "win" and be written back. (See C++ memory model and race conditions on char arrays.)
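
A minimal sketch of that guarantee (assuming a hosted implementation with `<thread>`; the array name is just illustrative):

```cpp
#include <cassert>
#include <thread>

char arr[16] = {};   // both elements below almost certainly share a cache line

int main() {
    // No atomics, no locks: writing different array elements from different
    // threads is not a data race in ISO C11 / C++11.
    std::thread t1([] { arr[0] = 'A'; });
    std::thread t2([] { arr[1] = 'B'; });
    t1.join();
    t2.join();
    assert(arr[0] == 'A' && arr[1] == 'B');   // both writes must survive
}
```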

ISO C effectively requires `char` to be as large as the smallest unit you can write without disturbing surrounding bytes. On almost all machines (not early Alpha and not some DSPs), that's a single byte, even if a byte store might take an extra cycle to commit to L1d cache vs. an aligned word on some non-x86 ISAs.

The language didn't officially require this until C11, but that just standardized what "everyone knew" the only sane choice had to be, i.e. how compilers and hardware already worked.

Peter Cordes
  • Great answer! Could you also elaborate a bit further on why lock acquire/release would be much more expensive without a cache coherence protocol? My understanding is that, fundamentally, the cache coherence protocol lets each core determine the state of its private cache at any moment, such that when a store-release happens, it can know e.g. "these cache lines are in (E)xclusive state, so I don't need to flush them at all", hence the efficiency. Did I get this right? – Weipeng Nov 24 '22 at 20:33
  • @WeipengL: Sort of but not exactly. Cache coherence means that inter-thread visibility is achieved just by controlling local ordering (of stores committing to L1d cache, and loads reading from L1d). No flushing at all is ever required for a release operation itself. A cache line only has to get written-back (or transferred directly core-to-core) when another core actually loads a line that this core has in Modified state. – Peter Cordes Nov 25 '22 at 03:33
  • @WeipengL: But without hardware cache coherence, a release operation would have to manually sync the whole L1d cache to a shared level of cache (like L3 in modern x86, or DRAM in a multi-socket system where not all cores share the same L3). Of course if you were designing a system to require explicit flushes, you might provide some hardware assistance, like recording a set of lines between a start point and an end point, to let software only flush lines they touched while holding a lock. But it would still mean some write-backs have to happen before a release-store can become visible. – Peter Cordes Nov 25 '22 at 03:36
  • @WeipengL: Also it would mean that atomic relaxed stores would have to flush their own cache line either right away or at some upcoming sync point to become visible across threads. So repeated stores by the same thread to the same object would always have to retake ownership of it, instead of just hitting in L1d cache like real hardware can. See also [Why set the stop flag using \`memory\_order\_seq\_cst\`, if you check it with \`memory\_order\_relaxed\`?](https://stackoverflow.com/q/70581645) (no reason, that's not helpful even to reduce latency.) – Peter Cordes Nov 25 '22 at 03:38
  • @WeipengL: See also https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ which makes an analogy of coherent cache being like a shared server, and individual cores pushing/pulling with some control over local reordering. – Peter Cordes Nov 25 '22 at 03:44
  • Thanks a lot for the extra details. Interestingly, I have read this piece by Preshing before. Now that I have re-read it, it makes me think: is it fair to say that the effect of having cache coherence is to create an illusion that all stores/loads are immediately visible to all cores, such that each core only needs to take care of the **order** (instead of order AND visibility)? A separation of concerns, so to speak. – Weipeng Dec 02 '22 at 03:18
  • OTOH, is it also fair to say that it's theoretically conceivable to design an architecture without a cache coherence protocol that ends up being more performant? Take the following scenario for example: N worker threads each do some work and increment a global counter to tally total work done. What's special is that they only care about a very rough tally, i.e. a big error margin is tolerable. In this case, without a cache coherence protocol, no hardware sync is needed, the data race is fine, and performance is best, right? But with coherence, we get "the benefit" whether we need it or not. – Weipeng Dec 02 '22 at 03:39
  • @WeipengL: Yes, with coherent cache, each core just needs to order its own accesses to globally visible cache. That's sufficient to recover sequential consistency if you want it, or acquire/release. As for cases where you wouldn't use coherent caches, GPUs often don't. I don't think a shared counter example makes sense. If you don't need it to be accurate, just have threads only increment the global counter every 100, or have them each increment their own and readers loop over an array of counts (in separate cache lines). – Peter Cordes Dec 02 '22 at 05:31

Ah, a very deep topic indeed!

Cache coherency between cores is used to synthesise (as closely as possible) a Symmetric Multi Processing (SMP) environment. This harks back to the days when multiple single-core CPUs were simply tagged on to the same single memory bus, circa the mid 1990s, when caches weren't really a thing, etc. With multiple CPUs, each with multiple cores, each with multiple caches and multiple memory interfaces per CPU, the synthesis of an SMP-like environment is a lot more complicated, and cache coherency is a big part of that.

So, when one asks, "Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away?", one is really asking "Do we still need an SMP environment?".

The answer is software. An awful lot of software, including all major OSes, has been written around the assumption that it's running in an SMP environment. Take away the SMP, and we'd have to re-write literally everything.

There are now various sage commentators beginning to wonder in articles whether SMP is in fact a dead end, and whether we should start worrying about how to get out of that dead end. I think that it won't happen for a good long while yet; the CPU manufacturers have likely got a few more tricks to play to get ever-increasing performance, and whilst that keeps being delivered no one will want to suffer the pain of software incompatibility. Security is another reason to avoid SMP - Meltdown and Spectre exploit weaknesses in the way SMP has been synthesised - but I'd guess that whilst other mitigations (however distasteful) are available, security alone will not be sufficient reason to ditch SMP.

"Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?" Why not, indeed? We have been there before. Transputers (1980s, early 1990s) implemented Communicating Sequential Processes (CSP), where if the application needed a different CPU to process some data, the application would have to purposefully transfer data to that CPU. The transfers are (in CSP speak) through "Channels", which are more like network sockets or IPC pipes and not at all like shared memory spaces.

CSP is having something of a resurgence - as a multiprocessing paradigm it has some very beneficial features - and languages such as Go, Rust, Erlang implement it. The thing about those languages' implementations of CSP is that they're having to synthesise CSP on top of an SMP environment, which in turn is synthesised on top of an electronic architecture much more reminiscent of Transputers!

Having had a lot of experience with CSP, my view is that every multi-process piece of software should use CSP; it's a lot more reliable. The "performance hit" of "copying" data (which is what you have to do to do CSP properly on top of SMP) isn't so bad; it's about the same amount of traffic over the cache-coherency connections to copy data from one CPU to another as it is to access the data in an SMP-like way.

Rust is very interesting, because with its syntax strongly expressing data ownership, I suspect that it doesn't have to copy data to implement CSP; it can transfer ownership between threads (processes). Thus it may be getting the benefits of CSP, but without having to copy the data. Therefore it could be very efficient CSP, even if every thread is running on a single CPU core. I've not yet explored Rust deeply enough to know whether that is what it's doing, but I have hopes.
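
As a rough illustration of that move-instead-of-copy idea, here's a toy one-slot channel sketched in C++ (rather than Rust or Go); the `Channel` class and names are hypothetical, not any real library's API. The payload is handed over as a `std::unique_ptr`, so ownership transfers between threads without the data itself being copied.

```cpp
#include <condition_variable>
#include <iostream>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Toy single-slot channel: send() blocks until the slot is free, receive()
// blocks until it is full. The payload is moved, never deep-copied.
template <typename T>
class Channel {
    std::mutex m;
    std::condition_variable cv;
    std::unique_ptr<T> slot;
public:
    void send(std::unique_ptr<T> msg) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !slot; });   // wait for an empty slot
        slot = std::move(msg);                // ownership moves into the channel
        cv.notify_all();
    }
    std::unique_ptr<T> receive() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return static_cast<bool>(slot); });
        auto msg = std::move(slot);           // ownership moves out again
        cv.notify_all();
        return msg;
    }
};

int main() {
    Channel<std::vector<int>> ch;
    std::thread producer([&] { ch.send(std::make_unique<std::vector<int>>(1000, 7)); });
    std::thread consumer([&] { std::cout << ch.receive()->size() << '\n'; });
    producer.join();
    consumer.join();
}
```

Underneath, of course, this sketch still relies on the SMP machinery (a mutex, a condition variable, coherent caches) that the rest of this answer describes; only the programming model is channel-shaped.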

One of the nice things about CSP is that, with Channels being like network sockets or IPC pipes, one can readily implement CSP across actual network sockets. Raw sockets are not in themselves ideal - they're asynchronous and so more akin to the Actor Model (as is ZeroMQ). The Actor Model is fairly OK - and I've used it - but it's not as guaranteed to be devoid of runtime problems as CSP is. So one has to implement the CSP bit oneself or find a library. However, with that in place, CSP becomes a software architecture that can more easily span arbitrary networks of computers without having to change the software architecture; a local channel and a network channel are "the same", except the network one is a bit slower.

It's a lot harder to take a multithreaded piece of software that assumes SMP, uses semaphores, etc., and scale it up across multiple machines on a network. In fact, it can't be done; it has to be re-written.

More recently than the Transputers, the Cell processor (of PlayStation 3 fame) was a multi-core device that did exactly as you suggest. It had a single CPU core, and 8 SPE maths cores, each with 256 KiB of on-chip, core-speed static RAM. To use the SPEs you had to write software to ship code and data in and out of that 256 KiB (there was a monstrously fast internal ring bus for doing this, and a very fast external memory interface). The result was that, with the right developer, very good results could be attained.

It took Intel about a further 10 years to usefully get x64 up to about the same performance; adding a fused multiply-add instruction to its vector extensions was what finally got them there, an instruction they'd been keeping in Itanium's repertoire in the vain hope of boosting its appeal. Cell (whose SPEs were based on the PowerPC equivalent of SSE - AltiVec) had had an FMA instruction from the get-go.

bazza
  • Caches were very much a thing in the mid 90s, especially for processors that were on the high end of what tech at the time could do (otherwise you'd usually just buy one faster CPU instead of putting together multiple in one system). Maybe you're thinking of earlier, like 386 SMP systems (with no cache and minimal CPU pipelining, its memory model was sequentially consistent!). Or even earlier historical machines back to 1962. https://en.wikipedia.org/wiki/Symmetric_multiprocessing#History - early Unix SMP included 1984 Sequent Balance 8000 with "small write-through caches" and shared memory. – Peter Cordes Dec 26 '22 at 08:49
  • Meltdown fully applies to single-core systems. It's about the CPU protecting the OS on that core from user-space. (The user/supervisor bit in a page-table entry). The problem is a wrong assumption that any microarchitectural state changed by speculative (and mis-speculated) execution isn't relevant as long as secrets never become architectural state directly. Spectre also applies to single-core systems, again for protecting the kernel against user-space (especially on system calls), or context-switches to other user-space. It's about training branch predictors, which are per-core. – Peter Cordes Dec 26 '22 at 08:54
  • @PeterCordes, yes, I am referring to those early 386 SMP systems :-) I can remember people buying them, getting expensive MatLab licenses, and wondering why the performance wasn't magically twice as good! – bazza Dec 26 '22 at 08:55
  • @PeterCordes, re Meltdown / Spectre, it was the use of cache side channels combined with them to which I was referring. No cache (e.g., like Cell), no cache side channels. Or at least, that's what I'd hope for! – bazza Dec 26 '22 at 09:03
  • http://hw-museum.cz/article/5/cpu-history-tour--1995---1999-/3 for CPU dates. https://en.wikipedia.org/wiki/Pentium_Pro#Bus_and_multiprocessor_capabilities - PPro was multi-socket SMP capable from launch (Nov 1995), so we're talking about a P6 core with *2 levels* of write-back cache (although L2 was only packaged with it, not actually on die in the initial version). I didn't find a date for 386 SMP, or when SMP systems started having cache, but mainstream RISC chips like MIPS all had some cache. – Peter Cordes Dec 26 '22 at 09:03
  • Not having cache is unique to Cell SPEs, they have fast local memory instead. Any CPU for general-purpose computing is going to have cache; that's not an artifact of SMP support, only *coherence* is. For example, GPUs have non-coherent cache; look at that if you want a real example of what non-cc NUMA (https://en.wikipedia.org/wiki/Non-uniform_memory_access#Cache_coherent_NUMA_(ccNUMA) could be like). (Also, Spectre and Meltdown could exploit other side-channels, like ALU timing. The most common / best side-channel is cache-timing, but that's a pretty indirect relationship to SMP.) – Peter Cordes Dec 26 '22 at 09:07
  • @PeterCordes there are also VME systems - multiple processor cards, separate memory cards, other peripherals, all on the single VME bus. Amazingly, VME is still current! Generally in VME systems you'd set processor cards up to not cache memory on the VME bus. Cell did point to a possible future of cacheless general purpose compute, but no one was interested. Pity! Still, it was a very impressive chip for the time. That architecture, updated and in a modern Si process, would still beat anything else out there today. – bazza Dec 27 '22 at 09:02
  • Yeah Cell is good for the limited set of problems that work well with a small memory footprint that *can* be manually managed efficiently. Often that means DSP type tasks, vector processing over arrays. I suspect that for some problems, like large hash tables, having hardware adaptively cache frequently-used items is hard to beat. – Peter Cordes Dec 27 '22 at 09:27
  • More modern incarnations of fast local memory include Xeon Phi, where the 6 to 16GiB of very high bandwidth MCDRAM could be configured as plain transparent last-level cache, or as architecturally-visible "local" scratchpad memory. Dr. Bandwidth says it's common for DSPs in general to have fast local SRAM that can be used as scratchpad memory [Can you directly access the cache using assembly?](https://stackoverflow.com/a/62225075) – Peter Cordes Dec 27 '22 at 09:27
  • Cache coherency is not needed if a developer takes care of issuing lock (+ memory barrier) / (memory barrier +) unlock regardless of it.

  • Cache coherency is of little value, or even negative value, in terms of cost, power, performance, validation, etc.

  • Today, software is more and more distributed. In any case, coherency can't help two processes running on two different machines.

  • Even for multi-threaded SW, we end up using IPC.

  • Only a small part of the data is shared in multi-threaded software.

  • A large part of the data is not shared; where it is shared, memory barriers should solve cache syncing.

  • Practically, software developers depend on explicit locks to access shared data. These locks can efficiently flush the caches with hardware assistance ("efficiently" meaning only the cache lines that are modified AND also cached elsewhere). And this is already exactly what is done when we lock/unlock. Since everyone does the above lock/unlock, cache coherency is redundant and a waste of silicon space/power, and the hardware engineers can sleep.

  • All compilers (at least C/C++ and the Python VM) generate code for single-threaded, non-shared data. If I need to share the data, I just say that it is shared, but not how and why (volatile?). Developers need to take care of managing it (again, lock/unlock across hardware cores / software threads / hardware threads). Most of the time, we write in an HLL with non-atomic data. Cache coherency does not add any value for developers, so they fall back to managing it through locks, which instruct the cache system to flush efficiently. All cache systems keep track of which cache lines to flush efficiently, with or without coherency support. (Think of cached but non-coherent memory: this memory still has such records, which can be used for efficient flushing.)

  • Cache coherency is very complex in silicon, consuming space and power. In any case, software developers take care of issuing memory barriers (via locks).

So, I think, it would be good to get rid of coherency and tell developers to own it.

But I see the trend is the opposite. Look at CXL memory, etc. - it is coherent.

  • I am looking for a system call where I can just turn off cache coherency for my threads and experiment.
Prabhu U
  • Cache coherency is what makes locks and memory barriers sufficient on CPUs. You'd need different kinds of operations (like explicit flushes of shared data) *as well* as locking and/or barriers for inter-thread visibility if you didn't have coherent caches, as I said in my answer. – Peter Cordes Dec 26 '22 at 07:45
  • When you say "logs of which lines to flush", you mean dirty bits? Cache coherency avoids write-back for lines that other cores aren't actually reading, letting them stay dirty in the private L1d cache of the core that's been writing them. Those do need to be written back if/when they're ever evicted, but that doesn't need to happen on every lock/unlock. (So fine-grained locking, taking/releasing a lock around an individual access, is less bad than it would be if data had to be flushed to a shared / globally-visible level of cache on every unlock.) – Peter Cordes Dec 29 '22 at 04:13
  • (Your edit is an improvement, though; it does imply a somewhat plausible mechanism where either compilers would treat lock and unlock specially and flush all cache lines stored to inside a critical section, or `unlock` would flush all dirty data in the whole private cache.) – Peter Cordes Dec 29 '22 at 04:21