
I am trying to understand why cache coherence protocols are designed the way they are. The goal of cache coherence is to serialize reads and writes to a particular memory location across all cores.

Suppose writes to memory location A are serialized as A1, A2, A3. Then, once a core reads the value A2, it can never read A1 again, but it may read A3 at some point in the future.

I understand this to be the goal of cache coherence protocols.
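The invariant described above can be sketched as a toy model (names like `CoherentLocation` are invented for illustration, not part of any real protocol): writes to one location form a single global order, and each core's reads can only move forward through that order, never backward.

```python
class CoherentLocation:
    """One memory location whose writes are globally serialized."""
    def __init__(self, initial):
        self.history = [initial]    # serialized write order: A1, A2, A3, ...
        self.last_seen = {}         # per-core index of the newest write observed

    def write(self, core, value):
        self.history.append(value)  # writes are appended in serialization order
        self.last_seen[core] = len(self.history) - 1

    def read(self, core, version):
        # Coherence: a core may only observe a write at or after the newest
        # one it has already seen -- it can never go backwards to A1 after A2.
        assert version >= self.last_seen.get(core, 0)
        self.last_seen[core] = version
        return self.history[version]

loc = CoherentLocation("A1")
loc.write(core=0, value="A2")
loc.write(core=0, value="A3")
print(loc.read(core=1, version=1))   # core 1 observes A2
print(loc.read(core=1, version=2))   # later it may observe A3
# loc.read(core=1, version=1) would now trip the assert: A1/A2 are stale
```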

The standard protocols I have studied (MSI, MESI, etc.) involve communication among cores on every read/write (or every few). This introduces cache coherence traffic.

Why don't cache coherence protocols communicate only when

  1. evicting a dirty cache line, or
  2. another core wants to read a cache line that is dirty in some other processor's cache?

Why are cache coherence protocols "proactive" rather than "passive"? The strategy I suggest would, I believe, still serialize reads/writes to a particular memory location while saving needless coherence traffic.

Peter Cordes
driewguy
  • Some are passive: see https://preshing.com/20120930/weak-vs-strong-memory-models/ and https://devblogs.microsoft.com/oldnewthing/20170817-00/?p=96835 – Erik Eidt Apr 04 '22 at 17:02
  • 1
    Keep in mind writes that aren't full-line imply a read-for-ownership. You can't allow different cores to have partial writes to the same line in their cache, unless you have some per-byte dirty bit to allow some kind of merging. And still, would need some ordering. – Peter Cordes Apr 04 '22 at 22:22
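A tiny illustration (invented model, not real hardware) of the problem that comment describes: without read-for-ownership, two cores can each hold partial writes to the *same* line, and whole-line writebacks silently lose one core's update unless per-byte dirty bits exist to merge them.

```python
LINE_SIZE = 8
memory_line = bytearray(LINE_SIZE)      # the line's home in memory

# Each core caches the whole line and dirties only part of it.
core_a = bytearray(memory_line)
core_a[0] = 0xAA                        # core A writes byte 0 of the line
core_b = bytearray(memory_line)
core_b[4] = 0xBB                        # core B writes byte 4 of the same line

# "Passive" eviction writes back whole lines; with no per-byte dirty
# bits there is nothing to merge with, so the later writeback clobbers
# the earlier core's update.
memory_line[:] = core_a
memory_line[:] = core_b

print(hex(memory_line[0]))   # 0x0  -- core A's write to byte 0 is lost
print(hex(memory_line[4]))   # 0xbb -- core B's write survives
```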
  • Also, real CPUs that implement MESI don't spam all cores with broadcasts all the time; they normally use a directory to track this stuff as part of resolving a cache miss. (e.g. Intel since Nehalem (i7 etc.) uses the tags of inclusive L3 shared cache to track state. – Peter Cordes Apr 04 '22 at 22:23
  • @ErikEidt The links you have posted are talking about weak memory models and in general memory consistency issues. My question is about cache coherence – driewguy Apr 05 '22 at 03:27
  • @PeterCordes I am sure I am missing something. But the case you mentioned could easily happen in protocols like MESI since MESI (or other protocols based on it) do not wait for acknowledgement of invalidate messages (I have never read about acks in MESI-like protocols, hence this is my belief) e.g. A and B both update memory location x and x + word_size respectively and both simultaneously send invalidation messages. How is the problem solved in such protocols? – driewguy Apr 05 '22 at 03:50
  • 1
    Old / simplistic descriptions of MESI talk about a shared bus that all processor snoop. That's ridiculous by modern standards; modern CPU interconnects are often a ring bus (like Intel) or a mesh network (Skylake-Xeon and later Xeons), with directory based not snoop. e.g. https://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4 shows Haswell Xeons with their ring bus(es). Anyway, a core has to get exclusive ownership of a cache line *before* committing stores from its store buffer to L1d cache; that means receiving a response to its RFO (read for ownership). – Peter Cordes Apr 05 '22 at 04:05
  • @PeterCordes That makes sense! Passive merging of cache line which will entail due to my strategy would be very messy. I was wondering how is the same problem solved in snooping protocols in "old/simplistic" MESI as you mentioned. Or do they have weak guarantees? And if such weak guarantees were fine, why would my strategy be wrong? I believe there is something fundamentally wrong with my strategy but could not figure it out. – driewguy Apr 05 '22 at 04:17
  • Even Wikipedia does describe that RFO is a thing: https://en.wikipedia.org/wiki/MESI_protocol#Read_For_Ownership - but you're right it doesn't make it clear that a writer has to wait until it actually gets the data back, I guess because the whole article is about a snooping MESI implementation. See also [Reducing bus traffic for cache line invalidation](https://stackoverflow.com/q/62614838) where I took a stab at explaining how directory-based coherence is different. Also [The ordering of L1 cache controller to process memory requests from CPU](https://stackoverflow.com/q/38034701) – Peter Cordes Apr 05 '22 at 04:17
  • MESI never lets a core store until after its obtained exclusive ownership. That's how it maintains coherency. The way that's avoided is by doing a "BusRdX" (aka RFO) as described in the wikipedia article. Note the entry in Table 1.2 where a line with a Modified (dirty) copy of a line that sees a BusRdX request by another core must place the line contents onto the bus, for that other core to read. Part of getting a line in Exclusive state is getting a copy of the latest value of the whole line, and you can't commit a store to cache until you get that. – Peter Cordes Apr 05 '22 at 04:22
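The rule in that comment can be sketched as a minimal single-line MESI model (the `Core`/`bus_rdx`/`store` names are invented for illustration): a store may only commit once the core has reached Exclusive/Modified via a BusRdX (RFO), which both invalidates other copies and delivers the latest line contents, with a dirty owner supplying the data.

```python
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

class Core:
    def __init__(self, name):
        self.name, self.state, self.data = name, I, None

def bus_rdx(requester, cores, memory):
    """RFO: invalidate all other copies; a dirty owner supplies the line."""
    data = memory["line"]
    for c in cores:
        if c is not requester:
            if c.state == M:            # dirty owner places the line on the bus
                data = c.data
            c.state, c.data = I, None   # every other copy is invalidated
    requester.state, requester.data = E, data
    return data

def store(core, cores, memory, value):
    if core.state not in (M, E):        # no store commit without ownership
        bus_rdx(core, cores, memory)    # must wait for the RFO response
    core.data, core.state = value, M

memory = {"line": "v0"}
a, b = Core("A"), Core("B")
store(a, [a, b], memory, "v1")   # A's RFO fetches v0, then A stores v1
store(b, [a, b], memory, "v2")   # B's RFO pulls v1 from dirty A, invalidates A
print(a.state, b.state, b.data)  # Invalid Modified v2
```

Because ownership is exclusive by construction, writes to the line are serialized in the order the RFOs are granted.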
  • This is why an atomic RMW operation can be implemented just by not responding to an RFO until after the ending store, in a system using MESI that allows that. e.g. x86 `lock add [rdi], eax` doesn't have to lock the whole bus, [just the cache line in that one core](https://stackoverflow.com/questions/39393850/can-num-be-atomic-for-int-num), which is why all cores in the system can be doing atomic increments to separate cache lines at the same time. – Peter Cordes Apr 05 '22 at 04:25
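The atomic-RMW point in that comment can be mimicked in software (a hedged sketch, not real silicon: a `threading.Lock` stands in for "holding the line in Modified state and deferring other cores' RFOs"): while one core owns the line, the read-modify-write completes without interleaving, so no bus-wide lock is needed and increments are never lost.

```python
import threading

class Line:
    def __init__(self, value=0):
        self.value = value
        self._owner = threading.Lock()  # stand-in for holding the line in M state

    def atomic_add(self, n):
        with self._owner:               # other "RFOs" are deferred while held
            v = self.value              # read the owned copy
            self.value = v + n          # modify and write back before releasing

line = Line()
threads = [threading.Thread(target=line.atomic_add, args=(1,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(line.value)   # 8 -- no increment is lost
```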

0 Answers