
I have some misunderstandings about cache lines. I'm using Haswell and Ubuntu. Now let's say we have a 2-threaded application in which the following happens.

mov [addr], dword 0xAC763F
;starting Thread 1 and Thread 2

Now let's say the threads perform the following actions in parallel:

Thread 1                     Thread 2
mov rax, [addr]              mov rax, [addr]
mov [addr], dword 1          mov [addr], dword 2

Now my understanding of what's going on is this:

  1. Before starting the threads, the main thread writes to the corresponding cache line (addr) and marks it as Exclusive.
  2. If both Thread 1 and Thread 2 finish reading before either write starts, then the cache line is in the Shared state in both cores' caches.

Now I don't understand whose cache line is marked as Invalid if mov [addr], dword 1 in Thread 1 and mov [addr], dword 2 in Thread 2 happen "at the same time".

First of all, "at the same time" feels a bit blurry. I think of this as "during the same CPU clock cycle". How does a MESI protocol implementation solve this problem of simultaneous writes from different threads?
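For reference, here is the same scenario as a runnable C11 sketch (an illustration, not part of the original snippet; the file name is made up). Relaxed atomic loads and stores compile to plain mov on x86, so it maps directly onto the assembly above.

    /* race.c (hypothetical name) -- build: gcc -O2 -pthread race.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    static atomic_int addr;

    static void *worker(void *arg)
    {
        int myval = (int)(intptr_t)arg;
        /* load: brings the line into this core's cache (Shared state) */
        int seen = atomic_load_explicit(&addr, memory_order_relaxed);
        /* store: a plain mov on x86, but the core must still get the
           line into Exclusive state (via an RFO) before it can commit */
        atomic_store_explicit(&addr, myval, memory_order_relaxed);
        printf("thread %d read 0x%x\n", myval, seen);
        return NULL;
    }

    int main(void)
    {
        atomic_store_explicit(&addr, 0xAC763F, memory_order_relaxed);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1);
        pthread_create(&t2, NULL, worker, (void *)2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* prints 1 or 2, depending on whose RFO wins */
        printf("final: %d\n", atomic_load_explicit(&addr, memory_order_relaxed));
        return 0;
    }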

  • On a bus (the topology for which MESI was developed), there is no “at the same time.” All accesses are serialised; if two cores want to access the bus at the same time, a component called a *bus arbiter* gives one of the two cores the right to proceed first. – fuz Feb 25 '18 at 00:42
  • Note that Haswell is a bad example for this as it doesn't have a bus topology and uses MESIF instead of MESI as far as I know. – fuz Feb 25 '18 at 00:44
  • @fuz Thanks for the reply. May I ask if there is a tool to trace coherence traffic? Can we use perf for that? If so, which events should I collect counters for? – Linda Turasco Feb 25 '18 at 00:48
  • @fuz You mentioned that all accesses to the bus are serialized. But is a bus transaction required if we are writing to a cache line? Can you expand a bit? – Linda Turasco Feb 25 '18 at 00:52
  • @fuz If all accesses are serialized I would expect some "bus" queue for cache events... – Linda Turasco Feb 25 '18 at 00:53
  • I think you can look at the various _RFO_ counters to get an idea (see the perf sketch after these comments). Depending on CPU model you might also have _L2 cache lines in I/S/E state_ counters. – Jester Feb 25 '18 at 01:31
  • There isn't a bus transaction to write to a cache line, but there certainly is to acquire exclusive access to it. – prl Feb 25 '18 at 04:06
  • @LindaTurasco It depends on what state the cache line is in. A bus transaction is required to read the cache line in I state or to make it exclusive in S state. In M and E states, clearly no bus transaction is required. – fuz Feb 25 '18 at 11:30

1 Answer


I think of this as "during the same CPU clock cycle"

Different cores can use different clocks; e.g. one core can be at 4GHz while another is at 800MHz. (Only for Haswell Xeon, though; the dual / quad-core parts have all cores in a single clock domain. I've read that, and it matches what you see from looking at CPU frequency on idle cores while one core is busy.)

Related: What happens when different CPU cores write to the same RAM address without synchronization? is a very similar question. (But the OP of that question doesn't know what MESI is). Still, I went into more detail there about sending out RFO requests, so maybe read that answer if this one is too terse.

Before starting the threads, the main thread writes to the corresponding cache line (addr) and marks it as Exclusive.

A core has to get the cache line in Exclusive state before it can modify it. Actually committing a write to L1D flips it from Exclusive to Modified with no communication with other cores. (L1D and L2 are write-back caches).

But yes, if both cores read the cache line before either of them writes, they will both have the line in Shared state. They can only flip the line to Exclusive after receiving a successful reply to an RFO request for that line. Wikipedia's MESI article has a diagram of state transitions, and a definition of RFO.
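As an illustration, the transitions relevant to this example can be sketched as a toy state machine. This is simplified: real implementations (including Intel's MESIF) have more states and more message types.

    #include <stdio.h>

    /* Toy MESI state for a single line in one core's cache. */
    typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;

    /* This core wants to write the line. */
    static mesi_t on_local_write(mesi_t s)
    {
        switch (s) {
        case MODIFIED:  return MODIFIED;   /* already ours, just write      */
        case EXCLUSIVE: return MODIFIED;   /* silent E -> M, no bus traffic */
        case SHARED:
        case INVALID:
            /* must send an RFO and wait for it to win arbitration */
            return MODIFIED;
        }
        return INVALID;
    }

    /* Another core's RFO for this line arrives. */
    static mesi_t on_remote_rfo(mesi_t s)
    {
        switch (s) {
        case MODIFIED: /* write back / forward the dirty data */ return INVALID;
        default:       /* drop our clean copy */                 return INVALID;
        }
    }

    int main(void)
    {
        /* Walk the example: both threads have read, so the line is Shared
           in both caches; then Thread 1's RFO wins arbitration. */
        mesi_t t1 = SHARED, t2 = SHARED;
        t1 = on_local_write(t1);  /* T1: S -> M after its RFO succeeds */
        t2 = on_remote_rfo(t2);   /* T2: S -> I on receiving T1's RFO  */
        t2 = on_local_write(t2);  /* T2: I -> M after its own RFO      */
        t1 = on_remote_rfo(t1);   /* T1: M -> I; T2's store wins       */
        printf("t1=%d t2=%d\n", t1, t2);
        return 0;
    }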


It's certainly possible for conflicting RFO requests to be in flight at once. They take many cycles to arrive at another core, so there's plenty of time for stores on different cores to each initiate an RFO before either receives the other's. (Not that receiving one would stop a core from sending its own RFO: writing to a line in either Invalid or Shared state needs an RFO to get it to Exclusive state so the store can commit.)

I'm not 100% sure that the decision of which request wins would be made in the L3 cache. But:

Haswell's L3 is inclusive, and used as a backstop / filter for coherency traffic. Instead of actually broadcasting all requests to all cores, L3 is tag-inclusive with extra info to track which cores (might) have copies of which line. L1 and L2 are private per-core, so L3 is the first shared level of cache.

I'm guessing that L3 handles arbitration for which core's RFO completes first, because it's already keeping track of which cores (might) need to see which RFOs. Presumably this is done in the slice of L3 that holds the relevant physical address.
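To make that concrete, here's a hypothetical sketch of what a snoop-filter entry in an inclusive L3 slice might track. The field names are made up for illustration; Intel doesn't document the real layout.

    #include <stdint.h>

    /* One entry of a hypothetical inclusive-L3 tag directory. */
    struct l3_tag_entry {
        uint64_t tag;       /* which physical line this entry covers   */
        uint8_t  present;   /* bitmask: cores that *might* hold a copy */
    };

    /* On an RFO from `requester`, L3 only needs to send invalidates to
       the cores whose bits are set, instead of broadcasting to all. */
    static uint8_t cores_to_invalidate(const struct l3_tag_entry *e,
                                       unsigned requester)
    {
        return e->present & (uint8_t)~(1u << requester);
    }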

As @fuz points out, MESI is designed around a bus topology, not a more complex network where messages are routed. Intel's design has the same states, but doesn't literally have to work as simply as the usual CPU-architecture descriptions say.

So, what we can say for sure is: via some unknown internal mechanism, the CPU decides that one RFO was first. A later one arriving while the first one is still doing a round-trip might be cancelled (so that core has to retry later), or it might be buffered.

We do know that Intel CPUs have a hardware arbitration mechanism for contending atomic RMW operations, like `lock add [mem], eax`. Presumably it's exactly the same mechanism that arbitrates multiple write-only accesses to the same line, because the only difference is that a locked operation holds onto the line for the duration of the operation, deferring responses to invalidate requests until it's done.
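To see that arbitration from software, contrast the racy plain stores with locked RMWs. In this hedged C11 sketch (again illustrative, with a made-up file name), the increment typically compiles to `lock add` on x86 (compilers use `lock xadd` when the old value is needed):

    /* rmw.c (hypothetical name) -- build: gcc -O2 -pthread rmw.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int counter;

    static void *bump(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            /* typically `lock add dword [counter], 1`: the core holds the
               line in Modified state for the whole read-modify-write,
               deferring invalidate requests until it's done */
            atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* always 2000000: the hardware serialises the contending RMWs */
        printf("%d\n", atomic_load(&counter));
        return 0;
    }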


You can talk about multiple RFO requests arriving at the same slice of L3 in the same clock cycle of the "uncore" clock that the L3 uses.

That is probably possible in practice on Haswell; cores are connected with a bi-directional ring bus (32-bytes wide each way), so two messages per (uncore) clock cycle can arrive at any given slice of L3 cache. Also, each slice of L3 is connected to a core, so a request from that core could also arrive at the same time.

In that case, it's probably simple: if a slice can even receive multiple messages destined for it (rather than just passing through on the ring) in the same cycle, it's probably hard-wired so that one of the three sources for that slice always wins.
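In logic terms a fixed-priority pick is trivial; purely as an illustration (not Intel's actual circuit), something like:

    /* Fixed-priority arbiter: given a request bitmask (bit 0 = ring
       direction A, bit 1 = ring direction B, bit 2 = the attached core;
       the labels are made up), grant the lowest set bit every cycle. */
    static inline unsigned grant(unsigned req_mask)
    {
        return req_mask & -req_mask;   /* isolate the lowest set bit */
    }

So grant(0b110) == 0b010: source 1 beats source 2 whenever both request in the same cycle.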

  • Thanks for the exhaustive answer! :) – Linda Turasco Feb 25 '18 at 09:22
  • But could you please clarify when the changes are sent to main memory? It seems like `L3` is enough. Do we need to use `clflushopt` explicitly? – Linda Turasco Feb 25 '18 at 09:25
  • @LindaTurasco: You only need `clflushopt` if you care about data being in RAM vs. cache, e.g. if you have non-volatile DIMMs (https://en.wikipedia.org/wiki/NVDIMM); see the sketch at the end of this thread. Data is written to memory when it's evicted from L3 (or with NT stores, or stores to uncacheable / USWC / write-through memory regions or pages). "Normal" MESI would write back to memory before another cache can load a dirty line (i.e. it can't transfer a dirty line between caches), but Intel actually lets L3 backstop coherency traffic, i.e. real CPUs let dirty lines go up the hierarchy to outer caches. Simple MESI is for one-level caches. – Peter Cordes Feb 25 '18 at 09:34
  • According to this somewhat old [document](https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf), there is a Global Queue in the uncore (which, together with the L3, forms the Caching Agent). It handles the requests from the SuperQueues in the cores, the Home Agent, and the QPI (maybe UPI). I speculate it is probably a FIFO queue with PCI GNT#-like arbitration (one that generates backpressure to all the sources), and something similar is still present today. The ring is not depicted in the linked doc; it may be too old or too high-level. – Margaret Bloom Feb 25 '18 at 15:58
  • In that case, the first RFO is simply queued first. I don't think the other in-flight RFOs need to be cancelled, though I'm not sure. Anyway, if there really is HW contention somewhere, it will be resolved by hardwired logic, like you said. Just sharing the doc in case you find it interesting :) – Margaret Bloom Feb 25 '18 at 15:59
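For reference on the `clflushopt` discussion above, the instruction is exposed as a compiler intrinsic. A minimal hedged sketch (the buffer is a made-up example; build with gcc -mclflushopt):

    #include <immintrin.h>

    static char buf[64] __attribute__((aligned(64)));

    void flush_line(void)
    {
        buf[0] = 42;           /* dirty the line in cache              */
        _mm_clflushopt(buf);   /* queue a flush of that line to memory */
        _mm_sfence();          /* order the flush before later stores  */
    }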