Semantics of atomic stores in MESI cachelines

Question

Consider a cache line that is concurrently read and written (loads and stores only). What happens when the line is owned by a core in modified-or-read mode, and some other core issues store operations on this line (assuming these reads and writes are std::atomic::load and std::atomic::store with C++ compilers)? Does the line get pulled into the other core that is issuing the writes? Or do the writes find their way over to the reading core directly as needed? The difference between the two is that the latter only incurs the overhead of one round trip for reading the value of the line, and it could possibly be parallelized as well (if the store and the read happen at staggered points in time).

This question arose when considering the consequences of NUMA in a concurrent application. But the question stands when the two cores involved are in the same NUMA node.

There are a large number of architectures in the mix. But for now, curious about what happens on Intel Skylake or Broadwell.

Peter Cordes · Accepted Answer · 2019-08-06T01:57:41.840

First of all, there's nothing special about atomic loads/stores vs. regular stores by the time they're compiled to asm. (The default seq_cst memory order can compile to xchg, but mov+mfence is also a valid (often slower) option, which is indistinguishable in asm from a plain release store followed by a full barrier.) xchg is an atomic RMW + a full barrier. Compilers use it for the full-barrier effect; the load part of the exchange is just an unwanted side-effect.
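As a minimal sketch of the C++ side of this (the names `shared_value`, `store_seq_cst`, `store_release`, `load_acquire` and `demo` are made up for illustration), these are the kinds of operations the question is about. The comments describe what x86-64 compilers typically emit; the exact instruction choice depends on the compiler and version.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> shared_value{0};

// Default memory order is seq_cst. On x86-64 this typically compiles to
// xchg, or alternatively to mov + mfence.
void store_seq_cst(int v) { shared_value.store(v); }

// A release store needs no barrier instruction on x86: it is a plain mov.
void store_release(int v) { shared_value.store(v, std::memory_order_release); }

// An acquire load is also a plain mov on x86.
int load_acquire() { return shared_value.load(std::memory_order_acquire); }

// Single-threaded sanity check: both store flavours write the same location
// and differ only in ordering guarantees, not in how the cache line moves.
void demo() {
    store_seq_cst(1);
    assert(load_acquire() == 1);
    store_release(2);
    assert(load_acquire() == 2);
}
```

Either way, the store part ends up as an ordinary store to the cache line, so the cache-coherency behaviour described in the rest of the answer is the same.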

The rest of this answer applies fully to any x86 asm store, or the store part of a memory-destination RMW instruction (whether it's atomic or not).

Initially the core that had previously been doing writes will have the line in MESI Modified state in its L1d, assuming it hasn't been evicted to L2 or L3 already.

The line changes MESI state (to shared) in response to a read request, or for stores the core doing the write will send an RFO (request for ownership) and eventually get the line in Modified or Exclusive state.
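To make those transitions concrete, here is a toy model of textbook MESI from the point of view of one core's copy of a line. This is only a sketch: real Intel hardware uses MESIF with L3 tags and snoop filters, and the enum and function names below are hypothetical.

```cpp
#include <cassert>

enum class Mesi { Modified, Exclusive, Shared, Invalid };

// Another core issues a read for this line: any valid copy here drops to
// Shared (a Modified copy has its dirty data written back first).
Mesi on_remote_read(Mesi s) {
    return s == Mesi::Invalid ? Mesi::Invalid : Mesi::Shared;
}

// Another core issues an RFO (it wants to store): our copy is invalidated.
Mesi on_remote_rfo(Mesi) { return Mesi::Invalid; }

// This core can only commit a store once it owns the line exclusively.
bool store_needs_rfo(Mesi s) {
    return s == Mesi::Shared || s == Mesi::Invalid;
}

// After the store commits, the line is Modified in this core.
Mesi on_local_store(Mesi) { return Mesi::Modified; }

// The scenario from the question: one writer core, one reader core.
void walk_through() {
    Mesi writer = Mesi::Modified, reader = Mesi::Invalid;

    // Reader loads the line: the writer's copy drops to Shared.
    writer = on_remote_read(writer);
    reader = Mesi::Shared;
    assert(writer == Mesi::Shared);

    // Writer stores again: it must send an RFO, which invalidates the reader.
    assert(store_needs_rfo(writer));
    reader = on_remote_rfo(reader);
    writer = on_local_store(writer);
    assert(writer == Mesi::Modified && reader == Mesi::Invalid);
}
```

In this model the line itself always moves to the core doing the store; there is no transition where a store is shipped to another core's copy.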

Getting data between physical cores on modern Intel CPUs always involves write-back to shared L3 (not necessarily to DRAM). I think this is true even on multi-socket systems where the two cores are on separate sockets so they don't really share a common L3, using snooping (and snoop filtering).

Intel uses MESIF. AMD uses MOESI, which allows sharing dirty data directly between cores without write-back to/from a shared outer level cache first.

For more details, see Which cache mapping technique is used in intel core i7 processor?

There's no way for store-data to reach another core except through cache/memory.

Your idea about the writes "happening" on another core is not how anything works. I don't see how it could even be implemented while respecting x86 memory ordering rules: stores from one core become globally visible in program order. I don't see how you could send stores (to different cache lines) to different cores and make sure one core waited for the other to commit those stores to the cache lines they each owned.

It's also not really plausible even for a weakly-ordered ISA. Often when you read or write a cache line, you're going to do more reads+writes. Sending each read or write request separately over a mesh interconnect between cores would require many, many tiny messages. High throughput is much easier to achieve than low latency: wider buses can do that. Low latency for loads is essential for high performance. If threads ever migrate between cores, all of a sudden they'll be reading/writing cache lines that are all hot in L1d on some other core, which would be horrible until the CPU somehow decided that it should migrate the cache line to the core accessing it.

L1d caches are small, fast, and relatively simple. The logic for ordering a core's reads+writes relative to each other, and for doing speculative loads, is all internal to a single core. (Store buffer, or on Intel actually a Memory Order Buffer to track speculative loads as well as stores.)

This is why you should avoid even touching a shared variable if you can prove you don't have to. (Or use exponential backoff for cases where that's appropriate.) And it's why a CAS loop should spin read-only, waiting to see the value it's looking for before even attempting a CAS, instead of hammering on the cache line with writes from failing lock cmpxchg attempts.
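A rough sketch of that "spin read-only, then CAS" pattern, written as a simple test-and-test-and-set spinlock. The `SpinLock` class is hypothetical and only illustrates the access pattern; `std::this_thread::yield()` stands in for whatever pause or backoff strategy fits your case.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class SpinLock {
    std::atomic<bool> locked{false};

public:
    void lock() {
        for (;;) {
            // Read-only spin: while the lock is held, this core only loads,
            // so it does not keep sending RFOs for the line.
            while (locked.load(std::memory_order_relaxed)) {
                std::this_thread::yield();  // stand-in for pause / backoff
            }
            // Only now attempt the RMW, which does need ownership of the line.
            bool expected = false;
            if (locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
                return;
            }
        }
    }

    bool try_lock() {
        // Same idea: check with a plain load before trying the CAS.
        if (locked.load(std::memory_order_relaxed)) return false;
        bool expected = false;
        return locked.compare_exchange_strong(expected, true,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }

    void unlock() {
        assert(locked.load(std::memory_order_relaxed));  // must be held
        locked.store(false, std::memory_order_release);
    }
};
```

The naive alternative, calling `compare_exchange` in a tight loop without the load first, makes every waiting core repeatedly request exclusive ownership of the line even when the CAS is going to fail.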

What about **snooping** and the description in section 11.2 (Caching Terminology) in Volume 3 (System Programming Guide) of the Intel Software Developer's Manual? That says (among other things) that the data may be passed between CPUs without being written to system memory. — 1201ProgramAlarm, Aug 05 '19 at 23:22
@1201ProgramAlarm When a cache line in the M state is evicted from the private caches of a processor/core due to a request from another processor/core, it must be written to the next *inclusive* level of the memory hierarchy so that the globally observable state of the line is not lost. The wording of Section 11.2 is written for the Pentium Pro processor where there could be multiple processors on the system bus and the next inclusive level of the hierarchy is main memory. In fact, that section does say in the last sentence that the memory controller must snoop all writebacks. — Hadi Brais, Aug 06 '19 at 00:17
The L1D follows the write-back, write-allocate write policy for all writes that are of the WB memory type, as clearly documented in Section 11.3. — Hadi Brais, Aug 06 '19 at 00:18
@1201ProgramAlarm: Like I said in my answer, Intel CPUs only require write-back as far as shared L3 cache, *not* all the way to memory. AMD's MOESI allows sharing dirty data between caches that aren't backed by a common shared cache. Intel CPUs can get data between cores without it going through DRAM, but it does have to go out to shared L3. (I think this is true even on multi-socket systems where not all cores share a common L3). — Peter Cordes, Aug 06 '19 at 00:38
@PeterCordes Thanks for the detailed answer! A followup question - for the MESIF and MOESI ways of operation, how does a core that is issuing a store find out which core owns the line in the first place? Where is this index maintained? (I guess this is a more general question and not tied to either MOESI or MESIF) — Curious, Aug 06 '19 at 00:58
@Curious: [Which cache mapping technique is used in intel core i7 processor?](//stackoverflow.com/q/49092541) On Intel other than Skylake-X the L3 tags work as a snoop filter within that socket. So the cache is always tag-inclusive even when a line is in Modified state in a private L1d cache (and thus data Invalid in L3). Otherwise it would have to broadcast a snoop to all cores. (On dual-socket systems before Skylake-X, the sockets did just snoop each other for everything. On quad-socket Xeons there's a small snoop filter cache. I assume SKX always has snoop filters.) — Peter Cordes, Aug 06 '19 at 01:03
@PeterCordes Oh, this is much more complicated than I first thought. Thanks! — Curious, Aug 06 '19 at 01:05
@Curious: Yes, that's why this answer doesn't even try to get into serious details, just a high-level picture. — Peter Cordes, Aug 06 '19 at 01:27
