First of all, there's nothing special about atomic loads/stores vs. regular stores by the time they're compiled to asm. (Although the default seq_cst
memory order can compile to xchg
, but mov
+mfence
is also a valid (often slower) option which is indistinguishable in asm from a plain release store followed by a full barrier.) xchg
is an atomic RMW + a full barrier. Compilers use it for the full-barrier effect; the load part of the exchange is just an unwanted side-effect.
The rest of this answer applies fully to any x86 asm store, or the store part of a memory-destination RMW instruction (whether it's atomic or not).
Initially the core that had previously been doing writes will have the line in MESI Modified state in its L1d, assuming it hasn't been evicted to L2 or L3 already.
The line changes MESI state (to shared) in response to a read request, or for stores the core doing the write will send an RFO (request for ownership) and eventually get the line in Modified or Exclusive state.
Getting data between physical cores on modern Intel CPUs always involves write-back to shared L3 (not necessarily to DRAM). I think this is true even on multi-socket systems where the two cores are on separate sockets so they don't really share a common L3, using snooping (and snoop filtering).
Intel uses MESIF. AMD uses MOESI which allows sharing dirty data directly between cores directly without write-back to/from a shared outer level cache first.
For more details, see Which cache mapping technique is used in intel core i7 processor?
There's no way for store-data to reach another core except through cache/memory.
Your idea about the writes "happening" on another core is not how anything works. I don't see how it could even be implemented while respecting x86 memory ordering rules: stores from one core become globally visible in program order. I don't see how you could send stores (to different cache lines) to different cores and make sure one core waited for the other to commit those stores to the cache lines they each owned.
It's also not really plausible even for a weakly-ordered ISA. Often when you read or write a cache line, you're going to do more reads+writes. Sending each read or write request separately over a mesh interconnect between cores would require many many tiny messages. High throughput is much easier to achieve than low latency: wider buses can do that. Low latency for loads is essential for high performance. If threads ever migrate between cores, all of a sudden they'll be read/writing cache lines that are all hot in L1d on some other core, which would be horrible until the CPU somehow decided that it should migrate the cache line to the core accessing it.
L1d caches are small, fast, and relatively simple. The logic for ordering a core's reads+writes relative to each other, and for doing speculative loads, is all internal to a single core. (Store buffer, or on Intel actually a Memory Order Buffer to track speculative loads as well as stores.)
This is why you should avoid even touching a shared variable if you can prove you don't have to. (Or use exponential backoff for cases where that's appropriate). And why a CAS loop should spin read-only waiting to see the value its looking for before even attempting a CAS, instead of hammering on the cache line with writes from failing lock cmpxchg
attempts.