I understand that the MESI protocol guarantees that all cores see a coherent view of memory (the caches). My confusion is this: during a write, MESI already guarantees that the cache line is exclusively owned by one CPU, and an atomic CMPXCHG just compares and exchanges values atomically. So why do we need the LOCK prefix, and thus lock the cache line, when we already have that guarantee from the MESI protocol?
-
Because the cache is accessed twice (to read and to write back the result) and you don't want other cores to interfere between those operations – LWimsey May 05 '19 at 19:53
-
That's why I'm confused... why is the cache accessed twice when there's only one instruction? Shouldn't it be accessed only once, with exclusive access? – shota silagadze May 05 '19 at 19:56
-
If you are going to atomically modify a value, the CPU first needs to load the current value from the cache into a register, then modify it inside the register, and then write it back to the cache. From the outside it looks atomic, but it isn't, hence the locking – LWimsey May 05 '19 at 20:01
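A minimal C++ sketch of those three steps (not from the original thread; the names `counter` and `increment_once` are illustrative), assuming a plain non-atomic variable and a typical x86 compiler:

```cpp
// A plain `counter += 1` compiles to roughly this shape on x86:
// a load, a register add, and a store. The cache is touched twice
// (read, then write-back), not once.
int counter = 0;

void increment_once() {
    int tmp = counter;  // 1. load the current value from the cache into a register
    tmp += 1;           // 2. modify it inside the register
    counter = tmp;      // 3. write the result back to the cache line
    // Another core can read or write the same line between steps 1 and 3.
}
```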
-
So from the multithreaded view an atomic add is a single operation, and a context switch of that thread cannot happen between the steps you mentioned; but from the MESI protocol's perspective there are three operations, and exclusive cache access is only required in the last step, when the write happens. So we need the LOCK prefix to lock that cache line during all three steps (even though it's a single instruction in assembly) – is that right? – shota silagadze May 05 '19 at 20:13
-
Yes, you can say that the RMW is a single operation and a context switch won't occur as long as that operation has not completed. But since the cache is accessed twice and modified, the executing core A needs the cache line read-write (MESI: 'Modified' or 'Exclusive'). Without the lock, an RMW on another core B could request the same cache line, and MESI would then mark the cache line on core A 'Invalid'. Core A can (and will) get the cache line back, but by then core B may have modified it. – LWimsey May 05 '19 at 20:50
-
I am not sure where you get that 3 MESI operations are involved, but don't overthink MESI. It's a coherency protocol that guarantees a coherent memory view. The lock is necessary because MESI operates at a lower level than the RMW operation, and MESI doesn't know that the core needs uninterrupted access twice. During the RMW operation (while the lock is active), MESI cannot change the state of the cache line. – LWimsey May 05 '19 at 20:52
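In source terms, a locked RMW is what `std::atomic` gives you. A hedged sketch (not from the thread; names illustrative, assuming a mainstream compiler targeting x86):

```cpp
#include <atomic>

std::atomic<int> counter{0};

// fetch_add is a locked read-modify-write: x86 compilers typically emit
// `lock xadd` (or `lock add` when the old value is unused). The lock prefix
// keeps the line owned by this core for the whole load+add+store, which is
// the "uninterrupted access twice" described in the comment above.
int bump() {
    return counter.fetch_add(1, std::memory_order_seq_cst);
}
```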
-
If the cache had this property of implicitly locking a memory location, then modifying the same location in close succession could lock up the whole computer! – curiousguy May 06 '19 at 23:29
1 Answer
> atomic CMPXCHG just compares and exchanges values atomically
No, the cache-access hardware doesn't implement CMPXCHG as a single-cycle inherently-atomic operation. It's synthesized out of multiple uops that load and separately store.
If that's how regular CMPXCHG worked, your reasoning would be correct. But regular CMPXCHG is not atomic (for observers on other cores).
`lock cmpxchg` decodes to multiple uops that keep the cache line "locked" from the load to the store, turning it into a single atomic transaction as far as any other observer in the system can see (i.e. the core delays responding to MESI invalidate or share requests for that line until after the store commits). It also makes it a full memory barrier.
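In C++ source terms, this is what `std::atomic<T>::compare_exchange_strong` gives you. A hedged sketch (names `value` and `try_set` are illustrative; assuming a typical compiler targeting x86, where this compiles to `lock cmpxchg`):

```cpp
#include <atomic>

std::atomic<int> value{0};

// compare_exchange_strong is the portable spelling of a locked CAS:
// the compare and the conditional store happen as one atomic transaction,
// and with seq_cst ordering it also acts as a full memory barrier.
bool try_set(int expected, int desired) {
    return value.compare_exchange_strong(expected, desired,
                                         std::memory_order_seq_cst);
}
```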
Without `lock`, CMPXCHG decodes to multiple uops that load, check the compare condition, and then either store a new value or not, according to the compare result. As far as atomicity goes, it's the same as `add [mem], edx`, which uses the ALU for the addition in between the load and store uops. i.e. it's not atomic, except on the same core with respect to interrupts (because interrupts can only happen at an instruction boundary).
The load and store are each separately atomic, but they aren't a single atomic RMW transaction. If another core invalidates our copy of the cache line and stores a new value between our load and our store, our store will step on the other store. And that other store will appear in the global order of operations on that cache line between our load and store, violating the definition of "atomic" = indivisible.
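A hedged demo of that lost-update window (not from the original answer): the plain counter below is a deliberate data race, which is formally undefined behaviour in C++, and is shown only to make the effect visible next to a locked RMW.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

int plain = 0;               // non-atomic: separate load / add / store
std::atomic<int> locked{0};  // locked RMW (`lock add` / `lock xadd` on x86)

void worker() {
    for (int i = 0; i < 1000000; ++i) {
        ++plain;             // another thread's store can land between our
                             // load and our store, and we overwrite it
        locked.fetch_add(1); // one indivisible transaction per increment
    }
}

int main() {
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    std::cout << "plain:  " << plain  << '\n'   // typically less than 2000000
              << "locked: " << locked << '\n';  // always exactly 2000000
}
```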
Related:

- Can num++ be atomic for 'int num'? – why `add [mem], edx` isn't atomic, and how `lock` works to make it atomic.
- Is x86 CMPXCHG atomic, if so why does it need LOCK? – use-cases for `cmpxchg` without `lock`: uniprocessor machines.

-
"_because interrupts can only happen at an instruction boundary_" even for "string" instructions? – curiousguy May 06 '19 at 18:22
-
@curiousguy: I over-simplified, oops. Each iteration of a `rep`-string instruction counts as a separate instruction for interruptibility. Gather loads and scatter stores are also interruptible, with progress recorded in the mask vector (or AVX512 mask register). They might not be implemented that way in practice, except for self-generated exceptions. I don't think there are any others, [last time this came up](https://stackoverflow.com/questions/54821523/in-x86-intel-vt-x-non-root-mode-can-an-interrupt-be-delivered-at-every-instruct#comment96419181_54821523) that's all we could think of. – Peter Cordes May 06 '19 at 21:00