@the8472's answer is correct, but I wanted to add an alternate answer.
https://www.felixcloutier.com/x86/CMPXCHG.html already specifies the behaviour in enough detail to rule out the possibility of spurious failure. If it could fail for some reason other than the value in memory not matching `eax`, the docs would have to say so.
You can also note the fact that compilers use a single `lock cmpxchg` for C++11 `std::atomic::compare_exchange_strong`, from which you can conclude that compiler writers think `lock cmpxchg` can't spuriously fail.
```cpp
#include <atomic>

bool cas_bool(std::atomic_int *a, int expect, int want) {
    return a->compare_exchange_strong(expect, want);
}
```
compiles to (gcc7.3 -O3):
```asm
cas_bool(std::atomic<int>*, int, int):
        mov     eax, esi
        lock cmpxchg    DWORD PTR [rdi], edx
        sete    al
        ret
```
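As a side note (my own observation, worth verifying on your compiler): the weak version should compile to the same single `lock cmpxchg` on x86, because there's no LL/SC-style mechanism that could fail spuriously for it to map onto.

```cpp
#include <atomic>

// On x86 this should produce the same mov / lock cmpxchg / sete sequence as
// the strong version above: lock cmpxchg itself can't fail spuriously, so the
// extra latitude compare_exchange_weak gives the compiler buys nothing here.
bool cas_bool_weak(std::atomic_int *a, int expect, int want) {
    return a->compare_exchange_weak(expect, want);
}
```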
See also Can num++ be atomic for 'int num'? for more details of how `lock`ed instructions are implemented internally, and how they interact with MESI. (i.e. @the8472's answer is the short version: for an operand that doesn't cross a cache line, a core just hangs onto that cache line so nothing else in the system can read or write it for the duration of the `lock cmpxchg`.)
> the destination operand receives a write cycle without regard to the result of the comparison
The read + write pair is atomic with respect to all other observers in the system. The ordering you propose, of read1 / read2 / write1 / abort write2, is impossible because `lock cmpxchg` is atomic, so read2 can't appear between read1 and write1 in the global order.
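That indivisibility is what CAS-based algorithms rely on. As a sketch of my own (not part of the question), here's an atomic-max loop whose correctness depends on the compare and the store being one indivisible step:

```cpp
#include <atomic>

// Lock-free "store max" built from a CAS loop.  Its correctness relies on the
// compare and the store inside each attempt being one indivisible RMW: if the
// compare succeeds, no other core's write can land between it and the store.
void atomic_max(std::atomic_int &a, int val) {
    int cur = a.load(std::memory_order_relaxed);
    while (cur < val && !a.compare_exchange_weak(cur, val)) {
        // on failure, cur is reloaded with the current value; retry if still smaller
    }
}
```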
Also, that quoted language only applies to the external memory bus. Modern CPUs with integrated memory controllers can do whatever they want internally (for a `lock cmpxchg` on an address that's split across two cache lines). Intel may publish documentation for motherboard vendors to use in their internal testing of signals on the memory bus. That documentation might still be relevant for a `lock cmpxchg` on an MMIO address, but definitely not for an aligned operand in write-back memory. In that case, it's just a cache lock. (And it's a hidden implementation detail whether the L1d cache is written or not when the compare fails.) I guess you could test this by seeing whether a failing `lock cmpxchg` dirties the cache line (i.e. puts it in Modified state instead of Exclusive).
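If you wanted to actually run that experiment, here's a rough sketch of my own (untested; the relevant hardware-counter event names are microarchitecture-specific, so they're left to a tool like `perf` rather than hard-coded): one thread hammers an always-failing `compare_exchange_strong` on a line that another thread only reads, and you compare snoop/writeback counts against a pure-load baseline.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Hypothetical experiment: does a *failing* lock cmpxchg dirty the line?
// Run under a hardware-counter tool and look at snoop-HITM / cache-writeback
// events for your microarchitecture, comparing against a run where the writer
// thread is disabled.
alignas(64) std::atomic_int victim{0};

int main() {
    std::atomic_bool stop{false};

    std::thread writer([&] {
        while (!stop.load(std::memory_order_relaxed)) {
            int expect = 1;                              // victim stays 0, so this never matches
            victim.compare_exchange_strong(expect, 2);   // always-failing CAS
        }
    });

    long sink = 0;
    for (int i = 0; i < 100'000'000; ++i)
        sink += victim.load(std::memory_order_relaxed);  // reader, hopefully on another core

    stop.store(true, std::memory_order_relaxed);
    writer.join();
    std::printf("sink=%ld\n", sink);                     // keep the loads from being optimized away
}
```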
For more discussion about how `lock cmpxchg` might work internally vs. `xchg`, see the chat thread between me and @BeeOnRope following my answer on Exit critical region: https://chat.stackoverflow.com/transcript/message/42472667#42472667. (Mostly me having ideas that could work in theory but are incompatible with what we know about Intel x86 CPUs, and @BeeOnRope pointing out my mistakes.) There's very little we can conclude for sure about the fine details of the efficiency of `xchg` vs. `lock cmpxchg`. It's certainly possible that `xchg` keeps the cache line locked for fewer cycles than `lock cmpxchg`, but that needs to be tested. I do think `xchg` has better latency when used back-to-back on the same location from a single thread, though.
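If you wanted a first-order data point on that, a single-threaded microbenchmark sketch (my own, uncontended, so only suggestive) could time back-to-back `exchange` (which compiles to `xchg`) against back-to-back `compare_exchange_strong` (`lock cmpxchg`) on the same location:

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>

// Time N back-to-back atomic RMWs on the same location from one thread.
// exchange() should compile to xchg, compare_exchange_strong() to lock cmpxchg.
template <class Op>
static double time_ns_per_op(Op op, long iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) op();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main() {
    std::atomic_int a{0};
    const long iters = 50'000'000;

    double xchg_ns = time_ns_per_op([&] { a.exchange(1); }, iters);
    double cas_ns  = time_ns_per_op([&] {
        int expect = 1;                          // a is already 1 here, so the CAS succeeds every time
        a.compare_exchange_strong(expect, 1);
    }, iters);

    std::printf("xchg:         %.2f ns/op\n", xchg_ns);
    std::printf("lock cmpxchg: %.2f ns/op\n", cas_ns);
}
```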