
I'm reading the [lock cmpxchg description](https://www.felixcloutier.com/x86/CMPXCHG.html):

> This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

Now consider two threads executing lock cmpxchg:

Thread 1                    Thread 2
mov ebx, 0x4000             mov ebx, 0x4000             ; address
mov edx, 0x62ab6            mov edx, 0x62ab8            ; new value
mov eax, 0x62ab1            mov eax, 0x62ab1            ; expected old value
lock cmpxchg [ebx], eax     lock cmpxchg [ebx], eax     ; <----- here

The question is: can the locked cmpxchg in Thread 1 and the one in Thread 2 both fail?

Since

> the destination operand receives a write cycle without regard to the result of the comparison

I could guess that both threads get their write cycle and then both writes are reverted because each comparison was against a stale value... but I'm not sure if this is correct.

Maybe I need to look at the CAS implementation details, but they are not specified in the Intel instruction reference (at least I could not find them).

  • The problem is with the "old value". As soon as you have loaded eax, the value in memory might change again before you have started the comparison. – Bo Persson May 10 '18 at 10:00
  • @BoPersson Can you please expand? Changed again by whom? Okay, let's say that both of the threads loaded old value into `eax` on different cores. What will happen next after the cores start executing `lock cmpxchg`. – Linda Turasco May 10 '18 at 10:26
  • *"Changed again by whom?"* - by the other thread, or by anyone (there can be more cores fiddling with memory). The `lock` will make sure one of the cores will "win" and get access first for that instruction. The other one will have to wait (and will then see a modified value). – Bo Persson May 10 '18 at 10:32
  • @BoPersson Okay, understood, thanks. So with the `lock` prefix one thread will win anyway. I was confused by the fact that `LL/SC` pair can fail spuriously and no docs say that cas cannot. Btw, can you suggest some reading about `CAS` implementation details on intel CPUs? – Linda Turasco May 10 '18 at 10:44
  • Your example doesn't actually say that you know the old value is `0x62ab1`. Maybe say that, or better change it to `mov eax, [ebx]` to actually load the old value. And also assert that there's no 3rd thread which can change it. – Peter Cordes May 11 '18 at 02:15

2 Answers


My understanding is that, unlike LL/SC, lock cmpxchg cannot fail spuriously: if the value at the memory address matches eax, the exchange happens. It builds that guarantee on the cache coherency protocol, by taking exclusive ownership of the cache line and not yielding it to other cores until the operation is done.

So the CAS can fail in both threads only if some other thread (or device) has already written a different value to the memory location.
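As a minimal sketch of that claim (my example, using std::atomic rather than raw lock cmpxchg, with the values from the question): two threads race a CAS on the same location with the same expected value; with no third writer, exactly one of them should succeed, so both can never fail.

// Minimal sketch: two threads race a CAS with the same expected value.
// With no third writer, exactly one compare_exchange_strong succeeds per run.
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> loc{0x62ab1};   // the "old" value both threads expect

    bool ok1 = false, ok2 = false;
    std::thread t1([&] {
        int expected = 0x62ab1;
        ok1 = loc.compare_exchange_strong(expected, 0x62ab6);  // thread 1's new value
    });
    std::thread t2([&] {
        int expected = 0x62ab1;
        ok2 = loc.compare_exchange_strong(expected, 0x62ab8);  // thread 2's new value
    });
    t1.join();
    t2.join();

    // One thread wins and the loser observes the winner's value;
    // (ok1 && ok2) and (!ok1 && !ok2) should never be printed here.
    std::printf("t1=%d t2=%d final=%#x\n", ok1, ok2, (unsigned)loc.load());
}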

the8472

@the8472's answer is correct, but I wanted to add an alternate answer.

https://www.felixcloutier.com/x86/CMPXCHG.html already specifies the behaviour in enough detail to rule out the possibility of spurious failure. If it could fail for some reason other than the value in memory not matching eax, the docs would have to say so.

You can also note that compilers use a single lock cmpxchg for C++11 std::atomic::compare_exchange_strong, from which you can conclude that compiler writers think lock cmpxchg can't spuriously fail.

#include <atomic>

// Returns true if *a was equal to expect and the exchange to want happened.
bool cas_bool(std::atomic_int *a, int expect, int want) {
    return a->compare_exchange_strong(expect, want);
}

compiles to (gcc7.3 -O3):

cas_bool(std::atomic<int>*, int, int):
    mov     eax, esi                      # expected value has to be in eax
    lock cmpxchg    DWORD PTR [rdi], edx  # if [rdi]==eax then [rdi]=edx, else eax=[rdi]
    sete    al                            # al = ZF (1 if the exchange happened)
    ret
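For contrast, a small addition (not part of the original output above): compare_exchange_weak is the variant that is allowed to fail spuriously, which exists for the benefit of LL/SC machines. On x86 there is nothing weaker than lock cmpxchg to map it to, so it should compile to essentially the same lock cmpxchg / sete sequence (an assumption here, not verified compiler output).

#include <atomic>

// The weak variant may fail spuriously by contract, but x86 has no cheaper
// primitive than lock cmpxchg, so this should produce essentially the same
// code as cas_bool above.
bool cas_bool_weak(std::atomic_int *a, int expect, int want) {
    return a->compare_exchange_weak(expect, want);
}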

See also Can num++ be atomic for 'int num'? for more details of how locked instructions are implemented internally, and how they interact with MESI. (i.e. @the8472's answer is the short version: for an operand that doesn't cross a cache line, a core just hangs onto that cache line so nothing else in the system can read or write it for the duration of the lock cmpxchg).


> the destination operand receives a write cycle without regard to the result of the comparison

The read + write pair is atomic with respect to all other observers in the system. The ordering you propose (read1 / read2 / write1 / abort write2) is impossible, because lock cmpxchg is atomic: read2 can't appear between read1 and write1 in the global order.
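To spell that out with the question's numbers (a sketch, assuming [0x4000] initially holds 0x62ab1 and no third agent writes to it): only two global orders are possible, and in each one the loser's locked read already sees the winner's store, so exactly one thread fails.

; Order A: Thread 1's locked RMW comes first in the global order
T1: locked read  [0x4000] -> 0x62ab1    ; matches eax
T1: locked write [0x4000] <- 0x62ab6    ; T1 succeeds, ZF=1
T2: locked read  [0x4000] -> 0x62ab6    ; does not match eax (0x62ab1)
T2: locked write [0x4000] <- 0x62ab6    ; old value written back; T2 fails, ZF=0, eax=0x62ab6

; Order B is the mirror image: T2 succeeds with 0x62ab8 and T1 fails.
; There is no order in which both threads fail.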

Also, that language only applies to the external memory bus. Modern CPUs with integrated memory controllers can do whatever they want (for lock cmpxchg on an address that's split across two cache lines). Intel may publish documentation for motherboard vendors to use in their internal testing of signals on the memory bus.

That documentation might still be relevant for lock cmpxchg on an MMIO address, but definitely not for an aligned operand in write-back memory. In that case, it's just a cache lock. (And it's a hidden implementation detail whether the L1d cache is written or not when the compare fails). I guess you could test this by seeing if it dirties the cache line (i.e. puts it in Modified state instead of Exclusive).

For more discussion about how lock cmpxchg might work internally vs. xchg, see the chat thread between me and @BeeOnRope following my answer on Exit critical region: https://chat.stackoverflow.com/transcript/message/42472667#42472667 (mostly me having ideas that could work in theory but are incompatible with what we know about Intel x86 CPUs, and @BeeOnRope pointing out my mistakes). There's very little we can conclude for sure about the fine details of the efficiency of xchg vs. lock cmpxchg. It's certainly possible that xchg keeps the cache line locked for fewer cycles than lock cmpxchg, but that needs to be tested. I think xchg has better latency when used back-to-back on the same location from a single thread, though.
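A rough sketch of how that last point could be tested (my assumption-laden example, not from the chat): run each instruction back-to-back on the same location from one thread via GNU inline asm and compare the wall-clock times. The iteration count and timing method are arbitrary choices for illustration.

#include <chrono>
#include <cstdio>

// Sketch only: single-threaded, back-to-back xchg vs lock cmpxchg on one location.
static long shared_loc = 0;

int main() {
    const long iters = 100000000;

    auto t0 = std::chrono::steady_clock::now();
    long v = 1;
    for (long i = 0; i < iters; ++i)          // dependent chain of implicitly-locked xchg
        asm volatile("xchg %0, %1" : "+r"(v), "+m"(shared_loc) :: "memory");
    auto t1 = std::chrono::steady_clock::now();

    for (long i = 0; i < iters; ++i) {        // dependent chain of lock cmpxchg
        long expected = shared_loc;           // reload so the CAS keeps succeeding
        long desired  = expected ^ 1;
        asm volatile("lock cmpxchg %2, %1"
                     : "+a"(expected), "+m"(shared_loc)
                     : "r"(desired)
                     : "cc", "memory");
    }
    auto t2 = std::chrono::steady_clock::now();

    auto ns = [](auto d) {
        return (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(d).count();
    };
    std::printf("xchg:         %lld ns\n", ns(t1 - t0));
    std::printf("lock cmpxchg: %lld ns\n", ns(t2 - t1));
}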

Peter Cordes