
If I have a compare-and-swap operation like this:

    void push(const T& data) {
        node* newNode = new node{data};
        newNode->next = head.load();
        while (!head.compare_exchange_weak(newNode->next, newNode));
    }

will such an operation lock the cache lines containing the pointers head, newNode->next and newNode for the thread executing it?

P.S. This assumes they are stored in different cache lines, of course.

  • The question is unclear to me: what do you mean by "unique"? Do you mean the cache lines are locked? Do you mean the 3 variables are stored in different cache lines? Note that `desired` is passed by copy. – Jérôme Richard Aug 05 '21 at 22:25
  • Yes, thank you. I guess you target x86 platforms, don't you? – Jérôme Richard Aug 06 '21 at 10:04
  • Maybe the important point here is that the implicit load and store of `newNode->next` are not themselves atomic. If some other thread were reading or writing `newNode->next`, and is not otherwise synchronized with this code, then you would have a data race. – Nate Eldredge Aug 06 '21 at 23:15

1 Answer


This depends on the target architecture. Indeed, cache-line locking is a strategy specific to the target platform and is thus not described by the C++ specification (the standard only says that the read-modify-write is performed atomically).

On x86 platforms, compilers generally use the lock cmpxchg instruction when the data type is small (typically <= 64 bits on a 64-bit system). In that context, the instruction always takes a memory operand and a register operand. The compared value is the one stored in the register rax/eax/ax/al. As a result, expected and desired are stored directly in registers or loaded from memory just before the operation. Only the cache line storing the atomic value can be locked.

Moreover, the instruction's reference documentation gives more detail:

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

Thus, the atomic operation can temporarily lock the cache line for the duration of the lock cmpxchg instruction.

As pointed out by @Frank, an atomic CAS operating on wider types can compile to different instructions. In that case, a lock may be used to ensure atomicity. One can use the is_lock_free() member function to check whether a lock is used in a specific case.

For more information about the x86 cmpxchg instruction and the lock prefix, see the Intel instruction-set reference.

  • Great answer. It might be worth mentioning that this behavior (on x86, at the very least) is also heavily dependent on the type stored in the atomic. OP's example clearly falls into the `lock cmpxchg` case, but other people landing here might be dealing with something else. –  Aug 06 '21 at 15:51
  • Does it mean that in C++ two threads cannot perform the same CAS at the same time? As I understand it, atomic means that no thread can see such an operation half-done, but can two threads perform the operation at the same time? – questionmark Aug 06 '21 at 16:19
  • @Frank Indeed, I modified the answer to add more information about larger types. – Jérôme Richard Aug 06 '21 at 16:23
  • @questionmark "At the same time" is a *very* fuzzy notion when it comes to instructions on modern pipelined CPUs, so I'm not sure your follow-up question is answerable per se. You can check out https://stackoverflow.com/a/2783981/4442671 for a lot of info on the subject. –  Aug 06 '21 at 16:49
  • @questionmark Well, atomic CAS operations on the same memory location on x86 processors are serialized at the instruction level (whatever the language used). CAS on different memory locations can run in parallel. C++ does not specify this. It would be possible for a given architecture to make CAS more parallel. One solution would be to add a dedicated unit in the caches to do that, but this is costly. AFAIK, new Nvidia GPUs do that (to scale better). Still, atomic CAS cannot fully run in parallel. As long as you take care of contention and do not overuse atomic CAS, this should not be critical. – Jérôme Richard Aug 06 '21 at 17:18
  • @JérômeRichard But if some architectures have parallel CAS, wouldn't that make it non-atomic, since two threads could enter the CAS at the same time and, for example, the first comparison could return true in both, so both would try to store the desired value to head? Am I right? – questionmark Aug 06 '21 at 18:47
  • nvm, found this -> https://stackoverflow.com/questions/38940616/how-does-cas-instructions-guarantee-atomicity – questionmark Aug 06 '21 at 18:50