How does CAS instructions guarantee atomicity

Question

According to Wiki, CAS do something like this:

function cas(p : pointer to int, old : int, new : int) returns bool {
    if *p ≠ old {
        return false
    }
    *p ← new
    return true
}

Well, it seems for me that if several processors will try to execute CAS instruction with the same arguments, there can be several write attempts at the same time so it's not safe to do it anyway.

Where am I wrong?

Having several writes at the same time is not a problem as long as it's clear who won. You may be thinking of the [ABA problem](https://en.m.wikipedia.org/wiki/ABA_problem) which requires additional handling to be safe. But in the case of ints that's no problem. — Voo, Aug 14 '16 at 09:05
@Voo "Having several writes at the same time is not a problem" -- are you sure? I thought that it's unsafe to do that because, for example, x86 doesn't guarantee atomicity of writes for non-aligned DWORDs — FrozenHeart, Aug 14 '16 at 09:09

Peter Cordes · Accepted Answer · 2022-10-17T10:04:28.980

7

Atomic read-compare-write instructions from multiple cores at the same time (on the same cache line) do contend with each other, but it's up to hardware to sort that out. Hardware arbitration of atomic RMW instructions is a real thing in modern CPUs, and provides some degree of fairness so that one thread spinning on lock cmpxchg can't totally block other threads doing the same thing.

(Although that's a bad design unless your retry could succeed without waiting for another thread to modify anything, e.g. a retry loop that implements fetch_or or similar can try again with the updated value of expected. But if waiting for a lock or flag to change, after the initial CAS fails, it's better to spin on an acquire or relaxed load and only do the CAS if it might succeed.)

There's no guarantee what order they happen in, which is why you need to carefully design your algorithm so that correctness only depends on that compare-and-exchange being atomic. (The ABA problem is a common pitfall).

BTW, that entire block of pseudocode happens as a single atomic operation. Making a read-compare-write or read-modify-write happen as a single atomic operation is much harder for the hardware than just stores, which MESIF/MOESI handle just fine.

are you sure? I thought that it's unsafe to do that because, for example, x86 doesn't guarantee atomicity of writes for non-aligned DWORDs

lock cmpxchg makes the operation atomic regardless of alignment. It's potentially a lot slower for unaligned, especially on cache-line splits where atomically modifying a single cache line isn't enough.

See also Atomicity on x86 where I explain what it means for an operation to be atomic.

edited Oct 17 '22 at 10:04

answered Aug 14 '16 at 09:13

Peter Cordes

328,167
45
605
847

Is it not enough to just have the `CMPXCHG` instruction? Should I mark it with the `LOCK` prefix too? – FrozenHeart Aug 14 '16 at 09:18
1

@FrozenHeart: You need `lock` if you want it to be atomic! `lock` is implicit for `xchg` with a memory operand, but not for any other instruction. – Peter Cordes Aug 14 '16 at 09:26
So, `CAS` by itself is not enough to provide atomicity, so we should prefix it with the `LOCK` prefix as well. What's the point of having such instruction like `CMPXCHG` then? Just because we can `LOCK` only **one** instruction at the time? – FrozenHeart Aug 14 '16 at 09:30
@FrozenHeart: Are you asking why non-locked `cmpxchg` exists? IDK. x86 supports `lock add [mem], r32`, and that's obviously useful without `lock`. Maybe they thought `cmpxchg` might be useful without `lock`, so didn't make `lock` implicit for it. – Peter Cordes Aug 14 '16 at 09:33
But we certainly can use the `LOCK` prefix for one instruction only? So we can't mark the specific region to be `LOCK`ed, for example – FrozenHeart Aug 14 '16 at 09:34
@FrozenHeart: And yes, `lock` only applies to one instruction. You can't use it to make multiple instructions into an atomic group. For that you need TSX: See http://stackoverflow.com/questions/38894978/can-we-do-something-atomically-with-2-or-more-lock-free-containers-without-locki/38902054?noredirect=1#comment65152037_38894978. (Or the usual solution for making a critical section atomic: have all potential writers and observers use a lock. [simple x86 spinlock](http://stackoverflow.com/questions/37241553/locks-around-memory-manipulation-via-inline-assembly/37246263#37246263)) – Peter Cordes Aug 14 '16 at 09:34
So, if I understand it correctly. We need a `LOCK` prefix for instructions that operate on the same memory address (for example, writes to the same address) just because it is not guaranteed that they will happen atomically on non-aligned memory, so we'll have a problem of multiple cores / processors writing to the same memory region which is UB. Am I right? – FrozenHeart Aug 14 '16 at 09:39
@FrozenHeart: No. UB is a C concept. The x86 ISA is much more specific on what can happen. Why are you so concerned with unaligned addresses? If you want something to be shared between threads, aligning it will be a big win for performance, and will make pure load and pure stores atomic. (But you don't get read-modify-write atomicity for free; that always costs extra) http://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic/36685056#36685056. – Peter Cordes Aug 14 '16 at 09:46
@FrozenHeart: Are you actually trying to write anything in asm, or are you trying to learn about C? Either way, read Jeff Preshing's articles on lock-free programming if you want to do your own stuff without just using locks. This one might be a good start. http://preshing.com/20120612/an-introduction-to-lock-free-programming/ – Peter Cordes Aug 14 '16 at 09:47
"No. UB is a C concept" -- but it's definitely wrong, right? I mean, having two cores / processors that modify the same memory region at once – FrozenHeart Aug 14 '16 at 10:23
@FrozenHeart: Well, it's usually a bug if you have multiple threads modifying the same memory with non-atomic operations, but aligned loads/stores are atomic, and `lock` gives you atomic RMWs. Read Jeff Preshing's link about lock-free algorithms to learn what kind of code you have to write for correctness when doing that. – Peter Cordes Aug 14 '16 at 18:10
1

Using instructions such as `cmpxchg` without the prefix is useful in at least a couple of scenarios: (1) on single-CPU systems, the lock prefix can be removed, at least outside of obscure scenarios such as DMA or other non-CPU agents contending on the same macro. Single-CPU builds (i.e., CONFIG_SMP=n) are a good example - most of the atomic ops are hidden behind macros which can simply omit the `lock` prefix when SMP is disabled. Without the `lock` prefix, you still have instruction-atomicty in the face of interrupts which is all you need. – BeeOnRope Aug 30 '16 at 06:03
(2) Staying with Linux as an example, even on SMP kernels, there are "per CPU" mechanisms that have been introduced to try to deal with the cost and latency of the locked instructions. The idea is to ensure only a single CPU writes to a given location. Since interrupts may still occur, you can use `cmpxchg` to make an interrupt-atomic update to such a per-CPU location without need the locked variant. See for example [this description](https://www.kernel.org/doc/Documentation/local_ops.txt) of "local ops" in the kernel. – BeeOnRope Aug 30 '16 at 06:06
1

@BeeOnRope: that's a good point. Linux actually goes farther than that: booting an SMP kernel on a UP system actually patches the `lock` prefixes into `0x90` NOP in all the places where it's not needed on UP. It also patches out some lock/unlock code if preemption is disabled. (IIRC, the kernel log message is `patching SMP alternatives` or something). Excellent point about being atomic at an instruction level; I can see how non-LOCKed CMPXCHG could actually be useful, the same way that [`inc [foo]` is different from separate load/store](http://stackoverflow.com/a/34780606/224132). – Peter Cordes Aug 30 '16 at 06:32

score 1 · Answer 2 · answered Aug 14 '16 at 09:11

If you read the wiki it says that CAS "is an atomic version of the following pseudocode" of the code you posted. Atomic means that the code will execute without interruptions from other threads. So even if several threads try to execute this code at the same time with the same arguments (like you suggest) only one of them will return true, because in practice they will not execute simultaneously since the atomicity require they run in isolation.

And since you mention "x86 doesn't guarantee atomicity of writes for non-aligned DWORDs", this is not an issue here either because the atomic property of the cas function.

How does CAS instructions guarantee atomicity

2 Answers2

Linked