
On x86, atomic RMW instructions like `lock add dword [rdi], 1` are implemented using cache locking on modern CPUs: the cache line is locked for the duration of the instruction. This is done by getting the line into EXCLUSIVE/MODIFIED state when the value is read, and the CPU will not respond to MESI requests from other CPUs until the instruction is finished.
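
To make this concrete, this is the kind of operation I have in mind. On current x86 JVMs a call like the one below typically JIT-compiles to a single `lock xadd` (the class name is just for illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicRmwExample {
    private static final AtomicInteger counter = new AtomicInteger();

    public static void main(String[] args) {
        // A single atomic read-modify-write of one memory location;
        // on x86 HotSpot this typically becomes one `lock xadd` instruction.
        int previous = counter.getAndIncrement();
        System.out.println("previous value: " + previous);
    }
}
```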

There are 2 flavors of concurrent progress conditions, blocking and non-blocking. Atomic RMW instructions are non-blocking. CPU hardware will never sleep or do something else while holding a cache lock (an interrupt happens before or after an atomic RMW, not during), so there is a finite (and small) upper bound on the number of steps before a cache line is released.

Non-blocking algorithms can be split into 3 flavors in theoretical computer science (a small Java sketch contrasting the first two follows the list):

  1. wait-free: all threads will make progress in a finite number of steps.

  2. lock-free: at least one thread will make progress in a finite number of steps.

  3. obstruction-free: if there is no contention, a thread will make progress in a finite number of steps.
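
To make the difference between the first two concrete, here is a rough Java sketch (the names are mine, purely for illustration). The CAS retry loop is the textbook lock-free pattern; the single fetch-and-add is the candidate wait-free one, assuming the hardware eventually gives every core its turn:

```java
import java.util.concurrent.atomic.AtomicLong;

public class ProgressFlavors {
    private static final AtomicLong counter = new AtomicLong();

    // Lock-free pattern: a CAS retry loop. Some thread always succeeds,
    // but a particular thread can in principle keep losing the race.
    static long incrementWithCas() {
        while (true) {
            long current = counter.get();
            if (counter.compareAndSet(current, current + 1)) {
                return current + 1;
            }
            // Lost the race against another thread; retry.
        }
    }

    // Candidate wait-free pattern: a single atomic fetch-and-add
    // (`lock xadd` on x86). There is no software retry; whether this is
    // truly wait-free is exactly the hardware question asked here.
    static long incrementWithFetchAdd() {
        return counter.incrementAndGet();
    }

    public static void main(String[] args) {
        System.out.println(incrementWithCas());
        System.out.println(incrementWithFetchAdd());
    }
}
```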

What kind of guarantee does x86 provide?

I guess it is at least lock-free; if there is contention, at least one CPU will make progress.

But is x86 wait-free for atomic instructions? Is every CPU guaranteed to make progress in a finite number of steps, or could it be that one or more CPUs are starved and potentially delayed indefinitely?

So what happens when there are multiple cores doing atomic operations on the same cache line?

pveentjer
  • Note that `lock` is still able to send a request to the quiescent master (i.e. system agent) coupled with the home agent that will handle the access to memory. `lock` also works with non-cached memory. Anyway, when a lock is in progress other agents have to wait, that's all. Remember the F00F bug? Hardware has different constraints than SW, what you call "wait free" is usually labelled "speculation" and "out of order". What could a core do if it needs to commit a store but it cannot because the target cache is delaying the response? – Margaret Bloom May 12 '20 at 06:55
  • Hi Margaret, thank you for your comment; also all your other comments on x86 are very educational. I'm not understanding how your answer matches my question. Does x86 provide fairness on gaining a lock on a heavily contended cache line? So imagine 2 CPUs doing an atomic operation on the same cache line; will there be a 50/50 distribution on the number of atomic operations? And what if there are many more cores? I just ran a quick benchmark on a dual-socket box and on my local box with a contended AtomicLong.incrementAndGet (lock xadd) and it seems to be evenly distributed. – pveentjer May 12 '20 at 07:40
  • Both QPI and PCIe have flow control based on credits/debits. I'm not an expert on these, but they can be used to perform *arbitration*. I'd expect it to be fair. When a socket accesses another one's cache, the request will, in the end, be handled by the target socket's caching agent. So an intra-core arbitration should suffice to give fairness. But I know very little about these implementation details; probably there's something in a patent. – Margaret Bloom May 12 '20 at 11:44
  • @pveentjer: Are you counting "number of steps" in instructions? Then clearly wait-free. If you're counting in time or clock cycles, then I'd say stores and atomic RMWs are lock-free, but pure loads are still wait-free. (Any number of cores can load from the same line in parallel, and I assume HW arbitration for access to lines prioritizes loads). BTW, I made some edits to focus your question, but if you have a real-world question you might want to be specific about what you mean by "steps". Otherwise it's just computer-science theory and assigning arbitrary names to things. – Peter Cordes May 12 '20 at 13:12

2 Answers


Consider the more general question: If there are multiple active hardware threads, does x86 guarantee that each thread makes forward progress irrespective of what other threads do? The question you posed seems to be specifically about the case where each thread is simultaneously executing an atomic instruction to an overlapping memory location. If the answer is yes, then x86 can be described as "wait-free." (The term is usually only applied to describe a thread synchronization algorithm, but anyway.)

I think it's important to define what "forward progress" means from the perspective of an architecture or an implementation thereof. I don't like to use the term "step" in the definition because it's not clear what is a step and what isn't a step. Instead, I'll use the following definition: An active hardware thread makes forward progress when it completes the next dynamic instruction in program order by retiring it or by switching to an exception handler in case of an error condition. If each active hardware thread can make forward progress in a finite amount of time irrespective of what the other threads do and irrespective of what instructions each thread is executing as long as they don't cause the thread to become inactive, then x86 is wait-free. (Note that interrupt handlers are not part of the program being executed on a hardware thread, so handling interrupts doesn't mean that the thread is making forward progress.)

Is every CPU guaranteed to make progress in a finite number of steps, or could it be that one or more CPUs are starved and potentially delayed indefinitely?

You may be thinking here that, if there are two cores continuously attempting to acquire atomic RMW access to the same location, one of them may always succeed and the other may always fail, getting stuck trying to execute the same atomic instruction without making any progress because it's the next instruction in program order.

This is actually a traditional problem in computer architecture. The reason I want to consider the more general question is because there are many points of possible contention between multiple hardware threads or agents other than acquiring locks. Consider what you said:

CPU hardware will never sleep or do something else while holding a cache lock (an interrupt happens before or after an atomic RMW, not during), so there is a finite (and small) upper bound on the number of steps before a cache line is released.
...
I guess it is at least lock-free; if there is contention, at least one CPU will make progress.

Intel and AMD have never stated that "there is a finite upper bound on the number of steps before a cache line is released." This reasoning can be applied to almost any stage of an instruction's execution. Is there a finite upper bound on the number of steps to fetch an instruction if the fetch missed in the private caches? Is there a finite upper bound on the number of steps to read a value from a shared cache? With hyperthreading, the potential for contention exists at almost every stage of executing any type of instruction. You could ask the same question for each of them. Atomic access contention is not special. One could ask other questions, such as whether it's possible for a core to arbitrarily enter a sleep state and never wake up.

Fundamentally, it doesn't make sense to have multiple cores without making sure at the architectural level, by design, that each core is always able to make forward progress as long as it's active (according to the definition above). Otherwise, the implementation cannot be fully utilized. Every practical ISA has to provide the minimal forward progress guarantee, which is that any operation takes a finite amount of time to complete and is preceded by a finite number of other operations in a global (or multi-agent) order of operations. Some ISAs, such as RISC-V, do explicitly state this.

There are many examples where Intel has explicitly stated in the SDM manual and in many other documents that a shared structure is designed such that fairness is guaranteed, which is a stronger guarantee than minimal forward progress. (For performance or other reasons, this may not always be accurate, though, because some types of requests may always have a higher or the highest priority. Maybe it's better to say that fairness is typically guaranteed and forward progress is guaranteed in general, or something like that.) These examples include the following (off the top of my head):

  • On multi-core processors before Nehalem and on multi-core Atom-branded processors, the L2 superqueue (which includes the L2 controller) is designed to be (generally) fair and to guarantee progress of all agents that it interacts with.
  • The front-side bus (on systems that have an FSB) and the APIC bus (on systems that have a separate APIC bus) both are designed to be fair.
  • Most arbitration points between hardware threads on the same core are designed to be fair. One exception is the uop scheduler (on microarchitectures with a unified RS) or the uop schedulers (on microarchitectures with a distributed RS), which use a first-ready pseudo-FIFO algorithm.
  • On processors that use a crossbar interconnect, fairness is guaranteed at the L3 global queue.
  • On processors with ring interconnects, fairness is guaranteed at some ring stops while only forward progress is guaranteed at other ring stops.

Therefore, if two cores are trying to acquire atomic RMW access to the same location, the atomic instructions are guaranteed to make it through the pipelines and memory hierarchies of each core, and each core's read-lock request will eventually get its turn to be serviced. So, yes, x86 is wait-free according to the definition above. It's worth noting, though, that most or all Intel processors have rarely occurring bugs that cause all or a subset of processors to hang indefinitely.

One interesting consideration is whether it's guaranteed that the progress of a core will not be indefinitely blocked due to continuous handling of interrupts. I think this is mostly dependent on the design of interrupt handlers, so the system software has to guarantee this.

Hadi Brais
  • TL:DR: I don't think the OP claimed that the CPU can't sleep after starting to execute an atomic RMW, just not during the critical part where that would block other threads. (update: I see you removed that paragraph entirely :) – Peter Cordes Jul 31 '20 at 02:59
  • @PeterCordes Oh, you're indeed right. I misunderstood that part. Thanks for pointing it out. – Hadi Brais Jul 31 '20 at 03:02
  • Cheers. Re: the overall point of your answer: is "finite time" really the only good definition we can use for wait-free? Meeting it via finite cores + progress guarantees seems different in spirit. I'd like to make a distinction between read scalability to multiple cores in parallel vs. MESI serializing multiple writers because progress scalability seems related to wait-free vs. lock-free as progress guarantees (all threads vs. one thread make progress per time). I guess I've been thinking of an "all cores hammering" situation, not one access mixed with private work; maybe that's my hang-up. – Peter Cordes Jul 31 '20 at 03:30
  • I meant to say "finite cores + *fairness* guarantees". Maybe there's different terminology that captures the read-side scalability of pure read-only vs. an atomic RMW on a shared counter. It maybe needs more nuance, because if any given cache line only has one writer, contention is minimal, even if atomic RMWs are being used. So I guess that's one example of why it doesn't fully make sense to say that a `lock xadd` could be lock-free but not wait-free. Or maybe that just means that analyzing algorithms needs to be more sophisticated than max_blocking(building blocks). – Peter Cordes Jul 31 '20 at 11:59
  • @PeterCordes Progress Scalability is not related to wait-free vs. lock-free. For example, the upper bound on the number of steps for any thread to make progress may double if we add one more thread, this is not scalable but can still be wait-free. With wait-free, you have the guarantee that no thread can block any other thread from making progress. Lock-free gives the guarantee that even if one thread blocked all other threads from making forward progress, if that thread stops for whatever reason, then at least one other thread will start making progress. So there can be no dead or live locks. – Hadi Brais Jul 31 '20 at 16:10
  • Thanks. If I get around to it, I might write an answer pointing out that distinction. I think wait-free often gets simplified to saying that all threads can make progress, vs. at least 1. Or at least that was my understanding of it until recently, so I assume others might have made that mistake that doesn't match the formal definition. – Peter Cordes Jul 31 '20 at 16:22

When multiple threads happen to lock the same cache line, their execution is serialized. This is write contention (and when the threads are in fact updating different data that merely shares the line, it is called false sharing).

The single-writer principle stems from this. Writes cannot be performed concurrently, as opposed to reads.

The execution time of atomic read-modify-write instructions themselves is fixed and does not depend on the number of threads contending the cache line. Therefore on x86 they are wait-free population-oblivious.

The upper bound of the time it takes to lock a contended cache line is proportional to how much contention the cache line experiences.

From Intel Community:

In some architectures, operations that are not chosen to go first will be stalled (then retried by the hardware until they succeed), while in other architectures they will "fail" (for software-based retry). In an Intel processor, for example, a locked ADD instruction will be retried by the hardware if the target memory location is busy, while a locked "compare and exchange" operation must be checked to see if it succeeded (so the software must notice the failure and retry the operation).

Since cache line locking will be continuously retried, eventually all atomic read-modify-write operations will succeed (the operation is the instruction plus the retrying done by the hardware to lock the cache line).
So yes, every CPU is guaranteed to make progress in a finite number of steps, and atomic read-modify-write operations as a whole on x86 are wait-free bounded.

By the same logic, the x86 store operation as a whole is wait-free bounded, the x86 store instruction itself is wait-free population-oblivious, and x86 loads are always wait-free population-oblivious.

While, as someone suggests, a ucode bug could cause the lock to stay on forever, we do not consider external factors when describing the flavor of an algorithm, but only the logic itself.


Cache line lock acquisition is not fair.

The probability that a thread is selected to acquire the lock is proportional to how close it is to the thread that released the lock. So, threads on the same core are more likely to acquire the lock than threads that share the L2 cache, which are more likely than threads that share the L3 cache. Then, threads on shorter QPI/UPI/NUMA Node paths have an edge over others, and so on.

This holds true for software locks (spin locks) too, since when a release store is issued, it propagates the same way.
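
As a minimal sketch (class and method names are illustrative only), a test-and-set spin lock in Java looks like this; the unlock is just a store that the waiting threads are spinning on, so which thread acquires the lock next depends on the same cache-line propagation described above:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal test-and-set spin lock, for illustration only.
class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        // Atomic RMW; the spinning threads all compete for the
        // cache line holding `locked`.
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait();
        }
    }

    void unlock() {
        // The store that the waiting threads observe; whichever thread
        // sees the updated line first tends to win the next CAS.
        locked.set(false);
    }
}
```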


I ran a benchmark on an Intel Q4'17 desktop CPU that confirms all of the above.
When continuously lock xadding over the same memory location...

  • for 10 seconds, out of 5 threads running on different cores, the fastest thread lock xadded 2.5 times more than the slowest one; out of 10 threads running on different two-way hyper-threads, the ratio was 3 times
  • 300 million times each, on average a progressively smaller number of lock xadds takes a progressively longer time, up to 1.1 ms for 5 threads running on different cores and up to 193 ms for 10 threads running on different two-way hyper-threads

and variance across runs of different processes is high.
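
For reference, here is a rough sketch of this kind of benchmark (thread counts, pinning, and timing details differ from the actual runs, so treat it only as an outline): every thread hammers the same AtomicLong for a fixed time and the per-thread counts are compared afterwards.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ContendedXaddBench {
    static final AtomicLong shared = new AtomicLong();
    static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        int nThreads = args.length > 0 ? Integer.parseInt(args[0]) : 5;
        long[] perThread = new long[nThreads];
        Thread[] threads = new Thread[nThreads];

        for (int i = 0; i < nThreads; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                long ops = 0;
                while (running) {
                    shared.incrementAndGet(); // `lock xadd` on the shared cache line
                    ops++;
                }
                perThread[id] = ops;
            });
            threads[i].start();
        }

        Thread.sleep(10_000); // run for 10 seconds
        running = false;
        for (Thread t : threads) t.join();

        // Compare how evenly the lock xadds were distributed across threads.
        for (int i = 0; i < nThreads; i++) {
            System.out.println("thread " + i + ": " + perThread[i] + " ops");
        }
    }
}
```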

spongebob