Is x86 CMPXCHG atomic, if so why does it need LOCK?

Question

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

My question is

Can CMPXCHG operate with memory address? From the document it seems not but can anyone confirm that only works with actual VALUE in registers, not memory address?
If CMPXCHG isn't atomic and a high level language level CAS has to be implemented through LOCK CMPXCHG (with LOCK prefix), what's the purpose of introducing such an instruction at all?

(I am asking from a high level language perspective. I.e., if the lock-free algorithm has to be translated into a LOCK CMPXCHG on the x86 platform, then it's still prefixed with LOCK. That means the lock-free algorithms are not better than ones with a carefully written synchronized lock / mutex (on x86 at least). This also seems to make the naked CMPXCHG instruction pointless, as I guess the major point for introducing it, was to support such lock-free operations.)

Obviously you can use a memory address, that's the whole point. The first operand is of type r/m, so there you go. And how could you prefix the instruction with `lock` if it didn't itself exist? — harold, Jan 08 '15 at 11:49
@harold I don't quite understand what doesn't exist. You prefix with LOCK if you want the instruction to be atomic. So the CMPXCHG, without LOCK prefix, is atomic or not? — Alex Suo, Jan 08 '15 at 12:38
No, but in your question 2 you seem to ask why "cmpxchg without lock" exists, which is sort of odd, since the combination can't exist without the parts - if that is not what you meant then can you clarify? — harold, Jan 08 '15 at 12:45
@harold I am asking from a high level language perspective. I.e. if the lock-free algorithms has to be translated into LOCK CMPXCHG on x86 platform, then it's still prefixed with LOCK. That means the lock-free algos are not better than a carefully written synchronized lock / murex (on x86 at least). This seems to make the CMPXCHG instruction pointless, as I guess the major point for introducing it, is to support such lock-free operations. — Alex Suo, Jan 08 '15 at 12:54
@Alex Suo: You are mixing up high-level locks with the lowlevel CPU feature that happened to be named `LOCK`. The high-level locks that lock-free algorithms try to avoid will have to put threads into wait state until the lock is available which is a costly operation and an entirely different thing than the CPU `LOCK` prefix feature which might hold other threads for a single instruction only. — Holger, Jan 08 '15 at 18:12
@Holger Thanks. Can you post this as an answer so I can accept it :) — Alex Suo, Jan 09 '15 at 01:43

Peter Cordes · Answer 1 · 2023-05-16T20:11:03.563

It seems like part what you're really asking is:

Why isn't the lock prefix implicit for cmpxchg with a memory operand, like it is for xchg (since 386)?

The simple answer (that others have given) is simply that Intel designed it this way. But this leads to the question:

Why did Intel do that? Is there a use-case for cmpxchg without lock?

On a single-CPU system, cmpxchg is atomic with respect to other threads, or any other code running on the same CPU core. (But not to "system" observers like a memory-mapped I/O device, or a device doing DMA reads of normal memory, so lock cmpxchg was relevant even on uniprocessor CPU designs).

Context switches can only happen on interrupts, and interrupts happen before or after an instruction, not in the middle. Any code running on the same CPU will see the cmpxchg as either fully executed or not at all.

For example, the Linux kernel is normally compiled with SMP support, so it uses lock cmpxchg for atomic CAS. But when booted on a single-processor system, it will patch the lock prefix to a ds prefix everywhere that code was inlined, since plain cmpxchg without the lock runs much faster than lock cmpxchg. (The ds prefix has no effect except to take up the space; Linux uses a flat memory model so even in 32-bit code using (%ebp) or (%esp) addressing modes, it's still the same as a plain cmpxchg.) For more info, see this LWN article about Linux's "SMP alternatives" system. It can even patch back to lock prefixes before hot-plugging a second CPU.

Read more about atomicity of single instructions on uniprocessor systems in this answer, and in @supercat's answer + comments on Can num++ be atomic for int num. See my answer there for lots of details about how atomicity really works / is implemented for read-modify-write instructions like lock cmpxchg.

(This same reasoning also applies to cmpxchg8b / cmpxchg16b, and xadd, which are usually only used for synchonization / atomic ops, not to make single-threaded code run faster. Of course memory-destination instructions like add [mem], reg have obvious uses for non-shared data.)

Interrupting instruction in the middle of execution only a few instructions like rep movsb and vpgatherdd are interruptible part way through, and they don't support lock. They also have a well-defined way to update architectural state to record their partial progress, not like a few ISAs where microarchitectural progress can get saved in hidden locations and resumed after an interrupt.
Interrupting an assembly instruction while it is operating quotes Intel's manuals about that guarantee
When an interrupt occurs, what happens to instructions in the pipeline?

" But when booted on a single-processor system, will patch the lock prefix to a nop everywhere that code was inlined, since nop cmpxchg runs much faster than lock cmpxchg." I'd assume you mean compiles in a single-processor system? As I am not aware of OS can patch such compiled instructions at run-time. — Alex Suo, Jun 01 '17 at 01:25
@AlexSuo: No, Linux's [SMP alternatives](https://lwn.net/Articles/164121/) system really does patch the kernel image on UP systems. (And BTW, if it was purely a compile-time thing, it would depend on whether you were building *for* a UP system, not *on* a UP system. I think if you leave out `CONFIG_SMP`, some locking / synchronization stuff can be left out entirely instead of being patched into NOPs at boot time. But these days probably not that much, especially with the standard `CONFIG_PREEMPT` that allows kernel code to be pre-empted.) — Peter Cordes, Jun 01 '17 at 03:03
@AlexSuo, IMHO this answer should be chosen as the accepted one. — raiks, Mar 20 '20 at 14:37
@raiks: I thought the same thing; [Holger pointed out](https://stackoverflow.com/questions/27837731/is-x86-cmpxchg-atomic-if-so-why-does-it-need-lock/44273130?noredirect=1#comment75565684_27856649) that the OP's real confusion was different from what they asked in the question itself, so that explains why Holger wrote that answer and the OP accepted it, even though the interesting part of the question (IMO) is what I answered. — Peter Cordes, Mar 20 '20 at 19:49
@Jeshan: That's (almost) correct, [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) stands for "Symmetric multiprocessing". I assumed everyone knew what the term stood for. Configuring Linux with `CONFIG_SMP=y` is a compile-time choice to include code necessary to take advantage of multiple CPU cores. Without it, there won't be any `lock` prefixes to patch away in the first place. — Peter Cordes, Feb 16 '22 at 07:12
The LOCK prefix is changed to a DS: segment override prefix, not a separate NOP instruction. This 2008 patch made the change, motivated by a correctness issue (but it's also more efficient): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f88f07e0 — amonakov, May 16 '23 at 20:06
@amonakov: Thanks, that makes sense. I was thinking `nop` seemed like probably not the best way possible, but I didn't think of using a prefix and hadn't checked the actual code. — Peter Cordes, May 16 '23 at 20:21

score 34 · Accepted Answer · answered Jan 09 '15 at 08:41

34

You are mixing up high-level locks with the low-level CPU feature that happened to be named LOCK.

The high-level locks that lock-free algorithms try to avoid can guard arbitrary code fragments whose execution may take arbitrary time and thus, these locks will have to put threads into wait state until the lock is available which is a costly operation, e.g. implies maintaining a queue of waiting threads.

This is an entirely different thing than the CPU LOCK prefix feature which guards a single instruction only and thus might hold other threads for the duration of that single instruction only. Since this is implemented by the CPU itself, it doesn’t require additional software efforts.

Therefore the challenge of developing lock-free algorithms is not the removal of synchronization entirely, it boils down to reduce the critical section of the code to a single atomic operation which will be provided by the CPU itself.

answered Jan 09 '15 at 08:41

Holger

285,553
42
434
765

Then is it correct to say that CMPXCHG still holds the lock which is different from program level lock (i.e. JVM lock) – Rohit Sachan Apr 19 '16 at 09:06
5

@Rohit Sachan: you could say that it holds the BUS lock, but since this holds for every memory access, the only difference is that it is held for two memory accesses made by a single instruction and, more important, it’s just confusing when talking about “lock free programming”. In other words, you should always care about whether the discussion is about hardware or software architecture… – Holger Apr 19 '16 at 09:37
7

I think the OP is partly asking "what's the point of `cmpxchg` without `lock`?". See [my answer](https://stackoverflow.com/questions/27837731/is-x86-cmpxchg-atomic/44273130#44273130): Intel designed it that way because it's useful on a uniprocessor system. – Peter Cordes May 30 '17 at 22:54
5

@Peter Cordes: the OP’s actual question became apparent in [this comment](https://stackoverflow.com/questions/27837731/is-x86-cmpxchg-atomic/27856649?noredirect=1#comment44084729_27837731). I only posted an answer after the OP acknowledged that this confusion between high level locks and the instruction prefix is the actual issue. Your addition still might be valuable for future readers finding this question after searching for “lock prefix”. – Holger May 31 '17 at 08:21
2

Ah I see. No wonder this question didn't get much attention. I only noticed it in the related list while tidying up some of my old answers, and didn't wade through all the comments. – Peter Cordes May 31 '17 at 08:25
What a lucid and technical explanation of the backend philosophy! – soham Aug 09 '20 at 21:02
Thank you Holger, your answer really helps me a lot ! – zhangjie Apr 16 '21 at 05:24
Very interesting answer. But it still eludes me why they made the lock on XCHG implicit. Afterall there are more uses for an unlocked XCHG [mem],reg, certainly because the 16/32-bit x86 design isn't exactly renowned for its large register file. – Olorin Mar 23 '23 at 13:25

score 3 · Answer 3 · edited Nov 01 '20 at 09:22

3

The LOCK prefix is to lock the memory access for the current command, so that other commands that are in the CPU pipeline can not access the memory at the same time. Using the LOCK prefix, the execution of the command won't be interrupted by another command in the CPU pipeline due to memory access of other commands that are executed at the same time. The INTEL manual says:

The LOCK prefix can be prepended only to the following in structions and only to those forms of the instructions where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. If the LOCK prefix is used with one of these instructions and the source operand is a memory operand, an undefined opcode exception (#UD) may be generated.

edited Nov 01 '20 at 09:22

janw

8,758
11
40
62

answered Jan 08 '15 at 16:13

Van Uitkon

356
1
6

Which pipeline? – curiousguy Nov 17 '19 at 17:37
Did you mean "**can't** access memory at this time?" – Peter Cordes Oct 22 '20 at 20:35
1

Equally importantly, `lock` prevents accesses from *other* cores (and even DMA etc.) from appearing to happen between the load and store of an atomic RMW. Merely preventing other operations from this core from interleaving wouldn't be sufficient. – Peter Cordes Nov 01 '20 at 09:55

Is x86 CMPXCHG atomic, if so why does it need LOCK?

3 Answers3

Related:

Linked

Related