Normally you can make sure J is sufficiently aligned (e.g. naturally aligned). Then plain mov is sufficient for pure-load or pure-store, and much more efficient than lock-anything in the uncontended case.
GJ's answer quotes the relevant part of Intel's manual re: alignment, same as in Why is integer assignment on a naturally aligned variable atomic on x86? Note that the common subset that's atomic on AMD as well as Intel is less forgiving than Intel alone: AMD can tear across boundaries narrower than a cache line, but naturally-aligned 8-byte load/store are safe on both.
If you're familiar with C++11 std::atomic memory_order_acquire / _release and seq_cst, see the mappings to asm for various ISAs: https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html. Or look at compiler output for stuff like x.store(1, std::memory_order_release) on https://godbolt.org/
default rel
section .bss
align 4 ; natural alignment
J: resd 1 ; reserve 1 DWORD (NASM syntax)
section .text
mov eax, [J] ; read J (acquire semantics)
mov [J], eax ; write J (release semantics)
;;; seq_cst write J and wait for it to be globally visible before later loads (and stores, but that already happens with mov)
xchg [J], eax ; implicit LOCK prefix, full memory barrier.
The seq_cst store could also be done with mov [J], eax + mfence, but that's slower on most CPUs; GCC recently switched to using XCHG, like other compilers have been doing for a while. In fact, MFENCE is so slow on Skylake that it can be better to use lock or byte [rsp], 0 instead of mfence when you need a barrier separate from a store (atomic_thread_fence(mo_seq_cst)).
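As a sketch of those two alternatives (same NASM style and the same J as above; [rsp] is just a dummy byte on this thread's own stack that's already hot in cache):
mov [J], eax          ; release store
mfence                ; makes it seq_cst, but slower than xchg on most CPUs
;; stand-alone full barrier like atomic_thread_fence(mo_seq_cst), avoiding mfence:
lock or byte [rsp], 0 ; locked RMW that changes nothing, used purely as a full barrier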
Both parts of @GJ's suggested code are unnecessarily slow, unfortunately.
You also don't need SFENCE unless you've been using NT stores like movntps [mem], xmm0. (Does the Intel Memory Model make SFENCE and LFENCE redundant? Yes.) x86's memory model is already program order + a store buffer with store forwarding, so every plain load and plain store is an acquire or release operation, and there's no StoreStore reordering of normal stores (to normal memory regions, WB = Write-Back, not video RAM or something).
If you're storing a "data ready" flag after some NT stores, and want that flag store to be a release operation wrt. those earlier NT stores, you want SFENCE before your store, to make sure a reader that sees this store will also see all this thread's earlier stores.
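A sketch of that pattern, with buf and data_ready as hypothetical labels (not from the code above):
movntps [buf], xmm0      ; weakly-ordered NT store(s) of the payload
; ... more NT stores ...
sfence                   ; earlier NT stores become globally visible before anything after this
mov byte [data_ready], 1 ; plain store of the flag; now effectively a release wrt. the NT stores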
An SFENCE after a plain store would only stop later NT stores from being reordered ahead of it, and that's not normally something that would be a problem even if it did happen.
If you're worried about visibility to other cores, don't be: the store buffer (the primary cause of StoreLoad reordering) already commits data to L1d cache as fast as it can. Barrier instructions like MFENCE don't make data visible to other cores sooner, they just block the current thread's later load/store operations until earlier stores become globally visible by the normal mechanism.
If I don't use fences, how long could it take a core to see another core's writes? You usually only need acquire/release semantics which are free on x86, not sequential consistency.
The only reason to use lock cmpxchg for a load would be if your data wasn't aligned. But cache-line-split locks are extremely slow, like locking up memory access for all cores instead of just making the current core hold onto exclusive ownership (MESI) of one cache line. There's a performance counter specifically for split locks, and there's even a recent CPU feature that can make them fault so you can find such problems in VMs without access to HW perf counters.
And if you don't know that your data is aligned, a mov store wouldn't be guaranteed atomic either, so it doesn't make sense to suggest that pair of operations. If you want sequential consistency, putting the full barrier on stores almost always makes more sense, because loads are more common than stores and can be extremely cheap.
lock cmpxchg8b can be useful on 32-bit x86 to do an atomic 8-byte load or store. But it's rarely necessary just for atomicity: cmpxchg8b is new in P5 Pentium, and P5 already guarantees that aligned 8-byte load/store are atomic, so at worst you can use x87 fild / fistp to copy through a local on the stack. (Assuming the x87 FPU is set to full 64-bit precision, so it can convert any 64-bit bit-pattern to/from the 80-bit format without loss.)
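A sketch of the x87 version of an atomic 8-byte load in 32-bit code, with J64 as a hypothetical naturally-aligned qword:
sub esp, 8          ; scratch space for a private local
fild qword [J64]    ; atomic 8-byte load into st0 (P5 and later)
fistp qword [esp]   ; spill to the local; not shared, so this store doesn't need to be atomic
mov eax, [esp]      ; low half
mov edx, [esp+4]    ; high half
add esp, 8
(An atomic store goes the other way: build the qword in a local, fild from it, fistp to [J64].)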
On more recent x86, even in 32-bit mode you can assume at least MMX for movq mm0, [J] / movd eax, mm0 / etc., or SSE2 movq into/out of an XMM register. This is what gcc -m32 uses. Of course 64-bit mode can just use 64-bit integer registers. 16-byte atomic load/store can be done with lock cmpxchg16b. (Aligned 16-byte SSE load/store are not guaranteed to be atomic, although in practice they are on the majority of recent CPUs. But the corner cases can be tricky, e.g. Why is integer assignment on a naturally aligned variable atomic on x86? links to an example of multi-socket AMD K10 tearing at 8-byte boundaries, but only between cores on separate sockets.)
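For example, a sketch of the SSE2 version of an 8-byte atomic load in 32-bit mode, again with a hypothetical naturally-aligned qword J64:
movq xmm0, [J64]    ; atomic 8-byte load
movd eax, xmm0      ; low 32 bits
psrlq xmm0, 32      ; shift the high half down
movd edx, xmm0      ; high 32 bits
(For an atomic store, build the qword in xmm0, e.g. movq xmm0, [esp] from a local, then movq [J64], xmm0.)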