Normally you can make sure J is sufficiently aligned (e.g. naturally aligned). Then plain mov is sufficient for pure-load or pure-store, and much more efficient than lock-anything in the uncontended case.
GJ's answer quotes the relevant part of Intel's manual re: alignment, same as in Why is integer assignment on a naturally aligned variable atomic on x86? Note that the common subset that's atomic on AMD as well as Intel is less forgiving than Intel alone: AMD can tear across boundaries narrower than a cache line, but naturally-aligned 8-byte load/store are safe on both.
If you're familiar with C++11 std::atomic memory_order_acquire / _release and seq_cst, see the mappings to asm for various ISAs: https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html. Or look at compiler output for stuff like x.store(1, std::memory_order_release) on https://godbolt.org/
default rel
section .bss
align 4 ; natural alignment
J: resd 1 ; reserve 1 DWORD (NASM syntax)
section .text
mov eax, [J] ; read J (acquire semantics)
mov [J], eax ; write J (release semantics)
;;; seq_cst write J and wait for it to be globally visible before later loads (and stores, but that already happens with mov)
xchg [J], eax ; implicit LOCK prefix, full memory barrier.
The seq_cst store could also be done with mov [J], eax + mfence, but that's slower on most CPUs; GCC recently switched to using XCHG, like other compilers have been doing for a while. In fact, MFENCE is so slow on Skylake that it can be better to use lock or byte [rsp], 0 instead of mfence when you need a barrier separate from a store (atomic_thread_fence(mo_seq_cst)).
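As a sketch of those two alternatives (same NASM style and the same J as above; [rsp] is just a dummy byte on this thread's own stack that's already hot in cache):
mov [J], eax          ; release store
mfence                ; makes it seq_cst, but slower than xchg on most CPUs
;; stand-alone full barrier like atomic_thread_fence(mo_seq_cst), avoiding mfence:
lock or byte [rsp], 0 ; locked RMW that changes nothing, used purely as a full barrier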
Both parts of @GJ's suggested code are unnecessarily slow, unfortunately.
You also don't need SFENCE unless you've been using NT stores like movntps [mem], xmm0. (Does the Intel Memory Model make SFENCE and LFENCE redundant? Yes.) x86's memory model is already program order + a store buffer with store forwarding, so every plain load and plain store is an acquire or release operation, and there's no StoreStore reordering of normal stores (to normal memory regions, WB = Write-Back, not video RAM or something).
If you're storing a "data ready" flag after some NT stores, and want that flag store to be a release operation wrt. those earlier NT stores, you want SFENCE before your store, to make sure a reader that sees this store will also see all this thread's earlier stores.
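A sketch of that pattern, with buf and data_ready as hypothetical labels (not from the code above):
movntps [buf], xmm0      ; weakly-ordered NT store(s) of the payload
; ... more NT stores ...
sfence                   ; earlier NT stores become globally visible before anything after this
mov byte [data_ready], 1 ; plain store of the flag; now effectively a release wrt. the NT stores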
An SFENCE after a plain store would only stop later NT stores from being reordered ahead of it, and that's not normally something that would be a problem even if it did happen.
If you're worried about visibility to other cores, don't be: the store buffer (the primary cause of StoreLoad reordering) already commits data to L1d cache as fast as it can. Barrier instructions like MFENCE don't make data visible to other cores sooner, they just block the current thread's later load/store operations until earlier stores become globally visible by the normal mechanism.
If I don't use fences, how long could it take a core to see another core's writes? You usually only need acquire/release semantics which are free on x86, not sequential consistency.
The only reason to use lock cmpxchg for a load would be if your data wasn't aligned. But cache-line-split locks are extremely slow, like locking up memory access for all cores instead of just making the current core hold onto exclusive ownership (MESI) of one cache line. There's a performance counter specifically for split locks, and there's even a recent CPU feature that can make them fault so you can find such problems in VMs without access to HW perf counters.
And if you don't know that your data is aligned, a mov store wouldn't be guaranteed atomic either, so it doesn't make sense to suggest that pair of operations. If you want sequential consistency, putting the full barrier on stores almost always makes more sense, because loads are more common than stores and can be extremely cheap.
lock cmpxchg8b can be useful on 32-bit x86 to do an atomic 8-byte load or store. But it's rarely necessary just for atomicity: cmpxchg8b is new in P5 Pentium, and P5 already guarantees that aligned 8-byte load/store are atomic, so at worst you can use x87 fild / fistp to copy through a local on the stack. (Assuming the x87 FPU is set to full 64-bit precision, so it can convert any 64-bit bit-pattern to/from the 80-bit format without loss.)
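A sketch of the x87 version of an atomic 8-byte load in 32-bit code, with J64 as a hypothetical naturally-aligned qword:
sub esp, 8          ; scratch space for a private local
fild qword [J64]    ; atomic 8-byte load into st0 (P5 and later)
fistp qword [esp]   ; spill to the local; not shared, so this store doesn't need to be atomic
mov eax, [esp]      ; low half
mov edx, [esp+4]    ; high half
add esp, 8
(An atomic store goes the other way: build the qword in a local, fild from it, fistp to [J64].)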
On more recent x86, even in 32-bit mode you can assume at least MMX for movq mm0, [J] / movd eax, mm0 / etc., or SSE2 movq into/out of an XMM register. This is what gcc -m32 uses. Of course 64-bit mode can just use 64-bit integer registers. 16-byte atomic load/store can be done with lock cmpxchg16b. (Aligned 16-byte SSE load/store are not guaranteed to be atomic, although in practice they are on the majority of recent CPUs. But the corner cases can be tricky, e.g. Why is integer assignment on a naturally aligned variable atomic on x86? links to an example of multi-socket AMD K10 tearing at 8-byte boundaries, but only between cores on separate sockets.)
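For example, a sketch of the SSE2 version of an 8-byte atomic load in 32-bit mode, again with a hypothetical naturally-aligned qword J64:
movq xmm0, [J64]    ; atomic 8-byte load
movd eax, xmm0      ; low 32 bits
psrlq xmm0, 32      ; shift the high half down
movd edx, xmm0      ; high 32 bits
(For an atomic store, build the qword in xmm0, e.g. movq xmm0, [esp] from a local, then movq [J64], xmm0.)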