In 64-bit x86 asm, you can use an integer mov rax, [rsi], or x87, or SSE2. As long as the address is 8-byte aligned (or, on Intel P6 and later CPUs, doesn't cross a cache-line boundary), the load or store will be atomic.
Note that the common baseline across AMD and Intel is still only 8-byte alignment; only Intel guarantees atomicity for a cacheable load that's misaligned but not split across cache lines. (AMD may guarantee atomicity within some wider boundary, or at least does so in practice on some later CPUs.)
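For example, plain integer loads and stores are all it takes (a minimal sketch, assuming rsi and rdi hold 8-byte-aligned pointers):

; 64-bit mode: aligned 8-byte integer accesses are already atomic
mov  rax, [rsi]       ; atomic 8-byte load
mov  [rdi], rax       ; atomic 8-byte store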
In 32-bit x86 asm, your only option using only integer registers is lock cmpxchg8b, but that sucks for a pure load or pure store. (You can use it as a load by setting expected = desired = 0, as sketched below: either way edx:eax ends up holding the old value. It still needs write access, though, so it doesn't work on read-only memory.) (gcc/clang use lock cmpxchg16b for atomic<struct_16_bytes> in 64-bit mode, but some compilers simply choose to make 16-byte objects not lock-free.)
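A sketch of that pure-load trick (the register choice is just for illustration; it only works on writeable memory because the instruction always needs write access):

; load qword [esi] atomically into edx:eax using only integer registers
xor   eax, eax             ; expected low  = 0
xor   edx, edx             ; expected high = 0
xor   ebx, ebx             ; desired  low  = 0
xor   ecx, ecx             ; desired  high = 0
lock cmpxchg8b [esi]       ; if [esi] was 0, stores 0 back (no visible change);
                           ; otherwise loads the old value into edx:eax.
                           ; Either way edx:eax = old value of [esi].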
So the answer is: don't use integer regs: fild qword / fistp qword can copy any bit-pattern without changing it (as long as the x87 precision control is set to full 64-bit mantissa). This is atomic for aligned addresses on Pentium and later.
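For example, a sketch of copying one qword to another without going through integer registers:

; atomically copy qword [esi] to qword [edi] via the x87 stack
fild  qword [esi]     ; atomic 8-byte load (treated as a signed 64-bit integer)
fistp qword [edi]     ; atomic 8-byte store; the bit-pattern round-trips exactly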
On a modern x86, use an SSE2 movq load or store. e.g.
; atomically store edx:eax to qword [edi], assuming [edi] is 8-byte aligned
movd   xmm0, eax          ; low dword
pinsrd xmm0, edx, 1       ; SSE4.1: insert edx as the high dword
movq   [edi], xmm0        ; 8-byte atomic store
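A sketch of the reverse direction, loading into edx:eax (pextrd is also SSE4.1):

; atomically load qword [edi] into edx:eax, assuming [edi] is 8-byte aligned
movq   xmm0, [edi]        ; 8-byte atomic load
movd   eax, xmm0          ; low dword
pextrd edx, xmm0, 1       ; SSE4.1: extract the high dword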
With only SSE1 available, use movlps. (For loads, you may want to break the false dependency on the old value of the xmm register with xorps, since movlps only merges into the low half.)
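A sketch with only SSE1: there's no movd from xmm to an integer register (that's SSE2), so getting the value into edx:eax means bouncing through private stack memory, which doesn't need to be atomic:

; SSE1-only atomic 8-byte load of [esi] into edx:eax
sub    esp, 8
xorps  xmm0, xmm0          ; break the false dependency on xmm0's old value
movlps xmm0, [esi]         ; atomic 8-byte load (merges into the low half of xmm0)
movlps [esp], xmm0         ; store to a private stack slot (doesn't need to be atomic)
mov    eax, [esp]
mov    edx, [esp+4]
add    esp, 8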
With MMX, movq to/from mm0-7 works.
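For example (emms is needed before any later x87 code):

; atomically copy qword [esi] to qword [edi] using MMX
movq  mm0, [esi]      ; atomic 8-byte load
movq  [edi], mm0      ; atomic 8-byte store
emms                  ; reset the x87/MMX state before any following x87 instructions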
gcc uses SSE2 movq, SSE1 movlps, or x87 fild/fistp, in that order of preference, for std::atomic<int64_t> in 32-bit mode. Clang -m32 unfortunately uses lock cmpxchg8b even when SSE2 is available: LLVM bug 33109.
Some versions of gcc are configured so that -msse2 is on by default even with -m32 (in which case you could use -mno-sse2 or -march=i486 to see what gcc does without it).
I put load and store functions on the Godbolt compiler explorer to see asm from gcc with x87, SSE, and SSE2. And from clang4.0.1 and ICC18.
gcc bounces through memory as part of int->xmm or xmm->int, even when SSE4 (pinsrd / pextrd) is available. That's a missed optimization (gcc bug 80833). In 64-bit mode it does favour ALU movd + pinsrd / pextrd with -mtune=intel or -mtune=haswell, but apparently not in 32-bit mode, or not for this use-case (64-bit integers in XMM registers, as opposed to proper vectorization). Anyway, remember that only the load or store of atomic<long long> shared itself has to be atomic; the other loads/stores to the stack are private.
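For example, a store of a value that arrives in edx:eax might look like this (a sketch of the kind of code a compiler could emit, not any particular compiler's exact output):

; store edx:eax to the shared atomic qword at [edi]
sub   esp, 8
mov   [esp], eax           ; private stack slot: these dword stores don't need to be atomic
mov   [esp+4], edx
movq  xmm0, [esp]          ; reassemble the 8-byte value
movq  [edi], xmm0          ; the only access that has to be atomic ([edi] 8-byte aligned)
add   esp, 8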
In MSVC, there's an __iso_volatile_load64 intrinsic in later versions of Visual C++ 2019 that can compile to an appropriate sequence of instructions.