First, I found this question: How do I atomically read a value in x86 ASM? But it's a bit different: in my case I want to atomically assign a floating-point (64-bit double) value in a 32-bit application.

From: "Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A"

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

Reading or writing a quadword aligned on a 64-bit boundary

Is it actually possible using some assembly trick?

Peter Cordes
Sargis
  • Hint: read up on memory barrier/memory fence. Btw; why is this question tagged "c++"? – Jesper Juhl Jan 01 '18 at 01:11
  • Are you asking how you get the quadword suitably aligned in memory or are you asking how to store a quadword in memory? – Ross Ridge Jan 01 '18 at 01:14
  • I need each operation to be atomic. I don't care if the value is changed between my read and my write, but with `a = b;` I want a to always contain a correct value from some state of b – Sargis Jan 01 '18 at 01:21
  • The only "trick" is that the read and write need to be 64-bit, so you can use x87 instructions or MMX instructions or SSE instructions. – harold Jan 01 '18 at 01:25

1 Answer

In 64-bit x86 asm, you can use an integer `mov rax, [rsi]`, or x87 or SSE2. As long as the address is 8-byte aligned (or, on Intel P6 and later CPUs, the access doesn't cross a cache-line boundary), the load or store will be atomic.

Note that the common baseline across AMD and Intel is still only 8-byte aligned; only Intel guarantees atomicity for a cacheable load that's misaligned but not split across cache lines. (AMD may guarantee something with a wider boundary, or at least do that in practice for some later CPUs).
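
The two conditions above can be expressed as simple address checks. This is only a sketch: the 64-byte cache-line size is an assumption (typical for current x86 CPUs, not something stated above), and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 64-byte cache lines. Typical for current x86 CPUs,
// but not an architectural guarantee.
constexpr std::uintptr_t kCacheLine = 64;

// Baseline documented by both Intel and AMD: the qword is 8-byte aligned.
constexpr bool is_qword_aligned(std::uintptr_t addr) {
    return addr % 8 == 0;
}

// Weaker condition (Intel P6 and later, cacheable memory): all 8 bytes
// of the access fall inside one cache line.
constexpr bool qword_within_cache_line(std::uintptr_t addr) {
    return addr % kCacheLine <= kCacheLine - 8;
}

// Debug-build check to run before a hand-written 8-byte load or store.
inline void assert_qword_aligned(const void* p) {
    assert(is_qword_aligned(reinterpret_cast<std::uintptr_t>(p)));
}
```

Portable code should rely only on the first check. The second one describes extra behaviour that only Intel documents.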


In 32-bit x86 asm, your only option using only integer registers is lock cmpxchg8b, but that is a poor fit for a pure load or pure store. (You can use it as a load by setting expected = desired = 0, though not on read-only memory, because it always performs a write.) (gcc/clang use lock cmpxchg16b for atomic<struct_16_bytes> in 64-bit mode, but some compilers simply choose to make 16-byte objects not lock-free.)
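
The "compare-exchange as a load" idea can be sketched in C++ rather than asm. The helper name `load_via_cas` is hypothetical, and whether a given compiler turns this into lock cmpxchg8b in 32-bit mode is an assumption to check in the generated asm.

```cpp
#include <atomic>
#include <cstdint>

// Read a 64-bit value using only a compare-exchange.
// If the object holds 0, the CAS succeeds and stores 0 again, which
// changes nothing. Otherwise it fails and copies the current value
// into `expected`. Either way `expected` ends up holding the value
// that was in memory.
inline std::uint64_t load_via_cas(std::atomic<std::uint64_t>& obj) {
    std::uint64_t expected = 0;
    obj.compare_exchange_strong(expected, 0);
    return expected;
}
```

The parameter is a non-const reference on purpose: the operation needs write access to the object, which mirrors the read-only-memory caveat.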

So the answer is: don't use integer regs. fild qword / fistp qword can copy any bit-pattern without changing it (as long as the x87 precision control is set to the full 64-bit mantissa). This is atomic for aligned addresses on Pentium and later.

On a modern x86, use an SSE2 movq load or store, e.g.

; atomically store edx:eax to qword [edi], assuming [edi] is 8-byte aligned
movd   xmm0, eax
pinsrd xmm0, edx, 1         ; SSE4.1: insert edx as dword element 1 (bits 63:32)
movq   [edi], xmm0

With only SSE1 available, use movlps. (For loads, you may want to break the false-dependency on the old value of the xmm register with xorps).

With MMX, movq to/from mm0-7 works.


gcc uses SSE2 movq, SSE1 movlps, or x87 fild/fistp, in that order of preference, for std::atomic<int64_t> in 32-bit mode. Clang -m32 unfortunately uses lock cmpxchg8b even when SSE2 is available: LLVM bug 33109.

Some versions of gcc are configured so that -msse2 is on by default even with -m32 (in which case you could use -mno-sse2 or -march=i486 to see what gcc does without it).

I put load and store functions on the Godbolt compiler explorer to see the asm from gcc with x87, SSE, and SSE2, and from clang 4.0.1 and ICC 18.
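
The functions behind that link are not reproduced here. A comparable pair, written for the double from the question, might look like the sketch below. The names `shared_value`, `store_value`, and `load_value` are hypothetical, and relaxed ordering is an assumption based on the asker only needing each access to be atomic.

```cpp
#include <atomic>

// Hypothetical shared variable. std::atomic<double> takes care of the
// alignment the hardware needs for an atomic 8-byte access.
std::atomic<double> shared_value{0.0};

// Pure store: atomicity only, no ordering with respect to other memory.
void store_value(double v) {
    shared_value.store(v, std::memory_order_relaxed);
}

// Pure load: returns some value that was stored in full, never a mix
// of two different stores.
double load_value() {
    return shared_value.load(std::memory_order_relaxed);
}
```

Compiling this with something like `g++ -m32 -O2`, with and without `-msse2`, shows which instruction sequence the compiler picks. With the default seq_cst ordering instead of relaxed, expect an extra barrier after the store.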

gcc bounces through memory as part of int->xmm or xmm->int, even when SSE4 (pinsrd / pextrd) is available. This is a missed optimization (gcc bug 80833). In 64-bit mode it favours ALU movd + pinsrd / pextrd with -mtune=intel or -mtune=haswell, but apparently not in 32-bit mode or not for this use-case (64-bit integers in XMM instead of proper vectorization). Anyway, remember that only the load or store on the shared atomic<long long> variable has to be atomic; the other loads/stores, to the stack, are private.


In MSVC, there's an __iso_volatile_load64 intrinsic in later versions of Visual C++ 2019 that can compile to an appropriate sequence of instructions.

Peter Cordes
  • For storing, it emits a `cmpxchg8b` loop though. At least the last time I tested. – fuz Jan 01 '18 at 12:19
  • @fuz: you were probably looking at clang, which unfortunately does that. Or you were looking at an RMW operation, not a pure-load or pure-store. Updated the answer with a godbolt link. – Peter Cordes Jan 01 '18 at 18:35