
If I want to implement a 128-bit atomic type on x64, can I get away with `_mm_store_si128` and `_mm_load_si128` to avoid `cmpxchg16b` for relaxed load and store?

(If needed, you can assume that only load and store are required, although it would be good if I could mix those with `cmpxchg16b`.)
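For illustration, here is a minimal sketch of the idea being asked about: wrapping aligned SSE2 loads and stores as a hoped-for relaxed 128-bit atomic. The `Atomic128` type and method names are hypothetical, and, as the comments below explain, x86 vendors publish no guarantee that 16-byte SIMD accesses are atomic, so this is not presented as safe or portable.

```cpp
// Sketch only: a hypothetical Atomic128 built on _mm_load_si128/_mm_store_si128.
// NOTE: nothing guarantees that these 16-byte accesses are atomic on any given
// CPU, and the plain member access does not stop the compiler from merging or
// hoisting accesses; this just illustrates what the question proposes.
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_store_si128

struct alignas(16) Atomic128 {
    __m128i value;

    // Hoped-for "relaxed" load: a single aligned 16-byte SSE load.
    __m128i load_relaxed() const {
        return _mm_load_si128(&value);
    }

    // Hoped-for "relaxed" store: a single aligned 16-byte SSE store.
    void store_relaxed(__m128i v) {
        _mm_store_si128(&value, v);
    }
};
```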

Alex Guteniev
  • On some real hardware it's widely believed that the answer is yes, but CPU vendors unfortunately provide no way to query whether it's safe, nor any published guarantee for any CPU that it is. It can be tricky, e.g. multi-socket AMD K10 systems have tearing at 8-byte boundaries only between cores on different sockets (over HyperTransport). – Peter Cordes Apr 14 '20 at 05:37
  • But otherwise yes, `movdqa` load/store in asm is acquire / release, and in practice atomic on recent CPUs. And on CPUs that have `lock cmpxchg16b`, GCC/clang at least use it instead of locking for 16-byte atomics, either by inlining it or with code in `libatomic` (GCC 7 changed to calling it non-lock-free because reading via `lock cmpxchg16b` means readers still contend with each other, so the usual perf expectations of lock-free aren't met). So in that case yes, mixing SIMD with `atomic` would be safe on most hardware. If you only care about one server, test on that server. – Peter Cordes Apr 14 '20 at 05:46
  • I could reopen this and write a short answer about mixing intrinsics with `atomic` on current GCC/clang, if you still want to actually try it despite the lack of any future guarantee that it's safe (or documentation of current behaviour). I closed this because the lack of safety guarantees should be a showstopper for most code, including any code you want to distribute to others. If it were known safe on some/any CPUs, GCC/clang would already use `movdqa` for 16-byte pure-load / pure-store the way GCC uses `movq` for 8-byte in 32-bit mode. (GCC even uses `fild` if SSE1 isn't available.) – Peter Cordes Apr 14 '20 at 06:02
  • I'm satisfied with the answer "it is not always safe". – Alex Guteniev Apr 14 '20 at 06:23
  • I was dissatisfied with 128-bit atomics based both on `cmpxchg16b` (in Boost) and on locks (as shipped with the compiler), so I worked around it by not using a 128-bit atomic. But I observed that on x86 the 64-bit atomic is implemented using SSE or `fild`, as you mentioned (though I observe this with a different compiler, not GCC), so I was curious about SSE for 128 bits. – Alex Guteniev Apr 14 '20 at 06:34
  • GCC 7 and later won't *inline* `lock cmpxchg16b`, but it still uses it via libatomic. The `__atomic_load_16` (or whatever it's called) in libatomic doesn't actually take a lock on CPUs with that feature; single-step into `atomic<__int128>.load` if you're curious. I think clang will inline it if you compile with `-mcx16` or `-march=anything` except k8. – Peter Cordes Apr 14 '20 at 06:41
  • I was looking into relaxed memory loads and stores that do not enforce a memory fence; inlining does not change much in this regard. Though in MSVC 2019 everything is inlined, including `_InterlockedCompareExchange64` and `__iso_volatile_load64` for the 64-bit atomic on x86, locks for the 128-bit atomic on x64, and `_InterlockedCompareExchange128` for 128-bit `boost::atomic` on x64. – Alex Guteniev Apr 14 '20 at 06:58
  • Oh yes, IIRC MSVC just locks for 16-byte objects, missing the (probably minor) optimization of using `lock cmpxchg` directly as a load. Re: fences: the x86 / x86-64 memory model is program order + a store buffer with store forwarding. Compilers get acquire and release for free just by blocking compile-time reordering. (You could maybe use `atomic_signal_fence()` as a hack for a portable compiler-only barrier, or of course just use `atomic_thread_fence(mo_release)`.) You only need an `mfence` instruction after seq_cst stores (or just do them with `xchg`, which is a full barrier). – Peter Cordes Apr 14 '20 at 07:08
  • In MSVC, compiler fences are `_ReadWriteBarrier`, and `std::atomic_thread_fence(seq_cst)` is `lock cmpxchg` with a dummy variable. And I wanted to avoid any sort of interlocked or `*fence` instruction for loads/stores. As I understand it, until they release an ABI-breaking toolset, `cmpxchg16b` cannot be introduced in their `std::atomic<128b>`, even guarded with CPU detection. – Alex Guteniev Apr 14 '20 at 07:33
  • You're talking about ways to write a full barrier in C++ (which compiles to a barrier asm instruction). I'm talking about ways to *just* block compile-time reordering, leaving runtime ordering up to the hardware. Most people call that a "compiler barrier". https://preshing.com/20120625/memory-ordering-at-compile-time/ For example, `atomic_thread_fence(mo_acq_rel)` on x86-64 is zero asm instructions, or a non-inline function call. – Peter Cordes Apr 14 '20 at 15:51
  • Re: changing `std::atomic<128b>` to use `lock cmpxchg16b` - correct, that would be an ABI-breaking change for MSVC because existing code uses a separate lock. But GCC/clang on x86-64 Linux at least work the way I described. https://godbolt.org/z/yp8GYX clang 10.0 still inlines `lock cmpxchg16b` with a `-march` that allows it. – Peter Cordes Apr 14 '20 at 15:55
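To make the GCC/clang behaviour described in these comments concrete, here is a minimal hedged sketch; the `Pair16` struct and the function names are illustrative, not taken from the discussion. Per the comments above, on x86-64 clang with `-mcx16` inlines `lock cmpxchg16b` for the 16-byte accesses below, GCC 7+ routes them through libatomic without taking a separate lock, and the acquire/release fence is a compiler-only barrier that costs zero instructions.

```cpp
// Sketch of the codegen discussed above for GCC/clang on x86-64.
// Build with, e.g.:  clang++ -O2 -mcx16 example.cpp
#include <atomic>
#include <cstdint>

struct Pair16 { std::uint64_t lo, hi; };   // 16-byte trivially copyable payload

std::atomic<Pair16> g_pair;                // 16-byte atomic object

Pair16 read_pair() {
    // Even a pure load goes through lock cmpxchg16b (inlined by clang with
    // -mcx16, or via __atomic_load_16 in libatomic for GCC 7+), so readers
    // contend with each other even though no separate mutex is taken.
    return g_pair.load(std::memory_order_acquire);
}

void write_pair(Pair16 p) {
    // Release store: on x86-64 no extra fence instruction is needed; the
    // 16-byte store itself is implemented as a lock cmpxchg16b retry loop.
    g_pair.store(p, std::memory_order_release);
}

void compiler_barrier_only() {
    // The "compiler barrier" mentioned above: an acq_rel thread fence blocks
    // compile-time reordering but emits no instructions on x86-64.
    std::atomic_thread_fence(std::memory_order_acq_rel);
}
```

As noted in the comments, MSVC instead uses a separate lock for 16-byte `std::atomic` objects, so this codegen is specific to GCC/clang.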

0 Answers