
Assume the architecture is ARM64 or x86-64.

I want to make sure whether these two are equivalent:

  1. a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
  2. MyBarrier(); a = *(volatile __int64*)p; MyBarrier();

Where MyBarrier() is a compiler-level memory barrier (hint), like __asm__ __volatile__ ("" ::: "memory"). So method 2 is supposed to be faster than method 1 (both methods are sketched below).
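For concreteness, a compilable sketch of the two methods might look like this (a minimal sketch assuming MSVC; `_ReadWriteBarrier()` stands in for `MyBarrier()` here, since the inline-asm form above is GCC/Clang syntax):

```cpp
// Minimal sketch of the two methods, assuming MSVC intrinsics.
#include <intrin.h>

inline void MyBarrier() {
    _ReadWriteBarrier();  // compiler-level barrier only; emits no instructions
}

__int64 read_v1(__int64* p) {
    // Method 1: atomic RMW; stores 0 only if *p == 0, returns the old value.
    return _InterlockedCompareExchange64(p, 0, 0);
}

__int64 read_v2(__int64* p) {
    // Method 2: plain aligned 64-bit load, fenced against compiler reordering.
    MyBarrier();
    __int64 a = *(volatile __int64*)p;
    MyBarrier();
    return a;
}
```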

I have heard that the _Interlocked*() functions also imply a full memory barrier at both the compiler and hardware level.

I have also heard that reads of properly aligned native-width data are atomic on these architectures, but I am not sure whether method 2 can be used safely in general.

(P.S. I think the CPU handles data dependencies automatically, so a hardware barrier is not my main concern here.)

Thank you for any advice or corrections.


Here are some benchmarks on Ivy Bridge (an i5 laptop); a sketch of the timing harness follows the listings.

(1E+006 loops: 27ms):

; __int64 a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR val$[rsp], rbx

(1E+006 loops: 27ms):

; __faststorefence(); __int64 a = *(volatile __int64*)p;
lock or DWORD PTR [rsp], 0
mov rcx, QWORD PTR val$[rsp]

(1E+006 loops: 7ms):

; _mm_sfence(); __int64 a = *(volatile __int64*)p;
sfence
mov rcx, QWORD PTR val$[rsp]

(1E+006 loops: 1.26ms, not synchronized?):

; __int64 a = *(volatile __int64*)p;
mov rcx, QWORD PTR val$[rsp]
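For context, a hypothetical reconstruction of the timing harness (the loop count matches the 1E+006 above; everything else, including the use of `QueryPerformanceCounter` for timing, is an assumption):

```cpp
// Hypothetical benchmark harness (MSVC, x64); times one of the variants above.
#include <windows.h>
#include <intrin.h>
#include <cstdio>

int main() {
    __int64 val = 42;
    __int64 sink = 0;  // consume results so the loop isn't optimized away
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (int i = 0; i < 1000000; ++i)
        sink ^= _InterlockedCompareExchange64(&val, 0, 0);
    QueryPerformanceCounter(&t1);
    printf("%.3f ms (sink=%lld)\n",
           1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart,
           (long long)sink);
    return 0;
}
```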
  • It is just not equivalent. sfence ensures that the store is visible but doesn't make sure that the load is fresh. So no atomic read at all. mfence is equivalent, good odds that it won't make any difference anymore. Maybe you meant lfence, hard to tell. – Hans Passant Nov 23 '18 at 10:43

1 Answer


For the second version to be functionally equivalent, you obviously need atomic 64-bit reads, which is true on your platform.

However, _MemoryBarrier() is not a "hint to the compiler": on x86 it prevents both compiler and CPU reordering, and also ensures global visibility after the write. You also probably only need the first _MemoryBarrier(); the second one could be replaced with a _ReadWriteBarrier(), unless a is also a shared variable. But you don't even need that, since you are reading through a volatile pointer, which prevents any compiler reordering in MSVC.
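To illustrate the distinction (a minimal sketch; both intrinsics are MSVC-specific, and `_ReadWriteBarrier()` is deprecated in recent toolchains):

```cpp
// Full fence vs. compiler-only fence before a volatile load (MSVC, x64).
#include <windows.h>
#include <intrin.h>

__int64 val;

__int64 read_with_full_fence() {
    MemoryBarrier();                  // compiler + CPU fence; on x64 this is
                                      // __faststorefence (lock or [rsp], 0)
    return *(volatile __int64*)&val;  // volatile load: MSVC won't reorder it
}

__int64 read_with_compiler_fence() {
    _ReadWriteBarrier();              // compiler-only fence; no instructions
    return *(volatile __int64*)&val;
}
```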

When you create this replacement, you basically end up with pretty much the same result:

// a = _InterlockedCompareExchange64((__int64*)&val, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR __int64 val, r8 ; val

// _MemoryBarrier(); a = *(volatile __int64*)&val;
lock or DWORD PTR [rsp], r8d
mov rax, QWORD PTR __int64 val ; val

Running these two in a loop, on my i7 Ivy Bridge laptop, gives equal results, within 2-3%.

However, with two memory barriers, the "optimized version" is actually around 2x slower.

So the better question is: why are you using _InterlockedCompareExchange64 at all? If you need atomic access to a variable, use std::atomic. An optimizing compiler will emit the most efficient instruction sequence for your architecture and add whatever barriers are needed to prevent reordering and make updates visible to other threads.
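For example (a minimal sketch, with illustrative names):

```cpp
// Sketch of the std::atomic equivalent (C++11 and later).
#include <atomic>
#include <cstdint>

std::atomic<int64_t> val{0};

int64_t read_seq_cst() {
    // Sequentially consistent load; on x86-64 this compiles to a plain MOV,
    // since x86 does not reorder loads with other loads.
    return val.load();  // memory_order_seq_cst is the default
}

int64_t read_relaxed() {
    // Atomicity only, no ordering guarantees; also a plain MOV on x86-64.
    return val.load(std::memory_order_relaxed);
}
```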

vgru
  • And btw, `__int64`? You should stick to standard typedefs from [`stdint.h`/`cstdint`](http://www.cplusplus.com/reference/cstdint/). – vgru Nov 22 '18 at 11:25
  • I am sorry, I previously used the misleading name `_MemoryBarrier()` instead of `MyBarrier()`. I am not using Microsoft's MemoryBarrier() macro. So the updated asm for the second version (the "optimized version") should not include `lock or DWORD PTR [rsp], r8d`, which is emitted by `MemoryBarrier()`. – cozmoz Nov 22 '18 at 11:26
  • Interlocked functions are easy to understand, and I **personally** hate using `std::atomic`, which is too complex for me. – cozmoz Nov 22 '18 at 11:31
  • @cozmoz: in that case, the resulting code will not guarantee that other threads will see values being updated in the program order. Anyway, as a C++ programmer, you should really take a moment of your time and read the docs for `std::atomic`. It's standard, it works, and, most of all, it lets you convey your intents explicitly. Do you only need an atomic read? Use `memory_order_relaxed`. Do you need to publish the changes across all threads with sequential consistency? Use `memory_order_seq_cst`. Right now, you are placing performance optimizations above code correctness and clarity. – vgru Nov 22 '18 at 12:41
  • Thanks. I always use interlocked functions to modify shared variables, so there's no problem in the producer threads. Since atomicity is not an issue on ARM64/x86-64, the only requirement is that consumer threads read the current value. The question is: if the variable is modified by an interlocked function in some producer thread, is the updated value immediately visible to another reader thread via a simple volatile read? – cozmoz Nov 23 '18 at 09:07
  • Using _mm_sfence() is much faster than __faststorefence()/MemoryBarrier() on my laptop; I'm not sure whether this still holds for newer CPUs. And I'm still not sure whether _mm_sfence() is needed before a read if all modifications to the shared variable are made by locked instructions. – cozmoz Nov 23 '18 at 10:35
  • That's right, x86 does not do many reorderings (only [loads with earlier stores](https://stackoverflow.com/a/6623662/69809)), so if you issue an `mfence` after a store it's fine. That's also what gcc does when you use load/store on an atomic variable (a store followed by `mfence`, and a plain load). [This thread](https://stackoverflow.com/q/19047327/69809) covers exactly this question; now that you've updated your question, it seems this was your actual question: do I need barriers when loading? Regarding `sfence`, it does not order loads, which is why it runs faster (see the sketch after these comments). – vgru Nov 23 '18 at 10:59
  • That's why I'd (again) recommend using `std::atomic` and letting C++ provide the best assembly for your architecture btw. :) – vgru Nov 23 '18 at 11:05
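To summarize the pattern discussed in these comments, a sketch (illustrative names; the producer publishes with a locked RMW, the consumer does a plain atomic load):

```cpp
// Producer/consumer pattern from the comments above (illustrative sketch).
#include <atomic>
#include <cstdint>

std::atomic<int64_t> counter{0};

void producer_update() {
    // Locked RMW: compiles to LOCK XADD on x86-64, which is a full barrier,
    // so the new value is globally visible before subsequent operations.
    counter.fetch_add(1);
}

int64_t consumer_read() {
    // A seq_cst load is a plain MOV on x86-64; no fence is needed on the
    // load side, matching the "fenced store, plain load" codegen mentioned.
    return counter.load();
}
```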