Assuming the architecture is ARM64 or x86-64:
I want to confirm whether these two methods are equivalent:
// Method 1:
a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
// Method 2:
MyBarrier(); a = *(volatile __int64*)p; MyBarrier();
where MyBarrier() is a compiler-level memory barrier only (a hint to the compiler, no CPU instruction), like __asm__ __volatile__ ("" ::: "memory").
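For reference, a minimal sketch of how MyBarrier() could be defined for both MSVC and GCC/Clang (the MSVC branch using _ReadWriteBarrier() is my assumption; it is deprecated but still available):

// Compiler-only fence: prevents the compiler from reordering memory
// accesses across this point; it emits no machine instruction.
#if defined(_MSC_VER)
  #include <intrin.h>
  #define MyBarrier() _ReadWriteBarrier()
#else
  #define MyBarrier() __asm__ __volatile__("" ::: "memory")
#endif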
So method 2 is supposed to be faster than method 1.
I have heard that the _Interlocked*() functions also imply a memory barrier at both the compiler and hardware level.
I have also heard that a properly aligned read of a native-width integer is atomic on these architectures, but I am not sure whether method 2 can be used safely in general.
(P.S. I assume the CPU handles data dependencies automatically, so a hardware barrier is not my main concern here.)
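For comparison, here is a sketch of how I think method 2 would be expressed with C++11 std::atomic (this is my assumption of the equivalent, not the code I am benchmarking; an acquire load compiles to a plain aligned MOV on x86-64 and typically to LDAR on ARM64):

#include <atomic>
#include <cstdint>

// Atomic 64-bit acquire load: no lock prefix, no full fence on x86-64.
int64_t load64(const std::atomic<int64_t>* p)
{
    return p->load(std::memory_order_acquire);
}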
Thank you for any advice/corrections on this.
Here are some benchmarks on Ivy Bridge (an i5 laptop).
(1E+006 loops: 27ms):
; __int64 a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR val$[rsp], rbx
(1E+006 loops: 27ms):
; __faststorefence(); __int64 a = *(volatile __int64*)p;
lock or DWORD PTR [rsp], 0
mov rcx, QWORD PTR val$[rsp]
(1E+006 loops: 7ms):
; _mm_sfence(); __int64 a = *(volatile __int64*)p;
sfence
mov rcx, QWORD PTR val$[rsp]
(1E+006 loops: 1.26ms, not synchronized?):
; __int64 a = *(volatile __int64*)p;
mov rcx, QWORD PTR val$[rsp]
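For completeness, a minimal sketch of the kind of timing loop behind these numbers (assumptions: QueryPerformanceCounter for timing and a volatile sink to keep the load from being optimized away; this is not my exact harness):

#include <windows.h>
#include <intrin.h>
#include <stdio.h>

volatile __int64 val = 0;
volatile __int64 sink;

int main()
{
    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    for (int i = 0; i < 1000000; ++i)   // 1E+006 loops
    {
        // Swap in one of the four variants being compared:
        __int64 a = _InterlockedCompareExchange64(&val, 0, 0);
        sink = a;   // keep the result observable so the loop is not removed
    }
    QueryPerformanceCounter(&t1);
    printf("%.3f ms\n", (t1.QuadPart - t0.QuadPart) * 1000.0 / f.QuadPart);
    return 0;
}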