In the Go's memory model nothing is stated about atomics and their relation to memory fencing.
Although many internal packages seem to rely on the memory ordering that could be provided if atomics created memory fences around them. See this issue for details.
After not understanding how it really works, I went to the sources, in particular src/runtime/internal/atomic/atomic_amd64.go and found following implementations of Load
and Store
:
//go:nosplit
//go:noinline
func Load(ptr *uint32) uint32 {
return *ptr
}
Store
is implemented in asm_amd64.s
in the same package.
TEXT runtime∕internal∕atomic·Store(SB), NOSPLIT, $0-12
MOVQ ptr+0(FP), BX
MOVL val+8(FP), AX
XCHGL AX, 0(BX)
RET
Both look as if they had nothing to do with parallelism.
I did look into other architectures but implementation seems to be equivalent.
However, if atomics are indeed weak and provide no memory ordering guarantees, than the code below could fail, but it does not.
As an addition I tried replacing atomic calls with simple assignments but it still produces consistent and "successful" result in both cases.
func try() {
var a, b int32
go func() {
// atomic.StoreInt32(&a, 1)
// atomic.StoreInt32(&b, 1)
a = 1
b = 1
}()
for {
// if n := atomic.LoadInt32(&b); n == 1 {
if n := b; n == 1 {
if a != 1 {
panic("fail")
}
break
}
runtime.Gosched()
}
}
func main() {
n := 1000000000
for i := 0; i < n ; i++ {
try()
}
}
The next thought was that the compiler does some magic to provide ordering guarantees. So below is the listing of the variant with atomic Store
and Load
not commented. Full listing is available on the pastebin.
// Anonymous function implementation with atomic calls inlined
TEXT %22%22.try.func1(SB) gofile../path/atomic.go
atomic.StoreInt32(&a, 1)
0x816 b801000000 MOVL $0x1, AX
0x81b 488b4c2408 MOVQ 0x8(SP), CX
0x820 8701 XCHGL AX, 0(CX)
atomic.StoreInt32(&b, 1)
0x822 b801000000 MOVL $0x1, AX
0x827 488b4c2410 MOVQ 0x10(SP), CX
0x82c 8701 XCHGL AX, 0(CX)
}()
0x82e c3 RET
// Important "cycle" part of try() function
0x6ca e800000000 CALL 0x6cf [1:5]R_CALL:runtime.newproc
for {
0x6cf eb12 JMP 0x6e3
runtime.Gosched()
0x6d1 90 NOPL
checkTimeouts()
0x6d2 90 NOPL
mcall(gosched_m)
0x6d3 488d0500000000 LEAQ 0(IP), AX [3:7]R_PCREL:runtime.gosched_m·f
0x6da 48890424 MOVQ AX, 0(SP)
0x6de e800000000 CALL 0x6e3 [1:5]R_CALL:runtime.mcall
if n := atomic.LoadInt32(&b); n == 1 {
0x6e3 488b442420 MOVQ 0x20(SP), AX
0x6e8 8b08 MOVL 0(AX), CX
0x6ea 83f901 CMPL $0x1, CX
0x6ed 75e2 JNE 0x6d1
if a != 1 {
0x6ef 488b442428 MOVQ 0x28(SP), AX
0x6f4 833801 CMPL $0x1, 0(AX)
0x6f7 750a JNE 0x703
0x6f9 488b6c2430 MOVQ 0x30(SP), BP
0x6fe 4883c438 ADDQ $0x38, SP
0x702 c3 RET
As you can see, no fences or locks are in place again.
Note: all tests are done on x86_64 and i5-8259U
The question:
So, is there any point of wrapping simple pointer dereference in a function call or is there some hidden meaning to it and why do these atomics still work as memory barriers? (if they do)