What is the point of atomic.Load and atomic.Store

Question

Go's memory model says nothing about atomics and their relation to memory fences.

Yet many internal packages seem to rely on the memory ordering that atomics would provide if they created memory fences around themselves. See this issue for details.

Since I did not understand how this really works, I went to the sources, in particular src/runtime/internal/atomic/atomic_amd64.go, and found the following implementations of Load and Store:

//go:nosplit
//go:noinline
func Load(ptr *uint32) uint32 {
    return *ptr
}

Store is implemented in asm_amd64.s in the same package.

TEXT runtime∕internal∕atomic·Store(SB), NOSPLIT, $0-12
    MOVQ    ptr+0(FP), BX
    MOVL    val+8(FP), AX
    XCHGL   AX, 0(BX)
    RET

Both look as if they had nothing to do with parallelism.

I did look into other architectures, but the implementations seem to be equivalent.

However, if atomics are indeed weak and provide no memory ordering guarantees, then the code below could fail, but it does not.

In addition, I tried replacing the atomic calls with plain assignments, and it still produces a consistent and "successful" result in both cases.


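package main

import (
    "runtime"
    // "sync/atomic" // needed once the atomic calls below are uncommented
)

func try() {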
    var a, b int32

    go func() {
        // atomic.StoreInt32(&a, 1)
        // atomic.StoreInt32(&b, 1)
        a = 1
        b = 1
    }()

    for {
        // if n := atomic.LoadInt32(&b); n == 1 {
        if n := b; n == 1 {
            if a != 1 {
                panic("fail")
            }
            break
        }
        runtime.Gosched()
    }
}

func main() {
    n := 1000000000
    for i := 0; i < n; i++ {
        try()
    }
}

My next thought was that the compiler does some magic to provide ordering guarantees. Below is the listing of the variant with the atomic Store and Load calls uncommented. The full listing is available on the pastebin.

// Anonymous function implementation with atomic calls inlined

TEXT "".try.func1(SB) gofile../path/atomic.go
        atomic.StoreInt32(&a, 1)
  0x816         b801000000      MOVL $0x1, AX
  0x81b         488b4c2408      MOVQ 0x8(SP), CX
  0x820         8701            XCHGL AX, 0(CX)
        atomic.StoreInt32(&b, 1)
  0x822         b801000000      MOVL $0x1, AX
  0x827         488b4c2410      MOVQ 0x10(SP), CX
  0x82c         8701            XCHGL AX, 0(CX)
    }()
  0x82e         c3          RET

// Important "cycle" part of try() function

 0x6ca          e800000000      CALL 0x6cf      [1:5]R_CALL:runtime.newproc
    for {
  0x6cf         eb12            JMP 0x6e3
        runtime.Gosched()
  0x6d1         90          NOPL
    checkTimeouts()
  0x6d2         90          NOPL
    mcall(gosched_m)
  0x6d3         488d0500000000      LEAQ 0(IP), AX      [3:7]R_PCREL:runtime.gosched_m·f
  0x6da         48890424        MOVQ AX, 0(SP)
  0x6de         e800000000      CALL 0x6e3      [1:5]R_CALL:runtime.mcall
        if n := atomic.LoadInt32(&b); n == 1 {
  0x6e3         488b442420      MOVQ 0x20(SP), AX
  0x6e8         8b08            MOVL 0(AX), CX
  0x6ea         83f901          CMPL $0x1, CX
  0x6ed         75e2            JNE 0x6d1
            if a != 1 {
  0x6ef         488b442428      MOVQ 0x28(SP), AX
  0x6f4         833801          CMPL $0x1, 0(AX)
  0x6f7         750a            JNE 0x703
  0x6f9         488b6c2430      MOVQ 0x30(SP), BP
  0x6fe         4883c438        ADDQ $0x38, SP
  0x702         c3          RET

As you can see, no fences or locks are in place again.

Note: all tests are done on x86_64 and i5-8259U

The question:

So, is there any point in wrapping a simple pointer dereference in a function call, or is there some hidden meaning to it? And why do these atomics still work as memory barriers (if they do)?

The XCHG instruction with a memory operand has an implicit LOCK which provides additional ordering guarantees over and above the x86's already strong default ordering of memory accesses. The fact that they've used this instruction instead of the simple MOV to memory instruction necessary for an atomic store suggests that these additional guarantees are required for `atomic.StoreInt32`. — Ross Ridge, Oct 27 '19 at 17:55
@RossRidge Could you elaborate on the strong default memory ordering in x86? I couldn't find any information on that. Could this be the reason why the example given in the second half of the question works even without atomics? — tna0y, Oct 27 '19 at 18:02
Sorry, I don't really know Go, so can't comment on your code, but you can see a summary of the x86 guarantees here: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/ — Ross Ridge, Oct 27 '19 at 18:18

Peter Cordes · Accepted Answer · 2019-10-27T18:46:29.473

I don't know Go at all, but it looks like the x86-64 implementations of .load() and .store() are sequentially-consistent. Presumably on purpose / for a reason!

//go:noinline on the load means the compiler can't reorder around a blackbox non-inline function, I assume. On x86 that's all you need for the load side of sequential-consistency, or acq-rel. A plain x86 mov load is an acquire load.

The compiler-generated code gets to take advantage of x86's strongly-ordered memory model, which is sequential consistency + a store buffer (with store forwarding), i.e. acq/rel. To recover sequential consistency, you only need to drain the store buffer after a release-store.
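As a minimal sketch of the acquire/release pattern described above, here is the question's writer/reader test rewritten so that every shared access goes through sync/atomic. The function name messagePassingFailures and the round count are illustrative, not from the original post. It assumes the behaviour that later revisions of the Go memory model (Go 1.19 and newer) document for sync/atomic: an atomic load that observes an atomic store is synchronized after it.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// messagePassingFailures (hypothetical helper) repeats the question's
// pattern `rounds` times and counts how often the reader observed
// b == 1 while a was still 0.
func messagePassingFailures(rounds int) int {
	failures := 0
	for i := 0; i < rounds; i++ {
		var a, b int32
		done := make(chan struct{})

		go func() {
			atomic.StoreInt32(&a, 1) // payload
			atomic.StoreInt32(&b, 1) // flag, stored after the payload
			close(done)
		}()

		// Spin until the flag is visible, then check the payload.
		for atomic.LoadInt32(&b) != 1 {
			runtime.Gosched()
		}
		if atomic.LoadInt32(&a) != 1 {
			failures++
		}
		<-done // wait for the writer before reusing the loop
	}
	return failures
}

func main() {
	fmt.Println("failures:", messagePassingFailures(100000))
}
```

Under that assumption the failure count should always be zero. With plain assignments instead of atomics, the program has a data race; on x86 the hardware still happens to preserve the order, but the compiler is free to reorder or hoist the accesses.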

.store() is written in asm, loading its stack args and using xchg as a seq-cst store.

XCHG with memory has an implicit lock prefix which is a full barrier; it's an efficient alternative to mov+mfence to implement what C++ would call a memory_order_seq_cst store.

It flushes the store buffer before later loads and stores are allowed to touch L1d cache. See also: Why does a std::atomic store with sequential consistency use XCHG?
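The difference between acq/rel and sequential consistency shows up in the classic store-buffering litmus test, sketched below in Go. Each goroutine stores to one variable and then loads the other. If the store could still sit in the store buffer while the following load executes, both loads could return 0. The function name bothZero is illustrative, and the sketch assumes sync/atomic operations are sequentially consistent, as the Go memory model documents since Go 1.19 and as the XCHG-based Store on amd64 suggests.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// bothZero (hypothetical helper) runs the store-buffering litmus test
// `rounds` times and counts the runs in which both goroutines read 0.
func bothZero(rounds int) int {
	count := 0
	for i := 0; i < rounds; i++ {
		var x, y, r1, r2 int32
		var wg sync.WaitGroup
		wg.Add(2)

		go func() {
			defer wg.Done()
			atomic.StoreInt32(&x, 1)  // store first...
			r1 = atomic.LoadInt32(&y) // ...then load the other variable
		}()
		go func() {
			defer wg.Done()
			atomic.StoreInt32(&y, 1)
			r2 = atomic.LoadInt32(&x)
		}()

		wg.Wait() // r1 and r2 are safe to read after Wait returns
		if r1 == 0 && r2 == 0 {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println("both-zero runs:", bothZero(100000))
}
```

With sequentially consistent atomics, at least one goroutine must see the other's store, so the count should stay at zero. A store implemented as a plain MOV followed by a plain MOV load would allow the both-zero outcome on x86, which is the reordering the full barrier in XCHG rules out.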

See

https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
C/C++11 mappings to processors: describes the sequences of instructions that implement relaxed load/store, acq/rel load/store, seq-cst load/store, and various barriers, on various ISAs. So you can recognize things like xchg with memory.
Does lock xchg have the same behavior as mfence? (TL:DR: yes except for maybe some corner cases with NT loads from WC memory, e.g. from video RAM). You may see a dummy lock add $0, (SP) used as an alternative to mfence in some code. IIRC, AMD's optimization manual even recommends this. It's good on Intel as well, especially on Skylake where mfence was strengthened by microcode update to fully block out-of-order exec even of ALU instructions (like lfence) as well as memory reordering. (To fix an erratum with NT loads.)
https://preshing.com/20120913/acquire-and-release-semantics/
