Questioning validity of PowerPC barriers in GCC-generated atomics

Question

GCC implements __sync_val_compare_and_swap on PowerPC[64] as:

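    sync                # full barrier before the atomic sequence
1:  lwarx 9,0,3         # load the word at (r3) into r9 and set a reservation
    cmpw 0,9,4          # compare the loaded value with the expected value (r4)
    bne 0,2f            # mismatch: skip the store
    stwcx. 5,0,3        # store r5 to (r3) only if the reservation still holds
    bne 0,1b            # reservation lost: retry
2:  isync               # trailing barrier that this question is about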
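
For reference, a minimal C function of the kind that should compile to a sequence like the one above when built with GCC for a PowerPC target. The wrapper name `cas32` is a placeholder introduced here, and the exact registers and labels in the output may differ between compiler versions and optimization levels.

```c
#include <stdint.h>

/* Hypothetical wrapper (name chosen for illustration only).
 * Atomically replaces *p with newval if *p == oldval, and returns the
 * value *p held before the operation, whether or not the swap happened. */
int32_t cas32(int32_t *p, int32_t oldval, int32_t newval)
{
    return __sync_val_compare_and_swap(p, oldval, newval);
}
```

With the usual PowerPC calling convention the three arguments arrive in r3, r4 and r5, which matches the operands of `lwarx`, `cmpw` and `stwcx.` in the listing.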

GCC documents for the __sync_* builtins:

> In most cases, these builtins are considered a full barrier. That is, no memory operand will be moved across the operation, either forward or backward. Further, instructions will be issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation.

However, the use of isync rather than sync at the end is bothering me. Is this actually a full barrier? Or:

- Could loads performed after the __sync_val_compare_and_swap fail to see stores performed before the store that produced the value __sync_val_compare_and_swap loaded?
- Could stores performed after the __sync_val_compare_and_swap be seen by other threads before they see the value stored by the __sync_val_compare_and_swap?
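
The two scenarios can be written down as small litmus tests. The sketch below shows only the thread bodies; all names (`data`, `flag`, `x`, `y`, the `s1_*`/`s2_*` functions and the `r_*` result variables) are hypothetical and introduced for illustration. The non-CAS accesses use `__atomic` builtins with relaxed ordering where possible so that the C code itself has no data races. The writer's release store in scenario 1 and the reader's acquire load in scenario 2 are assumptions added so that only the CAS side is under test.

```c
/* Shared locations, all initially zero (placeholder names). */
int data, flag;   /* scenario 1 */
int x, y;         /* scenario 2 */

/* Values observed by the reader threads. */
int r_flag, r_data, r_x, r_y;

/* Scenario 1: can a load after the CAS miss a store that was made
 * before the store whose value the CAS read? */
void s1_writer(void)
{
    __atomic_store_n(&data, 42, __ATOMIC_RELAXED);
    __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);   /* publish data */
}

void s1_reader(void)
{
    r_flag = __sync_val_compare_and_swap(&flag, 1, 2);
    r_data = __atomic_load_n(&data, __ATOMIC_RELAXED);
    /* Outcome in question: r_flag == 1 && r_data == 0 */
}

/* Scenario 2: can a store after the CAS become visible to another
 * thread before the value stored by the CAS does? */
void s2_writer(void)
{
    (void)__sync_val_compare_and_swap(&x, 0, 1);
    __atomic_store_n(&y, 1, __ATOMIC_RELAXED);
}

void s2_reader(void)
{
    r_y = __atomic_load_n(&y, __ATOMIC_ACQUIRE);
    r_x = __atomic_load_n(&x, __ATOMIC_RELAXED);
    /* Outcome in question: r_y == 1 && r_x == 0 */
}
```

Calling each writer and then its reader on a single thread can only give the in-order results. The outcomes in question could appear only when writer and reader run concurrently on different cores, typically over many iterations.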

If using GCC >= 4.7, the `__atomic_*` builtins are preferred, as they let you choose a C11/C++11 memory order (consume, acquire, release, acquire-release, or sequentially consistent) — minmaxavg, Sep 14 '18 at 03:02
@minmaxavg: I'm asking specifically about the `__sync` one where I **want** the full-barrier property that's stronger than the C11 memory model and I'm not clear that GCC is actually providing it. — R.. GitHub STOP HELPING ICE, Sep 14 '18 at 03:03
The `__ATOMIC_SEQ_CST` does provide the full barrier property you want. Besides, I'm also looking for the answer to this question since I'm curious about this one too. — minmaxavg, Sep 14 '18 at 03:06
@minmaxavg: It does not produce any difference from the `__sync` version. — R.. GitHub STOP HELPING ICE, Sep 14 '18 at 03:17
Related: [Does \`isync\` prevent Store-Load reordering on CPU PowerPC?](https://stackoverflow.com/q/43944411). I haven't read it fully. If `__ATOMIC_SEQ_CST` produces the same asm, then presumably there's some reason. I think seq-cst requires that later loads/stores can't become visible before the store part of the CAS. — Peter Cordes, Sep 14 '18 at 03:17
@PeterCordes: No, it's the same. I saw that other question but was unsure if it answers mine, since maybe the `stwcx.` is doing some magic that makes it work. — R.. GitHub STOP HELPING ICE, Sep 14 '18 at 03:18
[stwcx. is the write part of the read-modify-write primitive on PPC](https://www.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.alangref/idalangref_stwcx_instrs.htm). — A. Wilcox, Sep 14 '18 at 03:22
My apologies, I somehow misunderstood that you assumed `__sync` to provide a full barrier. Yes, it *is* the same as the `__sync_*` version (and also kind of acts as a fallback). — minmaxavg, Sep 14 '18 at 03:22
@A.Wilcox: yes, but does it have any ordering semantics at all, stronger than relaxed? — Peter Cordes, Sep 14 '18 at 03:22
It's the set of barriers PowerPC uses for a Seq/Cst read-modify-write. `isync` prevents speculative execution from accessing earlier operations (acquire), `lwsync` is used for 'release' guarantees and is replaced by `sync` in case of a seq/cst operation. — LWimsey, Sep 14 '18 at 03:24
@LWimsey: But to be a real seq_cst atomic, barriers are generally needed **on both sides** of it, not just before it. I just tested GCC's `__atomic_store` with seq_cst for ppc64 and it's totally wrong -- it's only a release barrier (`sync;stw`). — R.. GitHub STOP HELPING ICE, Sep 14 '18 at 03:27
@R.. I think the mistake here is the belief that seq/cst atomic operations act as a full barrier; they do not. The guarantee for an SC atomic store is that it has release semantics, an SC atomic load has acquire semantics, and in addition SC operations follow a global order wrt each other; but in isolation, SC operations are not full barriers. — LWimsey, Sep 14 '18 at 03:33
@LWimsey: What part of the spec allows them to only be acquire or release? I thought they had to be ordered with respect to other relaxed-order atomics? — R.. GitHub STOP HELPING ICE, Sep 14 '18 at 03:37
FYI `__atomic_store` with `__ATOMIC_SEQ_CST` and `__ATOMIC_RELEASE` seems to use `sync` and `lwsync`, respectively. Sequentially consistent ordering only guarantees a total order of memory operations wrt other `__ATOMIC_SEQ_CST` operations, not relaxed ones. I think GCC's documentation for `__sync` is indeed a bit misleading. So was I; I probably need a cup of coffee after skipping a night's sleep :/ — minmaxavg, Sep 14 '18 at 03:40
@R.. My comment was about seq/cst atomic loads (acquire) and stores (release). A seq/cst read-modify-write operation has both acquire and release semantics and therefore, a relaxed operation sequenced before (or after) a seq/cst RMW must be observed by other threads in the same order. The PowerPC barriers in your question enforce that behavior. — LWimsey, Sep 14 '18 at 04:13
In a [not related answer](https://stackoverflow.com/questions/47520748/c-memory-model-do-seq-cst-loads-synchronize-with-seq-cst-stores/47522708#47522708), I included some references to the C++ standard. You've used the C-tag, but it's my understanding that the memory models for both languages are similar (if not equivalent). — LWimsey, Sep 14 '18 at 04:13
@LWimsey: Thanks, that's very helpful. Unless I'm misunderstanding something though it looks like the atomic CAS here lacks acquire semantics too, which is a big problem if it's being used to implement a lock where access to non-atomic objects should be synchronized by it, no? — R.. GitHub STOP HELPING ICE, Sep 14 '18 at 14:23
(Maybe the release barrier at the previous seq_cst atomic that unlocked it guarantees the acquire semantics here, but if so I don't understand how that works in the hardware memory model.) — R.. GitHub STOP HELPING ICE, Sep 14 '18 at 14:43
[Example POWER Implementation for C/C++ Memory Model](http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html). isync is purely an instruction re-ordering prevention barrier and not a memory barrier at all. — danblack, Sep 21 '18 at 07:41
@danblack "_isync is purely an instruction re-ordering prevention barrier and not a memory barrier at all_" What's the difference? — curiousguy, Nov 28 '19 at 00:00
@R.. "_But to be a real seq_cst atomic, barriers are generally needed on both sides of it, not just before it._" But you only showed one atomic operation. What happens if you put more operations in sequence? Do barriers appear on both side? — curiousguy, Nov 28 '19 at 02:00
@curiousguy: You're missing the point that the other operations happen on other cores, not inline with this one or visible to the compiler emitting this asm. What `isync` being purely an instruction reordering barrier, not a memory barrier, means is that it has no influence on synchronization of memory between cores. — R.. GitHub STOP HELPING ICE, Nov 28 '19 at 02:17
@R.. I never realized that memory was explicitly synchronized between cores. — curiousguy, Nov 28 '19 at 03:11
@curiousguy: It's necessary whenever you have cache, which is necessary for a computer not to be something like 1000x slower. See https://en.wikipedia.org/wiki/Cache_coherence — R.. GitHub STOP HELPING ICE, Nov 28 '19 at 03:24
@R.. Without a cache, we would program by explicitly addressing many regions of memory, some faster than others. It would be a lot more complicated but not 1000x slower. — curiousguy, Nov 28 '19 at 20:47
@curiousguy: Nobody did that back in the days when cache was new, and it's not a viable programming model. It is potentially viable to do it via the MMU (essentially, treat SRAM as the only native memory the MMU maps and DRAM as a block device, using the OS page cache layer in place of hardware cache) but the idea is the same. In any case the 1000x figure is pretty close to accurate. Booting modern Windows with cache disabled takes something like half a day. — R.. GitHub STOP HELPING ICE, Nov 28 '19 at 20:52
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/203276/discussion-between-curiousguy-and-r). — curiousguy, Nov 28 '19 at 20:55

0 Answers