9

A full/general memory barrier is one where all the LOAD and STORE operations specified before the barrier will appear to happen before all the LOAD and STORE operations specified after the barrier with respect to the other components of the system.

According to cppreference, memory_order_seq_cst is equivalent to memory_order_acq_rel plus a single total modification order on all operations so tagged. But as far as I know, neither an acquire fence nor a release fence in C++11 enforces #StoreLoad (store followed by load) ordering. A release fence requires that no preceding read/write can be reordered past any subsequent write; an acquire fence requires that no subsequent read/write can be reordered before any preceding read. Please correct me if I am wrong ;)
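For contrast, here is a minimal sketch of my own (using the same x and y as in the example below) where, as far as I can tell, the reordering in question is clearly permitted because only a release fence is used:

y.store(1, memory_order_relaxed);            // (a) relaxed store
atomic_thread_fence(memory_order_release);   // (b) orders (a) only against *later stores*
x.load(memory_order_relaxed);                // (c) a later load - nothing orders it against (a),
                                             //     so it may be hoisted above the store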

Now the example in question, with a seq_cst fence:

atomic<int> x;
atomic<int> y;

y.store(1, memory_order_relaxed);            //(1)
atomic_thread_fence(memory_order_seq_cst);   //(2)
x.load(memory_order_relaxed);                //(3)

Is an optimizing compiler allowed to reorder instruction (3) to before (1), so that it effectively looks like:

x.load(memory_order_relaxed);                //(3)
y.store(1, memory_order_relaxed);            //(1)
atomic_thread_fence(memory_order_seq_cst);   //(2)

If this is a valid transformation, then it proves that atomic_thread_fence(memory_order_seq_cst) does not necessarily provide the semantics of a full barrier.

curiousguy
  • 8,038
  • 2
  • 40
  • 58
Eric Z
  • 14,327
  • 7
  • 45
  • 69
  • 2
  • It seems you are right in your conclusion: `memory_order_seq_cst` is a weaker requirement than a full barrier. That, of course, doesn't forbid implementing it as a full barrier on "traditional" architectures that work in terms of barriers. – Netch Aug 25 '14 at 09:17
  • I find a relevant [article](http://www.modernescpp.com/index.php/fences-as-memory-barriers) supporting your idea. – olist Jun 25 '18 at 16:37
  • Related: [How to achieve a StoreLoad barrier in C++11?](https://stackoverflow.com/q/60053973) - a seq_cst store + a seq_cst load of another variable works in practice on real CPUs (I'm pretty sure, even AArch64), but is not guaranteed by ISO C++. – Peter Cordes Apr 05 '22 at 03:14

3 Answers

7

atomic_thread_fence(memory_order_seq_cst) always generates a full-barrier.

  • x86_64: MFENCE
  • PowerPC: hwsync
  • Itanium: mf
  • ARMv7 / ARMv8: dmb ish
  • MIPS64: sync

The main thing: the observing thread can simply observe the operations in a different order, and it will not matter which fences you are using in the observed thread.

Is an optimizing compiler allowed to reorder instruction (3) to before (1)?

No, it isn't allowed. But in terms of what is globally visible to a multithreaded program, this holds only if:

  • other threads use memory_order_seq_cst for their atomic read/write operations on these variables
  • or other threads also use atomic_thread_fence(memory_order_seq_cst); between their load() and store() (see the sketch after this list) - but this approach doesn't guarantee sequential consistency in general, because sequential consistency is a stronger guarantee
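A minimal sketch of that second case (the thread functions, the result variables and the main() driver are mine, purely illustrative): both threads put a seq_cst fence between a relaxed store and a relaxed load. If I read § 29.3 / 4 of the draft cited below correctly, the two fences are totally ordered, so the load that follows the later fence must see the other thread's store - i.e. r1 == 0 && r2 == 0 cannot happen - even though, as said above, weaker operations do not get full sequential consistency back in general.

#include <atomic>
#include <thread>
using namespace std;                            // bare names, as in the question

atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;                           // each written by exactly one thread

void thread1()
{
    y.store(1, memory_order_relaxed);           // (1)
    atomic_thread_fence(memory_order_seq_cst);  // (2) fence on *this* side
    r1 = x.load(memory_order_relaxed);          // (3)
}

void thread2()                                  // the mirror image
{
    x.store(1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  // (2') fence on the *other* side too
    r2 = y.load(memory_order_relaxed);
}

int main()
{
    thread a(thread1), b(thread2);
    a.join();
    b.join();
    // r1 == 0 && r2 == 0 should never be observed here: the two fences are
    // ordered in the single total order S, and § 29.3 / 4 then forces the load
    // after the later fence to see the other thread's store.
}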

Working Draft, Standard for Programming Language C++ 2016-07-12: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf

§ 29.3 Order and consistency

§ 29.3 / 8

[ Note: memory_order_seq_cst ensures sequential consistency only for a program that is free of data races and uses exclusively memory_order_seq_cst operations. Any use of weaker ordering will invalidate this guarantee unless extreme care is used. In particular, memory_order_seq_cst fences ensure a total order only for the fences themselves. Fences cannot, in general, be used to restore sequential consistency for atomic operations with weaker ordering specifications. — end note ]


How it can be mapped to assembler:

Case-1:

atomic<int> x, y;

y.store(1, memory_order_relaxed);            //(1)
atomic_thread_fence(memory_order_seq_cst);   //(2)
x.load(memory_order_relaxed);                //(3)

This code isn't always equivalent in meaning to Case-2, but it produces the same instructions between the STORE and the LOAD as when both the LOAD and the STORE use memory_order_seq_cst - and that is sequential consistency, which prevents StoreLoad reordering. Case-2:

atomic<int> x, y;

y.store(1, memory_order_seq_cst);            //(1)

x.load(memory_order_seq_cst);                //(3)

With some notes:

  1. the mapping may add duplicate instructions (as in the MIPS64 example below)
  2. or it may use similar operations in the form of other instructions:

Guide for ARMv8-A

Table 13.1. Barrier parameters

ISH: Any - Any. This means that both loads and stores must complete before the barrier. Both loads and stores that appear after the barrier in program order must wait for the barrier to complete.

Reordering of two instructions can be prevented by putting additional instructions between them. And as we can see, a STORE(seq_cst) followed by a LOAD(seq_cst) generates the same instructions between them as a FENCE(seq_cst) (atomic_thread_fence(memory_order_seq_cst)).

Mapping of C/C++11 memory_order_seq_cst to different CPU architectures for load(), store(), and atomic_thread_fence():

Note that atomic_thread_fence(memory_order_seq_cst); always generates a full barrier:

  • x86_64: STORE-MOV (into memory), MFENCE, LOAD-MOV (from memory), fence-MFENCE
  • x86_64-alt: STORE-MOV (into memory), LOAD-MFENCE, MOV (from memory), fence-MFENCE
  • x86_64-alt3: STORE-(LOCK) XCHG, LOAD-MOV (from memory), fence-MFENCE - full barrier
  • x86_64-alt4: STORE-MOV (into memory), LOAD-LOCK XADD(0), fence-MFENCE - full barrier
  • PowerPC: STORE-hwsync; st, LOAD-hwsync; ld; cmp; bc; isync, fence-hwsync
  • Itanium: STORE-st.rel; mf, LOAD-ld.acq, fence-mf
  • ARMv7: STORE-dmb ish; str; dmb ish, LOAD-ldr; dmb ish, fence-dmb ish
  • ARMv7-alt: STORE-dmb ish; str, LOAD-dmb ish; ldr; dmb ish, fence-dmb ish
  • ARMv8(AArch32): STORE-STL, LOAD-LDA, fence-DMB ISH - full barrier
  • ARMv8(AArch64): STORE-STLR, LOAD-LDAR, fence-DMB ISH - full barrier
  • MIPS64: STORE-sync; sw; sync, LOAD-sync; lw; sync, fence-sync

All mappings of C/C++11 semantics to different CPU architectures for load(), store(), and atomic_thread_fence() are described here: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

Because sequential consistency prevents StoreLoad reordering, and because a sequentially consistent pair (a store(memory_order_seq_cst) followed by a load(memory_order_seq_cst)) generates the same instructions between the two accesses as atomic_thread_fence(memory_order_seq_cst), atomic_thread_fence(memory_order_seq_cst) prevents StoreLoad reordering as well.

Alex
  • 12,578
  • 15
  • 99
  • 195
0

C++ fences are not direct equivalents of CPU fence instructions, though they may well be implemented as such. C++ fences are part of the C++ memory model, which is all about visibility and ordering constraints.

Given that processors typically reorder reads and writes, and cache values locally before they are made available to other cores or processors, the order in which effects become visible to other processors is not usually predictable.

When thinking about these semantics, it is important therefore to think about what it is that you are trying to prevent.

Let's assume that the code is mapped to machine instructions as written, (1) then (2) then (3), and these instructions guarantee that (1) is globally visible before (3) is executed.

The whole purpose of the snippet is to communicate with another thread. You cannot guarantee that the other thread is running on any processor at the time that this snippet executes on our processor. Therefore the whole snippet may run uninterrupted, and (3) will still read whatever value was in x when (1) was executed. In this case, it is indistinguishable from an execution order of (3) (1) (2).
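To sketch that point in code (the writer() wrapper and its published result are mine, purely illustrative):

#include <atomic>
using namespace std;

atomic<int> x{0}, y{0};

int writer()                                        // the snippet from the question
{
    y.store(1, memory_order_relaxed);               // (1)
    atomic_thread_fence(memory_order_seq_cst);      // (2)
    return x.load(memory_order_relaxed);            // (3)
}

// Nothing guarantees that another thread - one that might store to x after
// seeing y == 1 - runs between (1) and (3). The whole of writer() may run
// uninterrupted, in which case (3) reads whatever was in x when (1) executed,
// and the result is indistinguishable from the execution order (3) (1) (2).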

So: yes, this is an allowed optimization, because you cannot tell the difference.

Anthony Williams
  • 66,628
  • 14
  • 133
  • 155
  • Does this rely on the assumption that "There is no other thread"? – yohjp Dec 09 '14 at 03:52
  • No. However, reasoning about these things is hard, and reasoning about one thread in isolation nigh on impossible. The whole point of fences and memory ordering constraints is to order things in relation to operations performed on other threads. You'd really need the code for the other threads to truly see. – Anthony Williams Dec 09 '14 at 17:40
  • @AnthonyWilliams If this is allowed, what happens if another thread reverses `x` and `y` and executes (in program order) `x.store(1, relaxed); fence(seq_cst); y.load(relaxed);` If that is optimized the same way, you will get `y.load(relaxed); x.store(1, relaxed); fence(seq_cst)`. Now both threads begin with the `load`. Doesn't that create the possibility that both read `0` (which should not be possible) ? – LWimsey Apr 17 '17 at 05:45
  • Yes: if there is another thread that does the reverse, then the compiler needs to ensure that the instructions it generates for both threads will ensure the correct visibility, and reading 0 from both threads won't happen (due to [atomics.order]p6 http://eel.is/c++draft/atomics.order#6). Quite how it does that is up to the compiler: it does not have to generate symmetrical code for the two cases, though it likely will. "Reasoning about one thread in isolation nigh on impossible" – Anthony Williams Apr 18 '17 at 08:01
0

According to Herb Sutter's talk (see about time 45:00), std::memory_order_seq_cst will enforce StoreLoad ordering, unlike std::memory_order_acq_rel.

abc
  • 212
  • 3
  • 14
  • I disagree; he doesn't say that a StoreLoad barrier is enforced, only that there is a barrier that prevents reordering of atomic variables. A StoreLoad barrier would prevent reordering of all variables, which certainly isn't the case with acquire/release semantics – Marin Veršić Jun 26 '20 at 10:43
  • A *fence* with seq_cst will order all atomic operations before/after, yes. But a seq_cst *operation* like `x.store(val, seq_cst)` will not, except as an implementation detail on many ISAs other than AArch64. – Peter Cordes Apr 05 '22 at 03:16