2

In the following code, how could one ensure that ptr is not incremented until after *ptr has been loaded/assigned/"extracted"?

extern int arr[some_constexpr]; // assume pre-populated
extern int* ptr; // assume points to non-atomic arr
int a = *ptr;
// want "memory barrier/fence" here
++ptr;

Would an atomic pointer ensure the correct ordering/sequencing?

#include <atomic>

extern int arr[some_constexpr];
extern std::atomic<int*> ptr;
int a = *(ptr.load());
// implicit "memory barrier" achieved here by use of atomics?
ptr.store(ptr + 1);

This relates to a lock-free queue shared between two threads. I want to ensure that the data associated with the pointer is not lost/corrupted before updating the pointer.

abc
  • Your first example is not very good, because C++ guarantees the results "as if" the statements were executed in order. If there are threads in play, you have to show how, exactly, they relate. – lvella Feb 26 '20 at 18:22
  • Wait a minute; is `int* ptr = arr;` a local variable in the first case? Its initializer is a compile-time (or link-time) constant; the address of an array in static storage. – Peter Cordes Feb 26 '20 at 18:26
  • Check [this thread](https://stackoverflow.com/questions/46415027/c-treiber-stack-and-atomic-next-pointers) – Victor Gubin Feb 26 '20 at 18:28
  • If you need more context, see the queue implementation I'm using: https://stackoverflow.com/questions/60139241/volatile-member-variables-vs-volatile-object – abc Feb 26 '20 at 18:29
  • @PeterCordes Treat ```int* ptr``` as though it is not a local variable. I included a link to my actual implementation in case that helps. – abc Feb 26 '20 at 18:30
  • @lvella That's why I'm asking how to ensure they are executed in order. – abc Feb 26 '20 at 18:32
  • `.load()` defaults to seq_cst which is stronger than acquire. See https://preshing.com/20120913/acquire-and-release-semantics/. But this doesn't make sense; in your 2nd example you have an atomic seq_cst load of ptr, then a non-atomic load of what it points to, then another seq_cst load of `ptr` again, then a seq_cst store to ptr. That's super different from what you're showing in the first example. In the first example, you have a fence after the non-atomic load from `*ptr`, instead of right after loading the shared value of `ptr` into some local temporary. – Peter Cordes Feb 26 '20 at 18:32
  • *how to ensure they are executed in order* - the thing you need to keep in mind is that execution order can be different from the order that *other threads* see the operations in. Barriers / ordering control global-visibility order, not execution order. OoO exec and compile-time reordering (as-if rule) can still work their magic, within the constraints laid down by the memory-ordering specified in the source. – Peter Cordes Feb 26 '20 at 18:34
  • @abc: your example would make more sense if you separated your shared vars from stuff that a single thread runs. Your mental model seems to mismatch with how things actually work, so you're going to need much clearer examples of what you want / want to avoid for anyone to be able to clear that up. (Or go read Jeff Preshing's articles for how things actually work, which is useful even if ISO C++'s formalism uses a different model.) – Peter Cordes Feb 26 '20 at 18:37
  • @PeterCordes Yes, I don't know why it has to get so convoluted. I feel like what I want is simple: guarantee that neither the compiler nor processor will re-order those two instructions. But it requires a PhD in understanding multi-threading and atomics to implement such an enforcement? – abc Feb 26 '20 at 18:38
  • You probably want to look at `memory_order_acquire` and `memory_order_release` for picking up and putting down the pointers for your queue. Happily, a lock-free one producer, one consumer queue is less tricky than multiple producers and/or multiple consumers. – Chris Hall Feb 26 '20 at 18:39
  • @PeterCordes I attached a link to the Queue implementation I'm using. The code here is basically the "pop" function of that (assuming queue is non-empty). I am using an embedded target (ARM Cortex M), so I define the "thread" priorities via interrupt. This "pop" function happens within the "low-priority" main thread, and the "push" function (not shown here, but similar) happens within the "high-priority" interrupt thread. – abc Feb 26 '20 at 18:40
  • ISO C++ doesn't talk about "instructions" or "statements" reordering, it talks about happens-before relationships. If you need a load to happen-before a store from the POV of other threads, make it an acquire or seq_cst load. https://preshing.com/20120913/acquire-and-release-semantics/ is basically what you need to understand to use atomics. Without atomic, there is zero guarantee of anything, and the compiler can assume that no other threads are writing any of the objects this thread accesses. – Peter Cordes Feb 26 '20 at 18:41
  • @ChrisHall I have multiple producers at the same priority (interrupt service routines all mapped to the same interrupt level), but yes only one consumer (lower priority "main" thread). – abc Feb 26 '20 at 18:41
  • @abc: "*But it requires a PhD in understanding multi-threading and atomics to implement such an enforcement?*" The problem is that you're trying to think of it in terms of "instructions" and "reordering" and the like, not in the terms of the C++ threaded memory model. What you *want* is to be able to set a value in one thread which is visible to another thread. Basically you've phrased it as an XY problem. – Nicol Bolas Feb 26 '20 at 18:42
  • So you're using interrupts to get mutual exclusion between producers? Semi-related: [MCU programming - C++ O2 optimization breaks while loop](//electronics.stackexchange.com/a/387478) is about the single-core interrupts = signal handler case. – Peter Cordes Feb 26 '20 at 18:42
  • @PeterCordes Okay, so since ```int a = *(ptr.load());``` implies a ```std::memory_order_seq_cst``` load of pointer (but then de-references it), does that also ensure that the de-referenced value will be loaded before ```ptr``` is subsequently stored in the increment instruction? – abc Feb 26 '20 at 18:43
  • @NicolBolas Yes, I suppose that may be the case. I work close to "the metal" as an embedded person, so I think in terms of what the hardware is actually doing. I guess I have to step back and consider the abstraction that C++ is placing on top of the physical execution. – abc Feb 26 '20 at 18:45
  • @PeterCordes There is no mutual exclusion, it is "lockless". The main thread will run (and may call pop). At any time, any number of interrupts assigned to the same, but higher-than-main priority level (so they cannot pre-empt each other, only main) could occur and perform a "push", which will read ```ptr``` and possibly update a value in ```arr``` depending on where ```ptr``` is at (i.e. if the circular buffer is full or not). – abc Feb 26 '20 at 18:49
  • @abc: No, the acquire load only guarantees that everything after it in program order becomes globally visible after `ptr.load()`. How would one acquire load put a barrier *after* some later non-atomic load? But in your specific example, yes but only because you also used a seq_cst store (which is even stronger than `release`). Since you're compiling for ARM with normal compilers, you can basically think of acquire / release as ordering access to the coherent shared view of memory that all threads share, instead of just abstract C++ where IRIW reordering is also possible. – Peter Cordes Feb 26 '20 at 18:55
  • If you care about efficiency, read `ptr` into a tmp variable! Don't load it twice for no reason. Since only the consume side modifies it, you don't need `.fetch_add`. – Peter Cordes Feb 26 '20 at 18:59
  • Related: re trying to think about C++ in terms of memory barriers: First of all you're on a single-core cortex-M so you don't need thread barriers at all, only atomic_signal_fence. But see [How to achieve a StoreLoad barrier in C++11?](//stackoverflow.com/q/60053973) and [Does atomic\_thread\_fence(memory\_order\_seq\_cst) have the semantics of a full memory barrier?](//stackoverflow.com/q/25478029) for more about the general case of the mismatch between thinking about C++ operations in terms of how it maps to asm / hardware memory models. – Peter Cordes Feb 26 '20 at 19:12
  • @PeterCordes Thanks – abc Feb 26 '20 at 19:18
  • @PeterCordes "_If you need a load to happen-before a store_" or rather, *happen-after* a store (the store happens *before* the load.) – curiousguy Apr 09 '20 at 04:17

2 Answers

1

When ptr is std::atomic<int*>, each of ++ptr, ptr++, and ptr.fetch_add(1, std::memory_order_acq_rel) ensures that no preceding/following loads/stores get reordered past/before this operation.

++ptr and ptr++ are essentially ptr.fetch_add(1, std::memory_order_seq_cst), and std::memory_order_seq_cst is almost always overkill here (I cannot give an example where it is not).

An even more efficient version for the single reader is:

int arr[some_constexpr];
std::atomic<int*> ptr;

int* p = ptr.load(std::memory_order_acquire);
int element = *p;
ptr.store(p + 1, std::memory_order_release);

The above is basically how boost::lockfree::spsc_queue is implemented.


As a side note, boost::lockfree::spsc_queue is truly wait-free (the strongest non-blocking guarantee of progress). Its push/pop operations each perform 1 relaxed load, 1 acquire load, and 1 release store, and it is fundamentally not possible to implement a single-producer-single-consumer queue with a FIFO order guarantee faster than that (implementation quality defects aside). It is often used as a benchmark for all other queues. You may like to look into it.

Maxim Egorushkin
  • 1. I disagree with your statement "ensure that no preceding/following loads/stores get reordered past/before this operation". From what I heard in Sutter's talk, any non-atomic loads/stores after the atomic increment can be re-ordered before the atomic increment. – abc Feb 26 '20 at 23:06
  • 2. Regarding your comment on ```std::memory_order_seq_cst```, it is my understanding that the only thing ```*_seq_cst``` gets you over the "plain" ```*_acq_rel```, is that atomic loads/acquires after atomic stores/releases cannot be re-ordered in the case of ```*_seq_cst```. So yes I would agree it's overkill in this case, and really all that is needed is a ```std::memory_order_release``` at the point of increment. – abc Feb 26 '20 at 23:08
  • Look through 36:25-45:25, e.g. 40:00 – abc Feb 26 '20 at 23:09
  • @abc Read https://en.cppreference.com/w/cpp/atomic/memory_order: `memory_order_acq_rel`: A read-modify-write operation with this memory order is both an acquire operation and a release operation. No memory reads or writes in the current thread can be reordered before or after this store. – Maxim Egorushkin Feb 26 '20 at 23:11
  • Okay, good point. Because an increment is a read/load/acquire *and* a write/store/release, it is essentially a full fence. – abc Feb 26 '20 at 23:13
  • @abc Also, I cannot recommend enough preshing.com, he even finds a few errors in that Sutter talk that Sutter admits. I couldn't stop reading it. – Maxim Egorushkin Feb 26 '20 at 23:15
  • Okay, good to know. I think now that I've been "initiated" via the talk, I will be able to understand what preshing is talking about. Previously it was going over my head. – abc Feb 26 '20 at 23:18
  • seq_cst is necessary in a few cases where you need to make sure your store is visible before you read something back to see the same view that other threads might have seen. e.g. double-checked locking is a canonical example. Another use-case I remember coming up with is [How to set bits of a bit vector efficiently in parallel?](//stackoverflow.com/a/45994685) where you read back to check for conflicts from multiple threads. – Peter Cordes Feb 26 '20 at 23:49
  • @abc Here is a [small reference](https://godbolt.org/z/zMYYGU). – Maxim Egorushkin Feb 26 '20 at 23:50
  • Also relevant is that seq_cst forbids IRIW reordering, although that doesn't generally matter either. Agreed that you usually don't need seq_cst. – Peter Cordes Feb 27 '20 at 00:02
  • @PeterCordes I would like to see examples for `seq_cst`. `gcc` and `clang` use [`std::once_flag` for function statics](https://gcc.godbolt.org/z/G8Kxfb) that must be thread safe in C++11, it doesn't use `seq_cst` directly or at all. It rather calls [`__cxa_guard_acquire`](https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libstdc%2B%2B-v3/libsupc%2B%2B/guard.cc#L243), which only uses `acq_rel`. – Maxim Egorushkin Feb 27 '20 at 00:04
  • @PeterCordes I mention _function statics_ because that is one robust way to implement thread-safe singletons in C++ and I assume singletons are the reason you mention double-locking. Not that C++11 thread safe _function statics_ advise or approve singletons, I can only recommend against using singletons. – Maxim Egorushkin Feb 27 '20 at 00:43
  • Oh, you're right. I was mis-remembering what double-checked locking requires. Wikipedia shows it working with only `atomic_thread_fence()` acq and rel for a singleton. And no, IDK why anyone would ever want singletons. Or what practical use-case https://preshing.com/20120515/memory-reordering-caught-in-the-act/ is boiled down from. My bit-vector hack needs a full barrier, and does use the GNU C `__atomic` equivalent of C++20 std::atomic_ref with `mo_seq_cst`, but only to get a full barrier as an implementation detail. It's full of C++ undefined behaviour, even moreso than a seqlock. – Peter Cordes Feb 27 '20 at 03:08
0

According to Herb Sutter's talk (which I highly recommend for trying to understand this stuff), e.g. see 36:25-45:25, the atomic example given in the OP suffices to ensure that a is assigned before ptr is incremented.

This is because incrementing ptr involves a store (it is being written), and therefore acts as a "release" given the default memory order of std::memory_order_seq_cst (the same would hold for *_acq_rel or *_release in this case). This means that nothing occurring before the increment/store/release of ptr can be re-ordered to occur after it, whether explicitly or implicitly, by the compiler, the processor, and/or the cache.

No explicit memory fence is required beyond the ordering already implied by the atomic loads/stores; standalone fences are actually discouraged in the talk.

abc