
The following code implements some lock-free (and atomic-free!) inter-thread communication that requires the use of store and load memory barriers, but C++11 release-acquire semantics are neither appropriate for it nor guarantee its correctness. Actually, the algorithm exposes the need for a kind of inversion of the release-acquire semantics, i.e. to signal that some operation did *not* take place rather than that it did.

volatile bool valid=true;
volatile uint8_t blob[1024] = {/*some values*/};

void zero_blob() {
    valid=false;
    STORE_BARRIER;
    memset(blob,0,1024);
}

int32_t try_get_sum(size_t index_1, size_t index_2) {
    uint8_t res = blob[index_1] + blob[index_2];
    LOAD_BARRIER;
    return valid ? res : -1; 
}
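For concreteness, the STORE_BARRIER and LOAD_BARRIER placeholders above could be defined per architecture roughly as follows. This is only a sketch matching the mapping I describe below (the aarch64 case and the generic fallback are my additions); note also that memset cannot take a volatile pointer, so the wipe becomes a loop:

```cpp
#include <atomic>   // only for the generic fallback
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
  // x86 is TSO: store/store and load/load are already ordered in hardware,
  // so only a compiler barrier is needed.
  #define STORE_BARRIER __asm__ __volatile__("" ::: "memory")
  #define LOAD_BARRIER  __asm__ __volatile__("" ::: "memory")
#elif defined(__sparc__)
  #define STORE_BARRIER __asm__ __volatile__("membar #StoreStore" ::: "memory")
  #define LOAD_BARRIER  __asm__ __volatile__("membar #LoadLoad"  ::: "memory")
#elif defined(__powerpc__) || defined(__PPC__)
  #define STORE_BARRIER __asm__ __volatile__("lwsync" ::: "memory")
  #define LOAD_BARRIER  __asm__ __volatile__("lwsync" ::: "memory")
#elif defined(__aarch64__)
  #define STORE_BARRIER __asm__ __volatile__("dmb ishst" ::: "memory")
  #define LOAD_BARRIER  __asm__ __volatile__("dmb ishld" ::: "memory")
#else
  // conservative fallback: a full fence
  #define STORE_BARRIER std::atomic_thread_fence(std::memory_order_seq_cst)
  #define LOAD_BARRIER  std::atomic_thread_fence(std::memory_order_seq_cst)
#endif

volatile bool valid = true;
volatile uint8_t blob[1024] = {1, 2, 3, 4};

void zero_blob() {
    valid = false;
    STORE_BARRIER;                            // order the flag store before the wipe
    for (std::size_t i = 0; i < sizeof blob; ++i)  // memset can't take a volatile*
        blob[i] = 0;
}

int32_t try_get_sum(std::size_t index_1, std::size_t index_2) {
    uint8_t res = blob[index_1] + blob[index_2];
    LOAD_BARRIER;                             // order the data loads before the flag load
    return valid ? res : -1;
}
```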

I'm able to make this code correct on all hardware architectures simply by using native memory barriers, e.g. on Intel there is no need for any memory barriers here, on Sparc (RMO) membar #StoreStore and membar #LoadLoad, on PowerPC lwsync for both. So it's no big deal, and the code is a typical example of using store and load barriers. Now, what C++11 construct should I use to make the code correct, assuming that I don't want to convert 'blob' to std::atomic objects? That would make 'blob' a guard object and the variable 'valid' a guarded one, whereas it's the other way around. Converting the variable 'valid' to a std::atomic object is OK for me, but then there are no barriers to guarantee correctness. To make it clear, let's consider the following code:

volatile std::atomic<bool> valid{true};
volatile uint8_t blob[1024] = {/*some values*/};

void zero_blob() {
    valid.store(false, std::memory_order_release);
    memset(blob,0,1024);
}

int32_t try_get_sum(size_t index_1, size_t index_2) {
    uint8_t res = blob[index_1] + blob[index_2];
    return valid.load(std::memory_order_acquire) ? res : -1; 
}

The code is incorrect, as the barriers are placed in the wrong places: writing to 'blob' can precede writing to 'valid', and/or loading from 'valid' can precede loading from 'blob'. I thought that in order to deal with such constructions C++11 provided std::atomic_thread_fence, and that the code should be:

volatile std::atomic<bool> valid{true};
volatile uint8_t blob[1024] = {/*some values*/};

void zero_blob() {
    valid.store(false, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    memset(blob,0,1024);
}

int32_t try_get_sum(size_t index_1, size_t index_2) {
    uint8_t res = blob[index_1] + blob[index_2];
    std::atomic_thread_fence(std::memory_order_acquire);
    return valid.load(std::memory_order_relaxed) ? res : -1; 
}

Unfortunately C++11 says:

A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.

which clearly states that the std::atomic_thread_fence calls must be placed on the opposite sides of the operations on the atomic object, i.e. the reverse of where my code needs them.
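To illustrate the pairing that the quoted wording does bless: the release fence goes *before* the atomic store and the acquire fence *after* the atomic load. A minimal sketch with hypothetical names (`payload`, `ready`):

```cpp
#include <atomic>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // guard flag

void producer() {
    payload = 42;                                         // 1. plain write
    std::atomic_thread_fence(std::memory_order_release);  // 2. fence...
    ready.store(true, std::memory_order_relaxed);         // 3. ...then atomic store
}

bool consumer(int* out) {
    if (!ready.load(std::memory_order_relaxed))           // 1. atomic load...
        return false;
    std::atomic_thread_fence(std::memory_order_acquire);  // 2. ...then fence
    *out = payload;                                       // 3. guaranteed to see 42
    return true;
}
```

This is exactly the shape my code cannot use: here the data accesses come before the store and after the load, whereas zero_blob/try_get_sum need the opposite order.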

LATER EDIT

Below please find a much more usable example:

volatile uint64_t clock=1;
volatile uint8_t blob[1024] = {/*some values*/};

void update_blob(uint8_t vals[1024]) {
    clock++;
    STORE_BARRIER;
    memcpy(blob,vals,1024);
    STORE_BARRIER;
    clock++;
}

int32_t try_get_sum(size_t index_1, size_t index_2) {
    uint64_t snapshot = clock;
    if(snapshot & 0x1) {
        LOAD_BARRIER;
        uint8_t res = blob[index_1] + blob[index_2];
        LOAD_BARRIER;
        if(snapshot == clock)
            return res;
    }
    return -1;
}
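For what it's worth, this pattern can be expressed within the C++11 memory model along the lines of Boehm's seqlock analysis (HPL-2012-68), at the price of making the data elements relaxed atomics, which is exactly what I wanted to avoid for 'blob'. A sketch (the atomic mirror `blob2` and variable names are illustrative, not my original code):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

std::atomic<uint64_t> clk{1};      // odd value = blob is stable
std::atomic<uint8_t> blob2[1024];  // hypothetical atomic mirror of blob

void update_blob(const uint8_t vals[1024]) {
    uint64_t c = clk.load(std::memory_order_relaxed);
    clk.store(c + 1, std::memory_order_relaxed);          // even: write in progress
    std::atomic_thread_fence(std::memory_order_release);  // orders the clock bump
                                                          // before the data stores
    for (std::size_t i = 0; i < 1024; ++i)
        blob2[i].store(vals[i], std::memory_order_relaxed);
    clk.store(c + 2, std::memory_order_release);          // odd again: stable
}

int32_t try_get_sum(std::size_t index_1, std::size_t index_2) {
    uint64_t snapshot = clk.load(std::memory_order_acquire);
    if (snapshot & 0x1) {                                 // stable when we started
        uint8_t res = blob2[index_1].load(std::memory_order_relaxed)
                    + blob2[index_2].load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);  // data loads can't
                                                              // sink below re-check
        if (clk.load(std::memory_order_relaxed) == snapshot)
            return res;
    }
    return -1;
}
```

If the re-check sees the same odd snapshot, the writer's first (even) clock store would have had to become visible via the fence-to-fence synchronization through the data, so a torn read is always detected and discarded.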
curiousguy
dervih
  • I think your code has a _data race_ in it, which is UB according to the C++ Standard. What if the `memset` and the reads of `blob[index]` happen at the same time? The Standard doesn't say that `res` would be _unspecified_ then, it [clearly says that this is UB](http://eel.is/c++draft/intro.races#21.sentence-3). Of course, it may work with your implementation/environment, but I would advise against such code. – Daniel Langr Sep 05 '19 at 16:25
    [`volatile` is not useful to any of your code.](https://stackoverflow.com/a/12878500/734069) If you're making an `atomic` value `volatile`, then you're almost certainly doing it wrong. – Nicol Bolas Sep 05 '19 at 16:57
  • `volatile std::atomic` well... that's a new one to add to my list of misuse of volatile – Guillaume Racicot Sep 05 '19 at 19:32
  • @NicolBolas What's "wrong" with a volatile atomic variable? – curiousguy Sep 06 '19 at 06:41
  • @curiousguy: It's wrong in that `volatile` isn't helpful in the code accomplishing its goal, and its presence strongly suggests that the writer is suffering from a common mistake: the belief that `volatile` has anything to do with the visibility or thread-safety of actions. – Nicol Bolas Sep 06 '19 at 13:27
  • "`atomic_thread_fence(std::memory_order_release);`" How would "release" say anything about stuff that have not happened yet? Releasing means the past is past; it creates a past. It doesn't creates a future (acquire does). – curiousguy Sep 06 '19 at 15:35
  • @NicolBolas There is no rule that volatile cannot be used with threads for shared variables. (Even a volatile scalar is usable for inter threads communication in a few cases.) The poster here clearly doesn't expects volatile to make code thread safe in itself; he specifically asks about the additional stuff needed. Here volatile is expected to force the compiler to have predictable behavior. Its presence here strongly suggests that the writer has the right intuition. Dismissing the Q on the basis of volatile is very wrong. – curiousguy Sep 06 '19 at 15:47
  • @curiousguy: "*The poster here clearly doesn't expects volatile to make code thread safe in itself*" Considering that the first code (the non-`atomic` version) very much does expect `volatile` to make the code thread-safe, I would not make that assumption. The fact that the user kept `volatile` in the `atomic` version also adds to the idea that the OP probably doesn't realize that it isn't helping. – Nicol Bolas Sep 06 '19 at 15:49
  • @dervih: "*Below please find much more usable example:*" That's not something you can implement without a genuine mutex or a spinlock. Those "barriers" have to be something to prevent simultaneous execution, so that the code within `try_get_sum` doesn't execute while the `memcpy` is still going on. And so that the `memcpy` doesn't start executing while the other thread is reading the data. – Nicol Bolas Sep 06 '19 at 16:15
  • The "more usable example" is essentially "SeqLock". It is possible to implement for SINGLE writer, multiple readers without locking, but I think it is much better to go ahead and mutex lock the write. If there is only one writer the lock will never be contended for, so it is relatively cost free, and prevents problems if multiple writers ever occur. Web search "SeqLock" for a lot of information, and a number of implementations. – ttemple Sep 07 '19 at 02:16
  • @NicolBolas I'm fully aware that 'volatile' is not provided for the purpose of multi-threading and it never has. However it's poorly known that 'volatile' not only obligates a compiler to respect each read and write but also not to reorder surrounding code that may have visible side effect (https://en.cppreference.com/w/cpp/language/cv) Here I used volatile as I mixed accesses to a std::atomic object with non-atomic objects and wanted to be 100% sure that a compiler (not CPU) won't reorder them. – dervih Sep 09 '19 at 08:21
  • @dervih: The memory order operations prevent reordering, even/especially of the non-atomic accesses around the atomic ones. If you write to a non-atomic, then do an atomic write with a memory-order-release, acquire operations the atomic which see your written value *will also* see any writes you made before the release to objects other than the atomic. That is what the memory orders are for; they control the visibility of things other than the atomic you fetched from. – Nicol Bolas Sep 09 '19 at 13:24
  • @NicolBolas You're right, but only when it comes to usage scenarios captured by the C++11 standard, typically the release-acquire semantics of a critical section. But in my case I wanted to prevent after-release writes from being reordered before the release, and before-acquire loads from being reordered after the acquire. And fortunately it can be achieved using 'std::atomic_thread_fence', but all operations before as well as after must be done on 'std::atomic's to meet the C++11 standard. I've found a comprehensive answer to my problem at: https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf – dervih Sep 09 '19 at 16:47
  • @DanielLangr Do you at least agree that a race condition on a volatile object is harmless as the CPU doesn't care about data races? (as long as you don't expect more than what the CPU guarantees) – curiousguy Sep 12 '19 at 12:58
  • @curiousguy Quote from [this answer](https://stackoverflow.com/a/4558031/580083): _`volatile` is (nearly) useless for platform-agnostic, multithreaded application programming. It **does not provide any synchronization**, it does not create memory fences, nor does it ensure the order of execution of operations. **It does not make operations atomic.** **It does not make your code magically thread safe.** `volatile` may be the single-most misunderstood facility in all of C++._ Another [answer](https://stackoverflow.com/a/2485177/580083) on this topic. – Daniel Langr Sep 13 '19 at 05:43
  • @DanielLangr Does that contradict what I asserted? An operation that is atomic at the CPU level (like word size store) is atomic when done as a volatile scalar write operation, because the C/C++ abstraction resolves to the CPU assembly instruction, and **the issue of data race disappears, because the CPU doesn't care about those**. So it is *not* useless; but it's rarely useful and almost never sufficient because of what you wrote. – curiousguy Sep 13 '19 at 11:37
  • @curiousguy Are you sure that memory stores and load are atomic at all architectures when one might compile and run a C++ program? – Daniel Langr Sep 13 '19 at 11:41
  • @DanielLangr No program is ever guaranteed to be portable to all architectures. All CPU I know of guarantee word size loads and stores (for a scalar variable). – curiousguy Sep 13 '19 at 11:44
  • @curiousguy Not all architectures, but all architectures where C++ is implemented according to the C++ Standard. – Daniel Langr Sep 13 '19 at 11:47
  • @DanielLangr No real world, useful program is portable to "all architectures where C++ is implemented according to the C++ Standard" (which is probably zero as I don't believe there can be such thing as an implementation of the C++ std). – curiousguy Sep 13 '19 at 11:53

3 Answers


According to the memory_order article, to be conservatively safe you need to use memory_order_release on the store and memory_order_acquire on the load (both on the same atomic variable).

So:

 std::atomic<int> var;

 // Writer
 // something important <happens-before> writing 42 in the writer thread
 var.store(42, std::memory_order_release);

 // Reader
 auto result = var.load(std::memory_order_acquire);
 if (result == 42) {
    // transitively, as the result's new value is observed, the "something important" is here too
 }

More generally, depending on what effect you need to achieve and what your target architecture supports, you can do it less conservatively.

You would generally prefer std::atomic_flag over std::atomic<bool>, as the former is guaranteed to be lock-free, unlike the latter.
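A usage sketch of std::atomic_flag, showing the mandatory ATOMIC_FLAG_INIT initializer (pre-C++20). Note the interface offers only test_and_set/clear, with no plain load before C++20's test(), which restricts it to lock/flag duty (the function names here are illustrative):

```cpp
#include <atomic>

std::atomic_flag in_use = ATOMIC_FLAG_INIT;  // required initializer pre-C++20

bool try_acquire() {
    // true if we took the flag (it was previously clear)
    return !in_use.test_and_set(std::memory_order_acquire);
}

void release() {
    in_use.clear(std::memory_order_release);
}
```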

Lastly - why not start with a mutex-protected critical section, or even better, push the updates to the consumer via a lock-free ring buffer, so they don't share anything?

bobah
  • `std::atomic_flag` has nastiness from requiring `ATOMIC_FLAG_INIT` though. – Max Langhof Sep 05 '19 at 16:13
  • @MaxLanghof - yeah, no free cheese – bobah Sep 05 '19 at 16:14
  • No, I would generally prefer std::atomic over std::atomic_flag as I don't want slow excessive atomic instructions in my code merely to load and store values. – dervih Sep 05 '19 at 18:15
  • @bobah I don't think you've understood the problem I've described. It's impossible to get a std::atomic object together with release-acquire barriers to work in my case. – dervih Sep 09 '19 at 16:57

Below please find a much more usable example:

Allow me to restate what you're essentially doing.

Your code is reading from memory. That memory may be updated at any time by some other thread. But you do not want to impose an execution synchronization (that is, some kind of mutex) between the reader and writer. So instead, you build a system that allows you to detect whether the reads you have already performed may have been overwritten by some other thread. And if they were, you just ignore the values you read.

C++ doesn't let you do that.

If you are reading from a non-atomic object which is potentially written by some other thread, without appropriate execution and memory synchronization between the two operations, then you have a data race. The presence of a data race does not mean that you may read incorrect values; it means that your code has undefined behavior.

You cannot undo UB. Once your program enters undefined-behavior land, all bets are off. At least, as far as the standard is concerned.

You can rewrite your code to work within the standard, of course. But it has to prevent the execution of the reads while a write may be happening, not merely checking after the fact to see if the read is OK. If the write is always 1KB, an atomic-based spinlock is probably appropriate for the writing function, and the reader can just return -1 if the atomic lock isn't available.
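A sketch of that suggestion, under my assumptions about its shape (the names are illustrative): the writer spins until it owns an atomic flag, while the reader merely *tries* the flag and reports -1 instead of waiting.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

std::atomic<bool> busy{false};  // hypothetical spinlock flag
uint8_t blob[1024];

void update_blob(const uint8_t vals[1024]) {
    while (busy.exchange(true, std::memory_order_acquire))
        ;                                    // writer spins until it owns the lock
    std::memcpy(blob, vals, sizeof blob);
    busy.store(false, std::memory_order_release);
}

int32_t try_get_sum(std::size_t index_1, std::size_t index_2) {
    if (busy.exchange(true, std::memory_order_acquire))
        return -1;                           // lock unavailable: give up, don't wait
    int32_t res = blob[index_1] + blob[index_2];
    busy.store(false, std::memory_order_release);
    return res;
}
```

One caveat of this simple form: readers also exclude each other, since they take the same flag; a multi-reader version would need a reader count or the SeqLock approach discussed next.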

You can also write this kind of system (apparently called a "SeqLock") using C++ atomics, with the full and complete knowledge that it invokes UB as far as the standard is concerned. As long as the types you're copying are trivial, it'll work just fine.

Note that the C++ Concurrency TS 2 will include a feature that allows SeqLock implementations. This will hopefully make it into the full standard with C++23.

Nicol Bolas
  • 100% right, and I would add unfortunately so, because such a construction works on all hardware I know of, including Intel, Sparc and PowerPC, and is the essence of the SeqLock used in the Linux kernel and also of the fastest algorithms for Software Transactional Memory. Please mind that the only incorrect thing is concurrent data access beyond std::atomic, which C++11 deliberately turned into undefined behavior, following a theoretical memory model and exotic hardware having little in common with leading-edge C++ applications. Moreover, lots of pre-C++11 multi-thread code became incorrect, just like my example. – dervih Sep 09 '19 at 17:18
  • @dervih: None of your examples were ever "correct"; they were only what you got away with. That is, they were things which happened to compile to machine code that appeared to do what you wanted. That's still just as true post-C++11 as pre-C++11. – Nicol Bolas Sep 09 '19 at 17:34
  • But don't you think it's actually a failure of the C++ Standard Committee that prior to C++11 practically all multi-thread code relied on compilers and hardware to run correctly and with C++11 little has changed, especially when it comes to existing high-performance / low-latency / scalable applications. C++ is not chosen for applications because it's a nice language but rather because it allows you to get from the machine 100% of the power. Therefore it's sad that some high-performance techniques meet undefined behavior. Anyway very happy that C++ finally has some memory model. – dervih Sep 09 '19 at 18:00
  • @dervih: "*with C++11 little has changed*" Um, I don't buy that. Prior to C++11, you could not write threading code *at all* with well-defined behavior in accord with the standard. Having the ability to write *something* that is well-defined is an improvement. It may not allow *everything*, and it certainly doesn't make your old ad-hoc code correct, but that doesn't mean that not much has changed. Even this particular issue is currently being examined [for a C++23 fix with the Concurrency TS 2.0](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1478r2.html). – Nicol Bolas Sep 09 '19 at 18:23
  • GREAT :) It seems that speculative reads have a chance to become legal in C++ eventually, and all those fast Software Transactional Memory algorithms relying on them won't be condemned any more. You've made my day, as now I've got an argument that such an approach is seriously considered to be a legitimate programming technique. THANKS! – dervih Sep 09 '19 at 18:53
  • Almost all programs have UB one way or another; you can argue that no program ever has defined behavior. What matters is the kind of UB. – curiousguy Sep 12 '19 at 12:54
  • @curiousguy _Almost all programs have UB one way or another;_ What? Do you mean that all programs have some bugs? Or even that well-written programs do not have defined behavior? Why? That's exactly why we have the Standard, to define behavior of programs written according to its rules. – Daniel Langr Sep 13 '19 at 05:55
  • @DanielLangr How do you have defined behavior in a MT program? You can't either in C or C++. It's impossible as these programs aren't sequential and C and C++ have UB (unlike Java). Also, programs that access base classes don't have defined behavior as pointers converted to a base still point to the derived object. Also programs that use string literals as these `char` objects are never created (their lifetime doesn't start). – curiousguy Sep 13 '19 at 11:34
  • @curiousguy Don't understand. Why a MT program has no defined behaivor? Could you provide any example? The C++ Standard clearly defines behavior of a _well-written_ MT program. Could you as well provide an example for the second case? There are well-defined cases for conversion between pointers to base and derived objects. – Daniel Langr Sep 13 '19 at 11:39
  • @DanielLangr 1) Please show that you can prove the correctness of any C or C++ MT program. You can't, because of possible UB. You can't ever exclude UB: ppl just assume that UB didn't occur. **You can't reason about MT programs in a language that has UB.** 2) Please show where a conversion of a ptr to a base has a well defined, usable value: what does the converted ptr points to? – curiousguy Sep 13 '19 at 11:42
  • @curiousguy Any (MT) program written according to the C++ Standard rules. UB means [not following these rules](http://eel.is/c++draft/defns.undefined). As for the second case, a pointer to base may, for example, point to the base subobject of a derived object. What's undefined here? – Daniel Langr Sep 13 '19 at 11:45
  • @DanielLangr 1) Please show a program proof then. There is no such thing. There is no rule that can be used to avoid UB. You can't have a valid program because you can't exclude UB in MT semantics (even with only one thread). **MT means not sequential and you can't guarantee anything.** 2) Where is the base ptr conversion well defined? – curiousguy Sep 13 '19 at 11:47
  • @curiousguy Very clearly here: http://eel.is/c++draft/conv.ptr#3.sentence-1. For MT programs, the proof is the Standard. In fact, I completely don't follow your conclusions and it seems to me that this discussion makes no sense. However, I would very like to see you write a separate question about what you say. It might be quite interesting to see opinions of others. – Daniel Langr Sep 13 '19 at 11:51
  • @DanielLangr Your statement "the proof is the Standard" is wrong: there is no single proof of program correctness in the std, not even for trivial program. If I ask a very difficult question, it will be deleted anyway. – curiousguy Sep 13 '19 at 11:52
  • @curiousguy It won't. Nobody will delete your question. You only risk downvotes or it's closing. However, if you are so sure about this, you should not be afraid. If you are right, you will get many upvotes. It would be a giant breakthrough in C++ to prove that there is no program with defined behavior. – Daniel Langr Sep 13 '19 at 11:56

Your example is correct. I don't think the standard is very clear when it comes to how non-atomics are handled.

std::atomic<bool> valid{true};           // removed volatile
uint8_t blob[1024] = {/*some values*/};  // removed volatile

void zero_blob() {
    valid.store(false, std::memory_order_relaxed);            // A)
    std::atomic_thread_fence(std::memory_order_release);      // B)
    memset(blob,0,1024);                                      // C)
}                                                              

int32_t try_get_sum(size_t index_1, size_t index_2) {          
    uint8_t res = blob[index_1] + blob[index_2];              // D)
    std::atomic_thread_fence(std::memory_order_acquire);      // E)
    return valid.load(std::memory_order_relaxed) ? res : -1;  // F)
}

The fence in step B) guarantees program order will be respected; the standard doesn't quite spell this out, but the store to valid should be propagated before any subsequent writes in the same thread.

The fence in step E) guarantees program order will be respected.

So A) inter-thread happens-before C), and D) happens-before F).

If F) perceives that valid is true, then, assuming no ABA problem, F) happens before A).

If F) happens before A), then D) happens before A).

If F) perceives that valid is false, then A) happens before F). This doesn't necessarily mean that C) happens before D), but we discard the result just in case. It really is important to discard, because the values in blob can be totally invalid. (Though using uint8_t protects against partial reads from conversion to void*. It's cache-aligned, but it's technically not word-sized, which calls the atomicity into question. IIRC it could also theoretically be vulnerable to a true partial read on some unlikely architectures.)

These deductions are based on the Linux Kernel Memory Consistency Model (LKMM), not the C++ memory model. From what I've read on cppreference.com, when they talk about propagation order in the C++ memory model, they use phrases like "all changes prior to X in thread A will be visible after Y in thread B", but usually only in the context of atomic operations. I think it's relatively low-risk to expect that the C++ compiler will emit instructions that guarantee propagation order, at least on consumer-grade CPUs. You should still check the assembly and consult the CPU documentation. You'll definitely be fine on x86 thanks to TSO, but I haven't ever looked at what propagation-order primitives are available on weakly-ordered architectures.

Lastly, if you are going to be re-using blob, there should be another atomic variable to indicate that the memset is done.

Humphrey Winnebago