2

As described in other posts, without any sort of volatile or std::atomic qualification, the compiler and/or processor is at liberty to re-order the sequence of statements (e.g. assignments):

// this code
int a = 2;
int b = 3;
int c = a;
b = c;

// could be re-ordered/re-written as the following by the compiler/processor
int c = 2;
int a = c;
int b = a;

However, are conditionals and control statements (e.g. if, while, for, switch, goto) also allowed to be re-ordered, or are they considered essentially a "memory fence"?

int* a = &previously_defined_int;
int b = *a;

if (a == some_predefined_ptr)
{
   a = some_other_predefined_ptr; // is this guaranteed to occur after "int b = *a"?
}

If the above statements could be re-ordered (e.g. store a in a temporary register, update a, then populate b by de-referencing the "old" a held in the temporary register), which I guess they could be while still satisfying the same "abstract machine" behavior in a single-threaded environment, then why aren't there problems when using locks/mutexes?

bool volatile locked = false; // assume on given implementation, "locked" reads/writes occur in 1 CPU instruction
                              // volatile so that compiler doesn't optimize out

void thread1(void)
{
    while (locked) {}
    locked = true;
    // do thread1 things // couldn't these be re-ordered in front of "locked = true"?
    locked = false;
}

void thread2(void)
{
    while (locked) {}
    locked = true;
    // do thread2 things // couldn't these be re-ordered in front of "locked = true"?
    locked = false;
}

Even if std::atomic were used, non-atomic statements could still be re-ordered around atomic statements, so that wouldn't help ensure that the "critical section" statements (i.e. the "do threadX things") were contained within their intended critical section (i.e. between the locking/unlocking).


Edit: I realize the lock example doesn't actually have anything to do with the conditional/control statement question I asked. It would still be nice to have clarification on both of the questions asked, though:

  • re-ordering within and around conditional/control statements
  • are locks/mutexes of the form given above robust?
    • so far the answer from the comments is, "no, because there is a race condition between the while() check and claiming the lock", but apart from that, I am also wondering about the placement of thread function code outside of the "critical section"
curiousguy
abc
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/208614/discussion-on-question-by-abc-does-statement-re-ordering-apply-to-conditional-co). – Samuel Liew Feb 27 '20 at 05:39
  • "_Even if `std::atomic` was used_" Then pls show us how you use `std::atomic`. We can only comment on real use of a primitive, not what you may or may not write. – curiousguy Mar 02 '20 at 02:34
  • @curiousguy I see you left me quite a few comments :). I guess just to close this out, I will say that I may have had a few misconceptions when I posted this (e.g. while atomics cannot enforce a specific ordering of non-atomics, they can provide fences across which non-atomics cannot be re-ordered). However, ultimately I discovered that my (ARM Cortex M4) compiler does not support C++11 atomics, so I could either step down into assembly to implement fences, or just use the typically provided "disable/enable" interrupts macro, which are implemented with fences to prevent their own re-ordering. – abc Mar 02 '20 at 14:15
  • @abc If you do anything the compiler can't "understand", like a call to an external function (one that's separately compiled), then the compiler will never visibly reorder instructions w/ that call (only operations on purely local vars can be reordered). So if you call a function that contains the proper sync primitive, you are fine. It means that you can write `atomic` like classes w/ inline asm or other primitives. – curiousguy Mar 02 '20 at 15:21

5 Answers

4

The "as-if" rule ([intro.abstract]) is important to note here:

The semantic descriptions in this document define a parameterized nondeterministic abstract machine. This document places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.

Anything can be reordered so long as the implementation can guarantee that the observable behavior of the resulting program is unchanged.

Thread synchronization constructs in general cannot be implemented properly without fences and preventing reordering. For example, the standard guarantees that lock/unlock operations on mutexes shall behave as atomic operations. Atomics also explicitly introduce fences, especially with respect to the memory_order specified. This means that statements depending on an (unrelaxed) atomic operation cannot be reordered, otherwise the observable behavior of the program may change.

[intro.races] talks a great deal about data races and ordering.

AndyG
  • Do atomic load()/store() operations enforce memory fences themselves (as @walnut alludes to in the OP comments), or must an explicit memory fence (e.g. ```std::atomic_thread_fence```) be used? I was under the impression it was the latter, that atomic load()/store() were orthogonal to memory fences (except in the case of subsequent atomic load()/store()s). – abc Feb 26 '20 at 17:41
  • While the standard doesn't come out and say it as far as I know, de-facto there are memory fences on atomic operations because of the enforced ordering they imply. – AndyG Feb 26 '20 at 17:46
  • @abc: `.load(memory_order_relaxed)` doesn't need any barriers on any CPU, and on some ISAs even seq_cst loads don't need any barrier instructions. (e.g. on x86, plain asm loads are acquire, plain stores are release). AArch64 has special acquire and sequential-release load and store instructions so compilers can implement the default `mo_seq_cst` semantics on top of the AArch64 memory model with just order-load and ordered-store instructions (LDAR / STLR), not dedicated barriers. You definitely don't need to use `atomic_thread_fence` in the C++ source. – Peter Cordes Feb 26 '20 at 18:05
  • @AndyG I'm not sure that's correct based on this post: https://stackoverflow.com/questions/25478029/does-atomic-thread-fencememory-order-seq-cst-have-the-semantics-of-a-full-memo – abc Feb 26 '20 at 20:04
  • @abc: I don't quite follow, can you elaborate? I think I've allowed that an implementation can do what it wants, but generally there are memory fences, and the most upvoted answer on that question appears to demonstrate that all the architectures tested on did indeed use a fence. I feel like I've missed something, sorry. – AndyG Feb 26 '20 at 20:15
  • @AndyG While your answer says that memory fences are required (which I would agree with based on what I have read elsewhere), I guess I interpreted your comment to mean that an atomic load/store necessarily implies a memory fence, which I would disagree with (even in the case of the default load/store ```std::memory_order_seq_cst```). But yeah, maybe it's unclear. – abc Feb 26 '20 at 20:49
  • @abc: Sorry for the lack of clarity. In my answer, when I refer to fences, I'm referring specifically to the fences discussed by the standard in [[atomics.order]](http://eel.is/c++draft/atomics.order#4.2). These are not necessarily full memory barriers, but do influence reordering operations. – AndyG Feb 26 '20 at 20:53
2

Branches are very much the opposite of a memory fence in assembly. Speculative execution + out-of-order exec means that control dependencies are not data dependencies, so for example if(x) tmp = y; can load y without waiting for a cache miss on x. See memory model, how load acquire semantic actually works?

Of course in C++, this just means that no, if() doesn't help. The definition of the (essentially deprecated) memory_order_consume might even specify that an if condition isn't a data dependency. Real compilers promote consume to acquire because it's too hard to implement as originally specified, though.

So TL:DR: you still need atomics with mo_acquire and mo_release (or stronger) if you want to establish a happens-before between two threads. Using a relaxed variable in an if() doesn't help at all, and in fact makes reordering in practice on real CPUs easier.

And of course non-atomic variables aren't safe without synchronization. An if(data_ready.load(acquire)) is sufficient to protect access to a non-atomic variable, though. So is a mutex; mutex lock/unlock count as acquire and release operations on the mutex object, according to C++ definitions. (Many practical implementations involve full barriers, but formally C++ only guarantees acq and rel for mutexes)

Peter Cordes
  • If you had the case ```some_statement; if(x) tmp = y;```, I suppose there is no requirement that the assignment of ```tmp``` happens after ```some_statement;```? – abc Feb 26 '20 at 17:44
  • @abc: right, of course not, unless `some_statement` includes an acquire (or seq_cst) operation, or if `some_statement` writes to `y`. (It is sequenced-before so single-threaded execution of course has to respect that.) My intent was that `x` and `y` are shared variables while `tmp` is a local, so assigning to `tmp` isn't the observable part, it's the reading of `y` that's interacting with shared state. (And that can be indirectly observable based on what this thread does with tmp, e.g. eventually storing it somewhere.) – Peter Cordes Feb 26 '20 at 18:01
  • Related: [Multithreading program stuck in optimized mode but runs normally in -O0](//stackoverflow.com/q/58516052) shows how thoroughly broken code is that doesn't properly use `atomic<>` for access to shared variables. – Peter Cordes Feb 26 '20 at 18:01
  • Okay, thanks. I made another question targeted to the specific issue I'm dealing with, as I'm still not grasping all the concepts associated with this: https://stackoverflow.com/questions/60420163/how-to-guarantee-that-load-completes-before-store-occurs – abc Feb 26 '20 at 18:23
2

Conditions can also be re-ordered so long as no program that follows the rules will have its behavior affected by the re-ordering. There aren't problems with locks/mutexes because optimizations are only legal if they don't break programs that follow the rules. Programs that properly use locks and mutexes follow the rules, so the implementation has to be careful not to break them.

Your example code using while (locked) {}, where locked is volatile, either follows the platform's rules or it doesn't. There are some platforms where volatile has guaranteed semantics that make this code work, and the code would be safe on those platforms. But on platforms where volatile has no specified multi-thread semantics, all bets are off: optimizations are allowed to break the code, since it relies on behavior the platform does not guarantee.

I am also wondering about the placement of thread function code outside of the "critical section"

Check your platform's documentation. It either guarantees that operations won't be re-ordered around accesses to volatile objects or it doesn't. If it doesn't, then it is free to re-order the operations and you would be foolish to rely on it not doing so.

Be careful. There was a time when you often had pretty much no choice but to do things like this. But that was a long time ago and modern platforms provide sensible atomic operations, operations with well-defined memory visibility, and so on. There is a history of new optimizations coming out that significantly improve performance of realistic code but break code that relied on assumed, but not guaranteed, semantics. Don't add to the problem without reason.

David Schwartz
  • Okay, yes, I would rather not rely on a specific implementation of the compiler/processor, but rather use portable code that will be executed as desired regardless of the compiler/processor. – abc Feb 26 '20 at 18:24
0

The compiler is free to optimize the source code without affecting the result of the program.

The optimization can be re-ordering or removal of unnecessary statements, as long as the net result of the program is not affected.

For example, the following assignments and 'if' condition can be optimized into a single statement:

Before optimization:

int a = 0;
int b = 20;
// ...
if (a == 0) {
    b = 10;
}

After optimization (valid only if nothing in the elided code reads or writes a or b):

int b = 10;
Gopinath
0

At 41:27 of Herb Sutter's talk, he states that "opaque function calls" (I would presume functions for which the compiler can see only the declaration, but not the definition) require a full memory barrier. Therefore, while conditional/control statements are fully transparent and can be re-ordered by the compiler/processor, a workaround could be to use opaque functions (e.g. #include a library with functions that have "NOP" implementations in a separately compiled source file).

abc
  • You're mis-interpreting or mis-stating that point at least slightly. An opaque function call must be a full barrier for *compile-time* reordering of access to any objects whose address has "escaped" the current function (i.e. that could possibly be pointed-to by some global variable), in case this function reads or writes them via such a ref. Exactly like GNU C `asm("":::"memory")`. (Or the way `atomic_signal_fence()` is normally implemented.) But calling a function does *not* require inter-thread ordering, e.g. no `dmb ish` full barrier on ARM. – Peter Cordes Feb 26 '20 at 22:07
  • For your use-case on a single-core system, you can use `atomic_signal_fence()` with mo_relaxed `atomic<>` as a full barrier wrt. interrupts on this core. It's free (no asm instructions) other than blocking optimizations (which you need to block). But that's not an answer to this question because that's not the question you asked. :/ – Peter Cordes Feb 26 '20 at 22:08
  • @PeterCordes I see the distinction you're making in the first comment, good point. But in the second comment, I don't actually think an explicit fence() call is required depending on what is needed, according to the talk given by Herb Sutter (he discourages the use of explicit fences), which I referenced in an answer on another post: https://stackoverflow.com/a/60423671/7775391. I think using ```std::atomic``` loads (acquires) and stores (releases) can achieve the desired effect. – abc Feb 26 '20 at 22:43
  • Yes, you could do that, and it will compile to `ldr` + `dmb ish` instructions for 32-bit ARMv7 because it has to make code that's safe even for a multi-core system. https://godbolt.org/z/sWQVJF. But you don't need that: `memory_order_relaxed` + *signal* fences will give compile-time ordering without any runtime barriers. Since all your code runs on the same core, it will respect program order at runtime. See [MCU programming - C++ O2 optimization breaks while loop](//electronics.stackexchange.com/a/387478) – Peter Cordes Feb 26 '20 at 23:29