Assume we have a Container maintaining a set of int values, plus a flag for each value indicating whether the value is valid. Invalid values are considered to be INT_MAX. Initially, all values are invalid. When a value is accessed for the first time, it is set to INT_MAX and its flag is set to valid.

#include <climits>
#include <vector>

struct Container {
  // On first access, mark the value valid and initialize it to INT_MAX.
  int& operator[](int i) {
    if (!isValid[i]) {
      values[i] = INT_MAX; // (*)
      isValid[i] = true;   // (**)
    }
    return values[i];
  }

  int get(int i); // defined below, called from another thread

  std::vector<int> values;
  std::vector<bool> isValid;
};

Now, another thread reads container values concurrently:

// This member is allowed to overestimate value i, but it must not underestimate it.
int Container::get(int i) {
  return isValid[i] ? values[i] : INT_MAX;
}

This is perfectly valid code, but it is crucial that lines (*) and (**) are executed in the given order.

  1. Does the standard guarantee in this case that the lines are executed in the given order? At least from a single-threaded perspective, the lines could be interchanged, couldn't they?
  2. If not, what is the most efficient way to ensure their order? This is high-performance code, so I cannot go without -O3 and do not want to use volatile.
user1494080
  • This code contains data race, ergo UB. – chill Dec 23 '16 at 10:30
  • Where exactly? I don't see it. – user1494080 Dec 23 '16 at 10:35
  • @user1494080: It is actually undefined behavior to write to a non-`atomic` value that may be simultaneously accessed by another thread (in your case, `isValid` and `values`). If you're not familiar with how the C++ memory model works, this may be surprising and not what you expect. You need "acquire"/"release" semantics here. This is unrelated to the ordering issue between those two fields. – user541686 Dec 23 '16 at 10:37
  • @Mehrdad I wrote a comment to Dietmar Kühl's answer, but it also applies to your comment: "Strictly speaking, I think you are absolutely right. But in practice, at least on x86 architectures, reading and writing a 4-byte int value is always atomic, since a 4-byte int will never cross a 4-byte boundary. So in practice there should be no problem - in a sense that each thread will always read valid values. Am I wrong?" – user1494080 Dec 23 '16 at 10:55
  • @user1494080: Yes. This isn't solely about the architecture; it's about the compiler. Your store operations `(*)` and `(**)` have no inherent order, and the compiler is free to interleave them however it wants, as long as the resulting values are the same to the rest of the program. (There is also the issue that `vector<bool>` is bit-packed, but I'm ignoring that here.) If you want the stores to not be interleaved, then you have to store with release semantics. This will prevent the compiler from reordering the code itself. I recommend you Google `C++ atomic<> weapons` and watch the videos. – user541686 Dec 23 '16 at 12:05
  • @user1494080: the short answer is: yes, you are wrong! Have a look, e.g., at [this article](https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong). – Dietmar Kühl Dec 23 '16 at 12:15

1 Answer

There is no synchronization here. If one thread accesses these values while another thread changes them, you get undefined behavior. You either need a lock around all accesses, in which case things are fine, or you need to make the elements of your std::vectors atomic<T>, in which case you can control the visibility of the values using the appropriate memory orders.
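
A minimal sketch of the atomic variant could look roughly like this (illustrative only: it assumes a single writer thread, replaces operator[] with an explicit set(), and swaps std::vector<bool> for a vector of std::atomic<bool>, since vector<bool>'s bit-packed elements cannot be made atomic; all names are made up):

#include <atomic>
#include <climits>
#include <cstddef>
#include <vector>

struct AtomicContainer {
  explicit AtomicContainer(std::size_t n) : values(n), isValid(n) {
    for (auto& flag : isValid)
      flag.store(false, std::memory_order_relaxed); // pre-C++20, atomics are not zero-initialized
  }

  // Writer thread: publish a value. The release store to the flag guarantees that
  // the preceding store to values[i] is visible to every reader that observes
  // isValid[i] == true with an acquire load, i.e. it enforces the (*) / (**) order.
  void set(std::size_t i, int v) {
    values[i].store(v, std::memory_order_relaxed);
    isValid[i].store(true, std::memory_order_release);
  }

  // Reader threads: may still return INT_MAX while a slot is being published
  // (an overestimate), but can never observe a half-published value.
  int get(std::size_t i) const {
    if (isValid[i].load(std::memory_order_acquire))
      return values[i].load(std::memory_order_relaxed);
    return INT_MAX;
  }

  // std::atomic elements are neither copyable nor movable, so the vectors are
  // sized once in the constructor and never resized afterwards.
  std::vector<std::atomic<int>> values;
  std::vector<std::atomic<bool>> isValid;
};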

There seems to be a misunderstanding of what synchronization and, in particular, atomic operations do: their purpose is to make code fast! That may appear counter-intuitive, so here is the explanation: non-atomic operations should be as fast as possible, and there are deliberately no guarantees about exactly how they access memory. As long as the compiler and the execution system produce the correct results, they are free to do whatever they need or want to do. To achieve good performance, interactions between different threads are assumed not to exist.

In a concurrent system there are, however, interactions between threads. This is where atomic operations enter the stage: they allow you to specify exactly the synchronization that is needed. Thus, they let you tell the compiler the minimal constraints it has to obey to make the thread interaction correct. The compiler will use these indicators to generate the best possible code that achieves what is specified. That code may be identical to code not using any synchronization, although in practice it is normally also necessary to prevent the CPU from reordering operations. As a result, correct use of the synchronization facilities results in the most efficient code with only the absolutely necessary overhead.
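
To make the point about "code identical to unsynchronized code" concrete, here is a small sketch; the instructions mentioned in the comments are what GCC and Clang typically emit for x86-64, and other targets differ:

#include <atomic>

std::atomic<int> flag{0};

// On x86-64 a release store is typically emitted as a plain `mov` with no extra
// fence instruction, yet the compiler may no longer move earlier stores past it.
void publish_release() { flag.store(1, std::memory_order_release); }

// A sequentially consistent store is usually emitted as `xchg` (or `mov` plus
// `mfence`), which is noticeably more expensive: the cost tracks the guarantees
// that were actually requested.
void publish_seq_cst() { flag.store(1, std::memory_order_seq_cst); }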

The tricky part is, to some extent, finding out which synchronizations are needed and minimizing them. Simply not having any will let the compiler and the CPU reorder operations freely, and that won't work.

Since the question mentions volatile, please note that volatile is entirely unrelated to concurrency! The primary purpose of volatile is to inform the system that an address refers to memory whose accesses may have side effects; it is mainly used to make memory-mapped I/O or hardware control registers accessible. Due to the potential for side effects, it is one of the two aspects of C++ that define the observable behavior of programs (the other one being I/O using the standard library's I/O facilities).
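
For completeness, a typical (entirely hypothetical) use of volatile for memory-mapped I/O could look like the following; the register address is invented and, again, this provides no inter-thread synchronization whatsoever:

#include <cstdint>

// Hypothetical memory-mapped status register. volatile tells the compiler that
// every access has side effects and must actually be performed; it does NOT
// make the access atomic and does NOT order it with respect to other threads.
volatile std::uint32_t* const statusRegister =
    reinterpret_cast<volatile std::uint32_t*>(0x40000000u);

bool deviceReady() {
  return (*statusRegister & 0x1u) != 0; // each call performs a real load
}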

Dietmar Kühl
  • Strictly speaking, I think you are absolutely right. But in practice, at least on x86 architectures, reading and writing a 4-byte int value is always atomic, since a 4-byte int will never cross a 4-byte boundary. So in practice there should be no problem - in a sense that each thread will always read valid values. Am I wrong? – user1494080 Dec 23 '16 at 10:48
  • @user1494080: even if access to memory is atomic on a given platform, how does that guarantee visibility between different threads? I strongly recommend you read [this article](https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong). Also, if a given platform really doesn't need to do anything, surely the corresponding engineers have worked that out and made use of it in their compilers... – Dietmar Kühl Dec 23 '16 at 12:13
  • Thanks for your elaborations. – user1494080 Dec 23 '16 at 12:14
  • There is a kind of [follow-up question](http://stackoverflow.com/questions/41309299/mixing-c11-atomics-and-openmp) concerning this issue. – user1494080 Dec 24 '16 at 01:04