Why is "memory_order_relaxed" treated as "memory_order_seq_cst" on my system? [C++]

Question

My Code :

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x(22), y(22);
int temp_x = -1, temp_y = -1;  // written by one thread each, read after join()

void task_0(){
      x.store(33, std::memory_order_relaxed);
      temp_y = y.load(std::memory_order_relaxed);
}

void task_1(){
      y.store(33, std::memory_order_relaxed);
      temp_x = x.load(std::memory_order_relaxed);
}

int main(){
      std::thread t1(task_0);
      std::thread t2(task_1);

      t1.join();
      t2.join();

      std::cout<<temp_x<<" : "<<temp_y<<"\n";

      return 0;
}

Since I use "memory_order_relaxed", I expected that after running the program 100 times at least one run would print "22 : 22", but my program only gives:

Output :

  "33 : 33"
  "22 : 33"
  "33 : 22"

It never gives the "22 : 22" output.

I tested this program on my 64-bit 2.9 GHz quad-core Intel Core i7. What's wrong with my program? Is there something I need to understand?

The language specification describes the minimum requirements, but implementations are permitted to do more. Intel x86, in particular, has [rather strong ordering behavior](https://stackoverflow.com/a/11836383/902497). — Raymond Chen, Aug 13 '20 at 05:24

score 1 · Answer 1 · answered Aug 13 '20 at 13:39

Just because the standard says that a particular eventuality is possible does not mean that what causes it to happen is governed by random numbers. On real machines, the result of unspecified behavior is governed by the execution of opcodes, caches, and so forth on those actual machines.

So while a result is theoretically possible, that doesn't mean it will definitely happen. In your particular case, to get 22 from both, the compiler (or CPU) would basically have to re-order at least one of the two functions. If there's nothing to gain from such reordering, then it probably won't happen.

The compiler has nothing to gain from reordering, but store buffering is the most basic CPU optimization. Even x86 does it all the time. — Humphrey Winnebago, Aug 13 '20 at 20:14

score 0 · Answer 2 · answered Aug 13 '20 at 20:10

The reordering is possible. Your experiment is just a little loosey goosey.

The reordering ("22 : 22") IS allowed on x86. x86 allows store-load reordering, i.e. within a thread, a load can complete before a previous store to a different variable.

Be sure to compile with optimizations on.

Examine the generated code to make sure it is what you think it is. The compiler IS allowed to reorder relaxed operations, but might not. Note that even on x86 a store needs something like an xchg (which is implicitly locked) or a mov followed by mfence to be SC, so if you don't see that, it is NOT memory_order_seq_cst. (But even if you did see that, it would be allowed, since the compiler may implement a memory order MORE strictly than is required.)
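As a small aid for that inspection, here is a minimal sketch with one relaxed store and one seq_cst store in separate functions, so the two can be compared side by side in the assembly. The function names are made up for illustration, and the instructions mentioned in the comments are what mainstream compilers typically emit for x86-64, not something the standard guarantees.

```cpp
#include <atomic>

std::atomic<int> x{22};

// Compile with optimizations and look at the assembly, for example:
//   g++ -O2 -S file.cpp
// On x86-64 this is typically a plain `mov` to memory.
void store_relaxed(int v) { x.store(v, std::memory_order_relaxed); }

// On x86-64 this is typically an `xchg`, or a `mov` followed by `mfence`.
void store_seq_cst(int v) { x.store(v, std::memory_order_seq_cst); }
```

If both functions compile to the same plain store, the relaxed version is not being treated as seq_cst by the compiler; any ordering you observe then comes from the hardware and from timing.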

Your experiment setup has a few confounding issues.

To see the reordering, the x.store and y.store have to happen at almost exactly the same time, to within tens of nanoseconds. So you'll need a way to sync these up OR change your experiment to increase the number of opportunities for reordering.
The cost to start a thread is extremely high compared to a store/load. It's likely that one thread completes before the other starts. (I'm actually surprised that you don't always see "22 : 33").
To see reordering, the commands need to happen on different cores. Starting 2 threads does not guarantee that they run on different cores. They could both run on the same core in sequence. It depends on how the OS schedules it. You need to find a way to set the CPU affinity for the threads.

An additional possible factor is that you might not see the reordering if the threads are running on different logical cores of the same physical core. With Hyper-Threading enabled, each physical core of your quad-core i7 shows up as 2 logical cores. Intel does not SAY that reordering is impossible between logical cores on the same physical core, but if you think about it, it's less likely to happen (the window of opportunity is smaller) since the store doesn't have to go through the bus to be seen by the neighbor core. So to control for that possibility, I would pin the two threads to logical cores on different physical cores (for example 0 and 2, if sibling logical cores are numbered adjacently on your system).
If the global variable is hot in-cache, the store happens almost instantly. You have to think about what's going on with the cache coherency protocol and set up your experiment accordingly.
You may have false sharing with your atomic variables. They may be on the same cache line. It's cache lines that are sent on the bus, gotten in exclusive mode, etc. So put some padding between them to make sure they are on different cache lines.
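Several of the points above can be combined into one experiment. The following is only a sketch under stated assumptions: `run_trials`, `PaddedAtomicInt`, and `OutcomeCounts` are hypothetical names, the 64-byte alignment assumes a typical x86 cache line, and thread pinning is left out because the affinity API is OS-specific. Both threads are started once, released together by a spin barrier in every round, and the outcomes are counted over many rounds.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Assumption: 64-byte cache lines (typical on x86, not guaranteed).
struct alignas(64) PaddedAtomicInt {
    std::atomic<int> value{22};
};

// Counts of each "temp_x : temp_y" outcome, indexed as
// 0 = "22 : 22", 1 = "22 : 33", 2 = "33 : 22", 3 = "33 : 33".
using OutcomeCounts = std::array<long, 4>;

// Hypothetical helper: repeats the question's store/load test `rounds` times.
OutcomeCounts run_trials(int rounds) {
    PaddedAtomicInt x, y;                  // separate cache lines, no false sharing
    std::atomic<std::int64_t> arrived{0};  // ever-increasing arrival counter
    int temp_x = -1, temp_y = -1;          // plain ints, published via the barrier
    OutcomeCounts counts{};

    // Spin barrier: returns once both threads have arrived `step` times.
    auto sync = [&arrived](std::int64_t step) {
        arrived.fetch_add(1);
        int spins = 0;
        while (arrived.load() < 2 * step) {
            if (++spins > 10000) {         // fall back to yielding if oversubscribed
                std::this_thread::yield();
                spins = 10000;
            }
        }
    };

    std::thread t0([&] {
        for (int i = 0; i < rounds; ++i) {
            sync(2LL * i + 1);             // start the round together
            x.value.store(33, std::memory_order_relaxed);
            temp_y = y.value.load(std::memory_order_relaxed);
            sync(2LL * i + 2);             // both sides have finished
            // Only this thread records and resets; the other one is
            // already waiting at the next round's first barrier.
            ++counts[(temp_x == 33 ? 2 : 0) + (temp_y == 33 ? 1 : 0)];
            x.value.store(22, std::memory_order_relaxed);
            y.value.store(22, std::memory_order_relaxed);
        }
    });
    std::thread t1([&] {
        for (int i = 0; i < rounds; ++i) {
            sync(2LL * i + 1);
            y.value.store(33, std::memory_order_relaxed);
            temp_x = x.value.load(std::memory_order_relaxed);
            sync(2LL * i + 2);
        }
    });
    t0.join();
    t1.join();

    assert(counts[0] + counts[1] + counts[2] + counts[3] == rounds);
    return counts;
}
```

Calling `run_trials` with a large round count from `main` and printing `counts[0]` shows how often "22 : 22" occurred. How often that is depends on the hardware, the scheduler, and whether the two threads end up on different physical cores, so a count of zero on one machine does not prove the reordering is forbidden.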
