The reordering is possible. Your experiment is just a little loosey goosey.
The reordering ("22 : 22") IS allowed on x86. x86 allows store-load reordering, i.e. within a thread, a load can complete before an earlier store to a different variable becomes visible to other cores.
Be sure to compile with optimizations on.
Examine the generated code to make sure it is what you think it is. The compiler IS allowed to reorder memory_order_relaxed operations, but might not. Note that on x86 even a store needs an xchg (which is implicitly locked) or a mov followed by mfence to be SC, so if you don't see one of those, it is NOT memory_order_seq_cst. (But even if you did see that, it would be allowed, since the compiler is theoretically allowed to implement a memory order with a MORE strict implementation than is required.)
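One way to do that check is to isolate each kind of store in its own tiny function and read the assembly (for example with g++ -O2 -S). A minimal sketch; the names flag, store_relaxed and store_seq_cst are made up for illustration, and the instructions mentioned in the comments are what mainstream x86-64 compilers typically emit, not something the standard guarantees:

```cpp
#include <atomic>

// Hypothetical stand-in for one of the atomics in the experiment.
std::atomic<int> flag{0};

// On x86-64 this is typically compiled to a plain mov.
void store_relaxed(int v) {
    flag.store(v, std::memory_order_relaxed);
}

// On x86-64 this is typically compiled to an xchg, or to mov + mfence.
void store_seq_cst(int v) {
    flag.store(v, std::memory_order_seq_cst);
}
```

If both functions come out as the same plain mov, the build is not doing what you think it is.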
Your experiment setup has a few confounding issues.
To see the reordering, the x.store and y.store have to happen at
almost exactly the same time, down to tens of nanoseconds. So you'll
need a way to sync these up OR change your experiment to increase
the number of opportunities for reordering.
The cost to start a thread is extremely high compared to a
store/load. It's likely that one thread completes before the other
starts. (I'm actually surprised that you don't always see "22 :
33").
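Both of these points can be addressed by starting the two threads once and then running many short rounds, with both threads spinning on a shared counter so they begin each round at nearly the same moment. A rough sketch under my own assumptions: the names (run_litmus, Counts, round, done) are invented, the values 0 and 1 stand in for whatever your code stores, and the 64-byte alignment assumes a typical x86 cache line:

```cpp
#include <atomic>
#include <functional>
#include <thread>

struct Counts {
    long both_old = 0;  // rounds in which BOTH loads saw the old value
    long total = 0;     // rounds executed
};

Counts run_litmus(long iterations) {
    // Keep each shared variable on its own (assumed 64-byte) cache line.
    alignas(64) std::atomic<int> x{0};
    alignas(64) std::atomic<int> y{0};
    alignas(64) std::atomic<long> round{0};  // round number published by main
    alignas(64) std::atomic<int> done{0};    // workers finished this round
    alignas(64) int r1 = 0;
    alignas(64) int r2 = 0;

    auto worker = [&](std::atomic<int>& mine, std::atomic<int>& other, int& result) {
        for (long i = 1; i <= iterations; ++i) {
            // Spin (no sleeping) so both workers start the round together.
            while (round.load(std::memory_order_acquire) != i) {}
            mine.store(1, std::memory_order_relaxed);
            result = other.load(std::memory_order_relaxed);
            done.fetch_add(1, std::memory_order_release);
        }
    };
    std::thread t1(worker, std::ref(x), std::ref(y), std::ref(r1));
    std::thread t2(worker, std::ref(y), std::ref(x), std::ref(r2));

    Counts c;
    for (long i = 1; i <= iterations; ++i) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);
        done.store(0, std::memory_order_relaxed);
        round.store(i, std::memory_order_release);  // release both workers
        while (done.load(std::memory_order_acquire) != 2) std::this_thread::yield();
        if (r1 == 0 && r2 == 0) ++c.both_old;  // the store-load reordering outcome
        ++c.total;
    }
    t1.join();
    t2.join();
    return c;
}
```

Call it with a large iteration count, e.g. run_litmus(1000000), and look at both_old. How often the reordered outcome shows up (if at all) depends on the machine and on where the scheduler puts the threads, so treat a zero as "not observed", not as "impossible".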
To see reordering, the commands need to happen on different cores.
Starting 2 threads does not guarantee that they run on
different cores. They could both run on the same core in sequence. It
depends on how the OS schedules it. You need to find a way to set
the CPU affinity for the threads.
An additional possible factor is that you might not see the
reordering if the threads are running on different logical cores on
the same physical core. You have an Intel quad-core, so there are
only 2 physical cores with 2 logical cores each. Intel does not SAY
that reordering is not possible between logical cores on the same
physical core, but if you think about it, it's less likely
to happen (the window of opportunity is smaller) since the
store doesn't have to go through the bus to be seen by the neighbor core. So to control
for that possibility, I would set the core affinity for the two
threads to 0 and 2 respectively (assuming those two logical CPUs are on
different physical cores on your machine).
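How you set the affinity depends on the OS. As a sketch, assuming Linux with glibc (pthread_setaffinity_np is a non-portable GNU extension, and pin_current_thread is just a name I made up), each thread can pin itself as the first thing it does:

```cpp
#include <pthread.h>
#include <sched.h>

// Pins the calling thread to a single logical CPU.
// Returns true on success, false if the kernel rejected the request
// (for example because that CPU is not available to this process).
bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

One thread would call pin_current_thread(0) and the other pin_current_thread(2) before touching the atomics. Which logical CPU numbers share a physical core varies between machines; on Linux you can check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list (and likewise for the other CPUs) before picking the two numbers.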
If the global variable is hot in-cache, the store happens almost
instantly. You have to think about what's going on with the cache
coherency protocol and set up your experiment accordingly.
You may have false sharing with your atomic variables. They may be
on the same cache line. It's cache lines that are sent on the bus,
gotten in exclusive mode, etc. So put some padding between them to
make sure they are on different cache lines.
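A simple way to get that padding is to wrap each atomic in a struct aligned to the cache line size. This is only a sketch: 64 bytes is an assumption that holds for typical x86 parts, and PaddedAtomicInt, x_padded, y_padded and same_cache_line are illustrative names, not anything from your code:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is typical for x86.
constexpr std::size_t kCacheLine = 64;

// alignas rounds both the alignment and the size of the struct up to a
// full cache line, so two of these can never share a line.
struct alignas(kCacheLine) PaddedAtomicInt {
    std::atomic<int> value{0};
};
static_assert(sizeof(PaddedAtomicInt) == kCacheLine, "one object per cache line");

PaddedAtomicInt x_padded;
PaddedAtomicInt y_padded;

// Helper to double-check the layout at run time.
bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

The threads then use x_padded.value and y_padded.value in place of the bare atomics. If your compiler supports it, std::hardware_destructive_interference_size from the C++17 header <new> can replace the hard-coded 64.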