Using a simplified version of a basic seqlock, gcc reorders a nonatomic load up across an atomic load(memory_order_seq_cst) when compiling the code with -O3. This reordering isn't observed when compiling with other optimization levels or when compiling with clang (even at -O3). The reordering seems to violate a synchronizes-with relationship that should be established, and I'm curious to know why gcc reorders this particular load and whether this is even allowed by the standard.
Consider the following load function:
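    auto load()
    {
        std::size_t copy;
        std::size_t seq0 = 0, seq1 = 0;
        do
        {
            seq0 = seq_.load();   // first sequence read (memory_order_seq_cst by default)
            copy = value;         // nonatomic read of the protected data
            seq1 = seq_.load();   // second sequence read
        } while( seq0 & 1 || seq0 != seq1);   // retry if a write was in progress or intervened
        std::cout << "Observed: " << seq0 << '\n';
        return copy;
    }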
Following seqlock procedure, this reader spins until it is able to load two instances of seq_, which is defined to be a std::atomic<std::size_t>, that are even (to indicate that a writer is not currently writing) and equal (to indicate that a writer has not written to value in between the two loads of seq_). Furthermore, because these loads are tagged with memory_order_seq_cst (as a default argument), I would expect the statement copy = value; to be executed on each iteration, as it cannot be reordered up across the initial load, nor can it be reordered down below the latter.
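For context, here is a minimal sketch of how the pieces described above might fit together. Only seq_, value and the shape of load() come from the question. The SeqlockSketch wrapper and the store() writer are assumptions added for illustration, and the reader omits the logging. The usage described afterwards is single-threaded and only demonstrates the sequence numbering. It is not meant to settle whether the concurrent case is well defined.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical wrapper: only seq_, value and load() mirror the question.
struct SeqlockSketch
{
    std::atomic<std::size_t> seq_{0};
    std::size_t value{0};   // plain, nonatomic payload, as in the question

    // Assumed single-writer side; the question does not show it.
    void store(std::size_t v)
    {
        std::size_t seq0 = seq_.load(std::memory_order_relaxed);
        assert((seq0 & 1) == 0);                           // no write already in progress
        seq_.store(seq0 + 1, std::memory_order_relaxed);   // odd: write in progress
        // Intended to keep the payload store below the odd marker.
        std::atomic_thread_fence(std::memory_order_release);
        value = v;
        seq_.store(seq0 + 2, std::memory_order_release);   // even: publish
    }

    // Reader with the same loop as the question's load(), minus the logging.
    std::size_t load() const
    {
        std::size_t copy;
        std::size_t seq0 = 0, seq1 = 0;
        do
        {
            seq0 = seq_.load();   // memory_order_seq_cst by default
            copy = value;
            seq1 = seq_.load();
        } while ((seq0 & 1) || seq0 != seq1);
        return copy;
    }
};
```

With this sketch, a single call to store() moves seq_ from 0 to 1 and then to 2, which matches the seq_.store(2) publish step mentioned later in the question. A subsequent load() on the same thread sees an even, unchanged sequence number and returns the stored value.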
However, the generated assembly issues the load from value before the first load from seq_, and even hoists it out of the loop. This could lead to improper synchronization or torn reads of value that do not get resolved by the seqlock algorithm. Additionally, I've noticed that this only occurs when sizeof(value) is below 123 bytes. Changing value to a type of 123 bytes or more yields the expected assembly, in which value is loaded on each loop iteration between the two loads of seq_. Is there any reason why this seemingly arbitrary threshold dictates which assembly is generated?
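One way to compare the two cases side by side is to template the reader on the payload type and instantiate it just below and at the reported threshold. This is only a sketch: Payload and SizedSeqlock are hypothetical names, not part of the original code, and the sizes 122 and 123 are simply the ones reported above.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical payload whose size is exactly N bytes.
template <std::size_t N>
struct Payload
{
    unsigned char bytes[N];
};

static_assert(sizeof(Payload<122>) == 122, "just below the reported threshold");
static_assert(sizeof(Payload<123>) == 123, "at the reported threshold");

// Hypothetical reader with the same loop as load(), templated on the payload type.
template <typename T>
struct SizedSeqlock
{
    std::atomic<std::size_t> seq_{0};
    T value{};   // plain, nonatomic payload

    T load() const
    {
        T copy;
        std::size_t seq0 = 0, seq1 = 0;
        do
        {
            seq0 = seq_.load();   // memory_order_seq_cst by default
            copy = value;
            seq1 = seq_.load();
        } while ((seq0 & 1) || seq0 != seq1);
        return copy;
    }
};

// Explicit instantiations so both variants of load() are emitted and can be compared.
template struct SizedSeqlock<Payload<122>>;
template struct SizedSeqlock<Payload<123>>;
```

Compiling this with -O3 and comparing the two instantiations of load() in the assembly output should show whether the copy from value stays between the two loads of seq_ in each case.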
This test harness exposes the behavior on my Xeon E3-1505M, in which "Observed: 2" will be printed from the reader and the value 65535 will be returned. This combination of the observed value of seq_ and the value returned from the load of value seems to violate the synchronizes-with relationship that should be established by the writer thread publishing seq_.store(2) with memory_order_release and the reader thread reading seq_ with memory_order_seq_cst.
Is it valid for gcc to reorder the load, and if so, why does it only do so when sizeof(value) is < 123? clang will not reorder the load, no matter the optimization level or sizeof(value). Clang's codegen, I believe, is the appropriate and correct approach.