In answering this question a further question about the OP's situation came up that I was unsure about: it's mostly a processor architecture question, but with a knock-on question about the C++11 memory model as well.
Basically, the OP's code was looping infinitely at higher optimization levels because of the following code (slightly modified for simplicity):
while (true) {
    uint8_t ov = bits_; // bits_ is some "uint8_t" non-local variable
    if (ov & MASK) {
        continue;
    }
    if (ov == __sync_val_compare_and_swap(&bits_, ov, ov | MASK)) {
        break;
    }
}
where __sync_val_compare_and_swap() is GCC's atomic CAS built-in. GCC (reasonably) optimized this into an infinite loop in the case that bits_ & MASK was detected to be true before entering the loop, skipping the CAS operation entirely, so I suggested the following change (which works):
while (true) {
    uint8_t ov = bits_; // bits_ is some "uint8_t" non-local variable
    if (ov & MASK) {
        __sync_synchronize();
        continue;
    }
    if (ov == __sync_val_compare_and_swap(&bits_, ov, ov | MASK)) {
        break;
    }
}
After I answered, the OP noted that changing bits_ to volatile uint8_t seems to work as well. I suggested not going that route, since volatile should not normally be used for synchronization, and there doesn't seem to be much downside to using a fence here anyway.
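For reference, here is a sketch of the volatile variant the OP described (my reconstruction, not the OP's exact code); the volatile qualifier forces the compiler to re-read bits_ on every iteration, so the loop can no longer be folded into an unconditional infinite loop:

```cpp
#include <cstdint>

constexpr uint8_t MASK = 0x01;          // assumed value for illustration

volatile uint8_t bits_ = 0;             // volatile forces a fresh read each time

void acquire_mask() {
    while (true) {
        uint8_t ov = bits_;             // re-read from memory every iteration
        if (ov & MASK) {
            continue;                   // spin until MASK appears clear
        }
        // GCC's __sync builtins accept volatile-qualified pointers
        if (ov == __sync_val_compare_and_swap(&bits_, ov, ov | MASK)) {
            break;                      // we set MASK atomically
        }
    }
}
```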
However, I thought about it more, and in this case the semantics are such that it doesn't really matter if the ov & MASK check is based on a stale value, as long as it's not based on an indefinitely stale one (i.e. as long as the loop is broken eventually), since the actual attempt to update bits_ is synchronized. So is volatile enough here to guarantee that this loop terminates eventually if bits_ is updated by another thread such that bits_ & MASK == false, on any existing processor? In other words, in the absence of an explicit memory fence, is it practically possible for reads not optimized out by the compiler to be effectively optimized out by the processor instead, indefinitely? (EDIT: To be clear, I'm asking here about what modern hardware might actually do given the assumption that reads are emitted in a loop by the compiler, so it's not technically a language question, although expressing it in terms of C++ semantics is convenient.)
That's the hardware angle to it, but to make it an answerable question about the C++11 memory model as well, consider the following variation of the code above:
// bits_ is "std::atomic<unsigned char>"
unsigned char ov = bits_.load(std::memory_order_relaxed);
while (true) {
    if (ov & MASK) {
        ov = bits_.load(std::memory_order_relaxed);
        continue;
    }
    // compare_exchange_weak also updates ov if the exchange fails
    if (bits_.compare_exchange_weak(ov, ov | MASK, std::memory_order_acq_rel)) {
        break;
    }
}
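For the loop to ever terminate, some other thread must of course clear MASK at some point; a hypothetical writer side (my assumption about the surrounding code, not taken from the OP's program) might look like:

```cpp
#include <atomic>

constexpr unsigned char MASK = 0x01;    // assumed value for illustration

std::atomic<unsigned char> bits_{MASK}; // start with MASK "held"

void release_mask() {
    // Atomically clear MASK; release ordering publishes any writes made
    // while the bit was held, pairing with the acquire side of the CAS
    bits_.fetch_and(static_cast<unsigned char>(~MASK),
                    std::memory_order_release);
}
```

The question is then whether the relaxed loads in the spin loop are guaranteed to eventually observe this store.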
cppreference claims that std::memory_order_relaxed implies "no constraints on reordering of memory accesses around the atomic variable", so independently of what actual hardware will or will not do, does this imply that bits_.load(std::memory_order_relaxed) could technically never read an updated value after bits_ is updated on another thread, in a conforming implementation?
EDIT: I found this in the standard (29.4 p13):
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
So apparently waiting "infinitely long" for an updated value is (mostly?) out of the question, but there's no hard guarantee of any specific time interval of freshness other than that it should be "reasonable"; still, the question about actual hardware behavior stands.
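For what it's worth, that "reasonable amount of time" wording is easy to exercise in practice with a minimal two-thread check (my own test harness, with a hypothetical store_becomes_visible helper, not code from the question): a spin on a relaxed load does terminate on real implementations once another thread performs the store, even though the standard gives no hard bound on when.

```cpp
#include <atomic>
#include <thread>

std::atomic<unsigned char> flag{1};

bool store_becomes_visible() {
    // Spin with relaxed loads until bit 0 reads as clear
    std::thread t([] {
        while (flag.load(std::memory_order_relaxed) & 1) { /* spin */ }
    });
    // Relaxed store from another thread; per 29.4 p13 it should become
    // visible "within a reasonable amount of time", so t's loop exits
    flag.store(0, std::memory_order_relaxed);
    t.join();                       // joins on every real implementation
    return flag.load() == 0;
}
```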