28

In answering this question, a further question about the OP's situation came up that I was unsure about: it's mostly a processor-architecture question, but with a knock-on question about the C++11 memory model as well.

Basically, the OP's code was looping infinitely at higher optimization levels because of the following code (slightly modified for simplicity):

while (true) {
    uint8_t ov = bits_; // bits_ is some "uint8_t" non-local variable
    if (ov & MASK) {
        continue;
    }
    if (ov == __sync_val_compare_and_swap(&bits_, ov, ov | MASK)) {
        break;
    }
}

where __sync_val_compare_and_swap() is GCC's atomic CAS built-in. GCC (reasonably) optimized this into an infinite loop in the case that bits_ & MASK was detected to be true before entering the loop, skipping the CAS operation entirely.
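
Roughly speaking, since bits_ is a plain (non-volatile, non-atomic) uint8_t, the compiler may assume no other thread modifies it, hoist the load out of the loop, and turn the continue path into an unconditional spin. The effective result is something like the following (a sketch of the effective behavior, not actual GCC output):

uint8_t ov = bits_; // single load, hoisted out of the loop
if (ov & MASK) {
    for (;;) { } // ov can never change here, so this path spins forever
}
// ...otherwise fall through to the CAS as before...

So I suggested the following change (which works):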

while (true) {
    uint8_t ov = bits_; // bits_ is some "uint8_t" non-local variable
    if (ov & MASK) {
        __sync_synchronize();
        continue;
    }
    if (ov == __sync_val_compare_and_swap(&bits_, ov, ov | MASK)) {
        break;
    }
}

After I answered, OP noted that changing bits_ to volatile uint8_t seems to work as well. I suggested not to go that route, since volatile should not normally be used for synchronization, and there doesn't seem to be much downside to using a fence here anyway.
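
For reference, the volatile variant the OP had in mind would presumably look something like this (my reconstruction, not the OP's exact code; volatile only forces the compiler to re-read bits_ on each iteration and adds no hardware ordering):

while (true) {
    uint8_t ov = bits_; // bits_ is now some "volatile uint8_t" non-local variable
    if (ov & MASK) {
        continue; // no fence: rely solely on the forced re-read
    }
    if (ov == __sync_val_compare_and_swap(&bits_, ov, ov | MASK)) {
        break;
    }
}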

However, I thought about it more, and in this case the semantics are such that it doesn't really matter if the ov & MASK check is based on a stale value, as long as it's not based on an indefinitely stale one (i.e. as long as the loop is broken eventually), since the actual attempt to update bits_ is synchronized. So is volatile enough here to guarantee that this loop terminates eventually if bits_ is updated by another thread such that bits_ & MASK == false, for any existent processor? In other words, in the absence of an explicit memory fence, is it practically possible for reads not optimized out by the compiler to be effectively optimized out by the processor instead, indefinitely? (EDIT: To be clear, I'm asking here about what modern hardware might actually do given the assumption that reads are emitted in a loop by the compiler, so it's not technically a language question although expressing it in terms of C++ semantics is convenient.)

That's the hardware angle to it, but to update it slightly and make it an answerable question about the C++11 memory model as well, consider the following variation on the code above:

// bits_ is "std::atomic<unsigned char>"
unsigned char ov = bits_.load(std::memory_order_relaxed);
while (true) {
    if (ov & MASK) {
        ov = bits_.load(std::memory_order_relaxed);
        continue;
    }
    // compare_exchange_weak also updates ov if the exchange fails
    if (bits_.compare_exchange_weak(ov, ov | MASK, std::memory_order_acq_rel)) {
        break;
    }
}

cppreference claims that std::memory_order_relaxed implies "no constraints on reordering of memory accesses around the atomic variable", so, independent of what actual hardware will or will not do, does that imply that bits_.load(std::memory_order_relaxed) could technically never read an updated value after bits_ is updated on another thread in a conforming implementation?

EDIT: I found this in the standard (29.4 p13):

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

So apparently waiting "infinitely long" for an updated value is (mostly?) out of the question, but there's no hard guarantee of any specific time interval of freshness other than that it should be "reasonable"; still, the question about actual hardware behavior stands.

Stephen Lin
  • Reading your comment below, your question is whether hardware would optimize away instructions on `volatile` variables? – didierc Mar 17 '13 at 23:30
  • @didierc yeah, basically, if it has a cached value and doesn't detect any non-fenced writes to the value in the same thread, will any real hardware just skip further reads? it's not *a priori* impossible or stupid; it's just as reasonable for hardware to optimize these things as for the compiler to, for the same reasons – Stephen Lin Mar 17 '13 at 23:31
  • `volatile` is just a shorthand here for "actually existing in the assembly output" – Stephen Lin Mar 17 '13 at 23:31
  • 1
    if the cached value is in a register, I suppose this might happen. If it's in cache, there are protocols to synchronize caches of multicore CPU to avoid that sort of problem I think. I found [this article](http://www.windowsnetworking.com/articles-tutorials/common/Cache-Coherency.html) which introduces the topic. – didierc Mar 17 '13 at 23:53
  • @didierc well, it's only a "problem" if it's defined to be one...technically, there's no fence here, so it's arguably correct for the processor to elide further reads if it doesn't affect single-threaded output...processors could be just as aggressive about these things as compilers if they wanted to be and documented their behavior (but maybe they aren't because it would lead to complaints from people that are careless about fencing). – Stephen Lin Mar 17 '13 at 23:58
  • @didierc thanks for the link though – Stephen Lin Mar 18 '13 at 00:00

4 Answers

9

C++11 atomics deal with three issues:

  1. ensuring that a complete value is read or written without a thread switch; this prevents tearing.

  2. ensuring that the compiler does not re-order instructions within a thread across an atomic read or write; this ensures ordering within the thread.

  3. ensuring (for appropriate choices of memory order parameters) that data written within a thread prior to an atomic write will be seen by a thread that reads the atomic variable and sees the value that was written. This is visibility.

When you use memory_order_relaxed you don't get a guarantee of visibility from the relaxed store or load. You do get the first two guarantees.
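
To illustrate the third guarantee (a minimal sketch, not from the original answer): with release/acquire ordering, a thread that observes the stored flag is also guaranteed to see the data written before the store; with memory_order_relaxed on both sides, that guarantee disappears.

#include <atomic>
#include <cassert>

int payload = 0;                 // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // write the data first
    ready.store(true, std::memory_order_release);  // then publish the flag
}

void consumer() {
    if (ready.load(std::memory_order_acquire)) {   // saw the published flag
        assert(payload == 42);  // the payload write is guaranteed to be visible
    }
}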

Implementations "should" (i.e. are encouraged to) make memory writes visible within a reasonable amount of time, even with relaxed ordering. That's about the best that can be said; sooner or later these things should show up.

So, yes, formally, an implementation that never made relaxed writes visible to relaxed reads conforms to the language definition. In practice, this won't happen.

As to what volatile does, ask your compiler vendor. It's up to the implementation.

Pete Becker
  • Thanks, this is good information, but the more pertinent question is actually whether real hardware is capable of, and actually does, an optimization that will elide reads in a loop completely (i.e. will a processor detect an infinite loop based on a cached value and just loop forever without doing any more reads?), assuming that the reads are actually emitted by the compiler (which `volatile` usually ensures). It's sort of more of a "does hardware actually do this in practice, given a particular assembly output" question (I was thinking some kind of processor-level microcode optimizer might?). – Stephen Lin Mar 17 '13 at 23:08
  • The C++11 tag is possibly misleading since it only applies to the second part of the question. – Stephen Lin Mar 17 '13 at 23:10
  • The last line should be expanded as an answer to [SO questions](http://stackoverflow.com/questions/15466978/what-is-the-difference-between-sequential-consistency-and-atomicity) asking [about](http://stackoverflow.com/questions/3604569/what-kinds-of-optimizations-does-volatile-prevent-in-c) `volatile`. – didierc Mar 17 '13 at 23:25
  • `memory_order_relaxed` is highly specialized. If you don't know what you're doing, don't use it. And if you do know what you're doing, you probably shouldn't use it anyway. – Pete Becker Mar 17 '13 at 23:27
  • @PeteBecker I know :D I suggested a full memory barrier! I'm just curious if there's any chance it'll break without one, in real life. – Stephen Lin Mar 17 '13 at 23:28
  • Yeah. There are good tools that guarantee correct behavior, but for some reason folks want to take shortcuts. – Pete Becker Mar 18 '13 at 12:23
4

It is technically legal for std::memory_order_relaxed loads to never, ever return a new value for the load. As for whether any implementation will do this, I have no clue.

Reference: http://www.developerfusion.com/article/138018/memory-ordering-for-atomic-operations-in-c0x/ "The only requirement is that accesses to a single atomic variable from the same thread can’t be reordered: once a given thread has seen a particular value of an atomic variable, a subsequent read by that thread can’t retrieve an earlier value of the variable."
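
To illustrate the quoted rule (a sketch, not from the original answer): read-read coherence only constrains successive reads of the same atomic within one thread; it says nothing about how quickly another thread's store becomes visible.

#include <atomic>

std::atomic<int> x{0}; // assume some other thread eventually stores 1

void reader() {
    int a = x.load(std::memory_order_relaxed);
    int b = x.load(std::memory_order_relaxed);
    // If a == 1 then b must also be 1: the second read cannot go back to
    // the older value 0. Nothing, however, forces a or b to ever observe 1.
}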

Patashu
  • +1, but really hoping someone with hardware knowledge can answer about what modern-day processors might actually do with unfenced reads. – Stephen Lin Mar 17 '13 at 23:14
  • @Stephen Lin I'm quite curious too! I don't know much about hardware. – Patashu Mar 17 '13 at 23:15
  • I wonder how hard it would be to have a kind of memory-ordering semantic which would require that for some specified N at least one access out of N must have an opportunity to see something happen? A compiler which 12x-unrolls a loop which will run a million iterations might, if N was 10,000 bump a counter every unrolled repetition (12x of the original loop) and then perform a "real" check every 832 times it bumps the counter. That could be much faster than having to insert a memory fence every time through the original code, and more convenient than having to manually unroll the loop... – supercat Jul 11 '16 at 21:39
  • ...and add logic to invoke a fence every 832 repetitions. – supercat Jul 11 '16 at 21:39
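
A minimal sketch of the manual version of that idea (not from the comments; the function name and the constant 832 are purely illustrative): spin on cheap relaxed loads, but issue a full fence periodically so an update cannot be deferred indefinitely.

#include <atomic>
#include <cstdint>

void wait_until_clear(std::atomic<std::uint8_t>& bits, std::uint8_t mask) {
    for (std::uint32_t i = 1; ; ++i) {
        if (!(bits.load(std::memory_order_relaxed) & mask))
            return;                       // another thread cleared the mask
        if (i % 832 == 0)                 // periodic full fence, per the comment
            std::atomic_thread_fence(std::memory_order_seq_cst);
    }
}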
4

If a processor doesn't have a cache-coherence protocol, or has only a very simple one, then it can 'optimize' loads by fetching stale data from its cache. Most modern multi-core CPUs implement a cache-coherency protocol; however, ARM before the A9 did not. Non-CPU architectures also might not have cache coherency (although they arguably don't adhere to the C++ memory model anyway).

The other issue is that many architectures (including ARM and x86) allow reordering of memory accesses. I don't know whether processors are smart enough to notice repeated accesses to the same address, but I doubt it (it would cost space and time for a rare case that the compiler should be able to notice anyway, with little benefit, since later accesses will likely be L1 hits). However, a processor can technically speculate that a branch will be taken and reorder the second access before the first one (unlikely, but if I read the Intel and ARM manuals correctly this is allowed).

Finally, there are external devices which do not adhere to cache coherency. If the CPU communicates via memory-mapped I/O or DMA, the page must be marked as non-cacheable (otherwise the L1/L2/L3/... caches would hold stale data). In such cases the processor will usually not reorder reads and writes (for details consult your processor manual - it might have more fine-grained control), but the compiler can, so you need to use volatile. However, since atomics are usually cache-based, you don't need them - or can't use them - there.
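
For example, a memory-mapped device status register would typically be read through a volatile pointer to a non-cacheable mapping (a sketch; the address and register name are made up for illustration):

#include <cstdint>

// Hypothetical device register at a made-up address, assumed to be
// mapped non-cacheable by the platform.
volatile std::uint32_t* const status_reg =
    reinterpret_cast<volatile std::uint32_t*>(0xFFFF0000u);

std::uint32_t read_status() {
    return *status_reg; // volatile forces the compiler to emit a real load
}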

I am afraid that I cannot answer whether such strong cache coherency will be available in future processors. I would suggest strictly following the specification ("What's wrong with storing a pointer in an int? Surely no one will ever use more than 4 GiB, so a 32-bit address is big enough."). The correctness question was answered by others, so I won't repeat it.

Maciej Piechotka
  • thanks, this is probably the most complete answer so far...I guess it boils down to cache coherency, branch speculation, and reordering across branches...agreed that the "right" thing to do is not to assume anything that isn't guaranteed though... – Stephen Lin Mar 18 '13 at 01:02
  • to be clear, "they are convenient but they don't scale well with number of cores", you basically mean that future mainstream architectures could arguably provide less guarantees about cache coherency than current ones do now, because of the costs of doing so as the number of cores increases, right? just making sure I understand. – Stephen Lin Mar 18 '13 at 01:06
  • @StephenLin: I have literally no idea but this is one possibility. Of course such architecture might break part of existing code and not be viable for CPU. There is also research on cache-coherent NUMA so the new cache-coherent protocol might be more scalable (I would need to get back to notes). This is one possibility of course. – Maciej Piechotka Mar 18 '13 at 01:10
  • By NUMA I mean in practice supercomputer which consist of separate boards with CPU+Memory+GPU+NIC+ but with programs using flat memory space. – Maciej Piechotka Mar 18 '13 at 01:12
  • @MaciejPiechotoka in terms of breaking code, wouldn't reducing cache coherency only break code that wasn't using fence instructions correctly (and therefore was arguably buggy, like the example given in my question) anyway? seems like a reasonable price for progress, if it helps make correct code faster – Stephen Lin Mar 18 '13 at 01:15
  • @StephenLin: No. Imagine a structure with 2 ints. Since a cache line is usually much longer than the word size, they would be falsely shared. I.e. without cache coherency, two threads writing to such a structure would compete and one would win (write its own version of the whole cache line). [GPUs are cache-coherent for this reason (I fixed the mistake in the text)]. I would really need to revise the notes before answering further, but in most cases the cache line is in exclusive mode (the core 'owns' it, so it does not need to send a request), hence it is possible that eliminating it would not be a huge benefit. – Maciej Piechotka Mar 18 '13 at 01:26
  • @MaciejPiechotka btw, is "stall data" a technical term or do you mean "stale data"? I can't find the former by googling. – Stephen Lin Mar 18 '13 at 04:47
  • @StephenLin: I mispelled stale data. – Maciej Piechotka Mar 18 '13 at 08:47
1

Here's my take on it, though I don't have much knowledge on the topic, so take it with a grain of salt.

The effect of the volatile keyword may well be compiler dependent, but I will assume that it actually does what we intuitively expect from it, namely avoiding aliasing or any other optimization which would not let a user inspect the value of the variable in a debugger at any point of execution during the life of that variable. That's pretty close to (and probably the same as) that answer on the meaning of volatile.

The direct implication is that any code block accessing the volatile variable v will have to commit it to memory as soon as it has modified it. Fences will make that happen in order with other updates, but either way, there will be a store to v in the assembly output whenever v is modified at the source level.

Indeed, the question you ask is: if v, already loaded into a register, has not been modified by some computation, what forces the CPU to read v from memory again, as opposed to simply reusing the value it already has?

I think the answer is that the CPU cannot assume that a memory cell hasn't changed since its last read. Memory access, even on a single-core system, isn't strictly reserved to the CPU. Many other subsystems can access it read-write (that's the principle behind DMA).

The safest optimization a CPU can probably do is to check whether the value changed in cache or not, and use that as a hint of the state of v in memory. Caches should be kept in sync with memory thanks to the cache-invalidation mechanisms that go with DMA. With that condition, the problem reverts to cache coherency on multicore, and to "write after write" in multithreading situations. That last problem cannot be handled effectively with simple volatile variables, since their modification is not atomic, as you already know.

didierc
  • thanks, didn't really consider the DMA issue...I'm going to assume that you're right and that real hardware won't attempt anything like what the compiler does in this case. – Stephen Lin Mar 18 '13 at 01:03
  • (obviously, writing code like this still should be avoided though :D...it was not my suggestion to use `volatile` after all!) – Stephen Lin Mar 18 '13 at 01:04
  • 1
    Accessing memory is usually very expensive, and the CPU assumes that it owns memory (otherwise there would be no point in a cache) - the exception is the multi-CPU system, where they assume they co-own the memory. In fact many use a Harvard architecture internally (which assumes that code is read-only). The pages for DMA and I/O with devices are usually explicitly marked as such to prevent stale data (i.e. they are not cached). – Maciej Piechotka Mar 18 '13 at 01:06
  • After re-reading - technically DMA can be plugged into the cache-coherence protocol, but I haven't heard about it (edit - Wikipedia mentions it as one option). For non-SoC systems it would be rather expensive, as the communication would involve going off-chip, making cache invalidation much more expensive (in terms of time). – Maciej Piechotka Mar 18 '13 at 01:18
  • In a situation with only one producer, a volatile (modulo the compiler's implementation of that keyword) should be alright as a replacement for an atomic, because the "write after write" problem disappears. But I've been told to stay away from `volatile` as a pseudo-atomic by more experienced SO people, so I cannot really advocate it. Using the dedicated tools of the language is a safe bet. – didierc Mar 18 '13 at 01:31
  • @MaciejPiechotka I don't know how costly this could be: the way I see it, upon a DMA interrupt, the CPU just needs to check the ranges of cache lines and mark them invalidated if they cover an updated segment of memory (i.e. just flipping a bit). Then the update can be made "lazily", with the CPU reloading them "on demand", i.e. precisely when it must load from an invalidated cache line. Can you elaborate a little on the cost part? – didierc Mar 18 '13 at 01:38
  • @MaciejPiechotka I guess my idea of handling cache lines with DMA is a bit odd. nevermind :) – didierc Mar 18 '13 at 01:45
  • @didierc: As for `volatile` - it still allows CPU memory reordering even with one producer (not to mention the lack of an atomic CAS or even an atomic add on a single CPU). As for cost - every time a unit writes something to memory it needs to request the data in exclusive mode. It needs to a) store the state and b) broadcast to the other devices. The devices need to handle it, reply, etc. On a SoC everything is near and can have a physical wire, hence low time overhead. Off-chip, requests have to travel a bigger distance, deal with interference, etc. – Maciej Piechotka Mar 18 '13 at 01:50
  • To put things into perspective - with a modern desktop CPU, light travels about 10 cm every clock cycle. There is simply no time to access a device on the other side of the motherboard and keep up (OK, we can allow multi-cycle accesses, but there are delays on gates etc.; the 10 cm is more to illustrate that on modern hardware there is not much time to do anything). – Maciej Piechotka Mar 18 '13 at 01:53
  • Ok, I think I got your point regarding cache invalidation cost. A non cacheable strategy is perhaps better, although reading directly from memory is pretty much the same as updating a cache line - anyway. I don't understand your point on reordering though. If only one thread is updating a volatile var., then there cannot be any race: readers will eventually get the updated value. – didierc Mar 18 '13 at 02:00
  • @didierc I think the point is that in some architectures with relaxed memory models, writes might not be committed to memory in the instruction order, so readers will eventually see all updated values but not necessarily in the right order – Stephen Lin Mar 18 '13 at 02:28