
While asking about a more specific problem, I discovered that this is the core issue people are not exactly sure about.

The following assumptions can be made:

  • The CPU uses a cache-coherency protocol such as MESI(F) (examples: x86/x86_64 and ARMv7 MP)
  • The variable is of a size that the processor reads and writes atomically (aligned, native word size)
  • The variable is declared volatile

The questions are:

  • If I write to the variable in one thread, will other threads see the change?
  • What is the order of magnitude of the timeframe in which the other threads will see the change?
  • Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?

The question is NOT:

  • Is it safe to use such a variable?
  • about reordering issues
  • about C++11 atomics

This might be considered a duplicate of "In C/C++, are volatile variables guaranteed to have eventually consistent semantics between threads?" and other similar questions, but I think none of them state these clear requirements about the target architecture, which leads to a lot of confusion caused by differing assumptions.

    @MartinJames, in multithreading it is - to my experience - very hard to write reliable tests. Many things work 99.99% of the time while still failing under some timings. I ask this question in order to *understand* how it works in theory and learn what problems I might have not anticipated at all. – Hannah S. Nov 30 '15 at 16:57
    I entirely agree with the previous comment, testing is not useful here. However, the linked question has undefined behaviour, so I see no point in discussing "what if" questions regarding tweaks that don't remove the undefined behaviour. You say "The question is NOT [...]" but it's pointless to answer a question that says "The question is not about doing this correctly, I want to know the performance of undefined behaviour" . – Jonathan Wakely Nov 30 '15 at 17:03
    If you want to work at the hardware level then write asm, and make sure you know what you're doing, don't write C++ with undefined behaviour and try to predict what the compiler will do to undefined code. – Jonathan Wakely Nov 30 '15 at 17:10
    I think that the question you link *is* a duplicate, although the discussion in the comments to the accepted answer is a bit distracting. What the standard says is "An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time." (section 1.10 paragraph 28); there is no mention of `volatile` there. – rici Nov 30 '15 at 17:11
  • Your question is general? `volatile` for arbitrary type? Or specific? Maybe this reference can give some complementary information: http://en.cppreference.com/w/cpp/language/cv (it's not an answer, but I thought maybe interesting to know though). – Ely Nov 30 '15 at 18:15
  • I tried to make it general enough for the typical scenario. My question is about types that can be stored/loaded by the processor atomically. Typically an int should qualify for that. – Hannah S. Nov 30 '15 at 18:17
  • The point is that atomicity (i.e. not reading/writing only part of a memory location) is necessary for correct multithreaded behaviour, **but not sufficient**. You also need memory consistency guarantees, and `volatile` does not give any such guarantees. The C and C++ standards are clear that only _atomic operations_ give such guarantees (which are special operations that include any necessary memory barriers ... the use of atomic here does not just mean no partial reads/writes!). See the links at the bottom of http://cxx.isvolatileusefulwiththreads.com/ for lots more detail. – Jonathan Wakely Dec 02 '15 at 11:04
  • @JonathanWakely: I know that. But C++11 also supports memory_order_relaxed which does not have these guarantees (only atomic loads/stores) - or am I wrong here? My question is whether use of volatile equals use of memory_order_relaxed when used with variables that can be stored/loaded atomically. – Hannah S. Dec 02 '15 at 11:10
  • https://github.com/datacratic/boost-svn/blob/master/boost/atomic/detail/cas32weak.hpp (used for ARM) and https://github.com/datacratic/boost-svn/blob/master/boost/atomic/detail/gcc-x86.hpp seem to indicate that at least the boost guys think so for GCC on ARM and x86. For example a store is implemented as `const_cast(v_) = v;` – Hannah S. Dec 02 '15 at 11:10

5 Answers


Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?

I'm not aware of any single processor with multiple cores that has cache-coherency issues. It might be possible for someone to use the wrong type of processor in a multi-processor board, for example an Intel processor with what Intel calls QPI disabled, but this would cause all sorts of issues.

Wikipedia article about Intel's QPI and which processors have it enabled or disabled:

http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect

– rcgldr

If I write to the variable in one thread, will other threads see the change?

There is no guarantee. If you think there is, show me where you found it.

What is the order of magnitude of the timeframe in which the other threads will see the change?

It can be never. There is no guarantee.

Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?

This is an incoherent question, because you are talking about operations in C++ code, which has to be compiled into assembly code. Even if you have hardware guarantees that apply to assembly code, there's no guarantee that those guarantees "pass through" to C++ code.

But to the extent the question can be answered, the answer is yes. Posted writes, read prefetching, and other kinds of caching (such as what compilers do with registers) exist in real platforms.

– David Schwartz

I'd say no, there is no guarantee. There are implementations using multiple, independent computers where shared data has to be transmitted over a (usually very fast) connection between computers. In that situation, you'd try to transmit data only when it is needed. Transmission might be triggered by mutexes and by the standard atomic functions, for example, but hopefully not by stores into arbitrary local memory, and maybe not by stores to volatile objects.

I may be wrong, but you'd have to prove me wrong.

– gnasher729

Assuming present-day x86/x86_64:

If I write to the variable in one thread, will other threads see the change?

Yes, assuming you use a modern compiler (not a very old or buggy one).

What is the order of magnitude of the timeframe in which the other threads will see the change?

It really depends on how you measure. Basically, this is the memory latency: roughly 200 cycles on the same NUMA node, and about double that on the other node of a 2-node box. It might differ on bigger boxes. If your write gets reordered relative to the point of time measurement, you can get ±50 cycles.

I measured this a few years back and got 60-70ns on 3GHz boxes and double that on the other node.

Do you know of architectures where cache coherency is not enough to ensure cross-CPU / cross-core visibility?

I think cache coherency essentially means visibility. Having said that, I'm not sure Sun SPARC machines have the same cache coherency and relaxed memory model as x86, so I'd test very carefully on them. Specifically, you might need to add memory-release barriers to force flushing of memory writes.

– BitWhistler

Given the assumptions you have described, there is no guarantee that a write of a volatile variable in one thread will be "seen" in another.

Given that, your second question (about the timeframe) is not applicable.

With (multi-processor) PowerPC architectures, cache coherency is not sufficient to ensure cross-core visibility of a volatile variable. There are explicit instructions that need to be executed to ensure state is flushed (and to make it visible across multiple processors and their caches).

In practice, on architectures that require such instructions, the implementations of data synchronisation primitives (mutexes, semaphores, critical sections, etc.) use those instructions, among other things.

More broadly, the volatile keyword in C++ has nothing to do with multithreading at all, let alone anything to do with cross-cache coherency. volatile, within a given thread of execution, translates to a need for things like fetches and writes of the variable not being eliminated or reordered by the compiler (which affects optimisation). It does not translate into any requirement about ordering or synchronisation of the completion of fetches or writes between threads of execution - and such requirements are necessary for cache coherency.

Notionally, a compiler might be implemented to provide such guarantees. I've yet to see any information about one that does so - which is not surprising, as providing such a guarantee would seriously affect performance of multithreaded code by forcing synchronisation between threads - even if the programmer has not used synchronisation (mutexes, etc) in their code.

Similarly, the host platform could also notionally provide such guarantees with volatile variables - even if the instructions being executed don't specifically require them. Again, that would tend to reduce performance of multithreaded programs - including modern operating systems - on those platforms. It would also affect (or negate) the benefits of various features that contribute to performance of modern processors, such as pipelining, by forcing processors to wait on each other.

If, as a C++ developer (as distinct from someone writing code that exploits specific features offered by your particular compiler or host platform), you want a variable written in one thread to be coherently read by another thread, then don't bother with volatile. Perform synchronisation between threads - when they need to access the same variable concurrently - using provided techniques, such as mutexes. And follow the usual guidelines on using those techniques (e.g. use mutexes sparingly, minimise the time they are held, and do as much as possible in your threads without accessing shared variables at all).

– Peter