
This article: http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484?pgno=2 says that the compiler can't do any optimization for `volatile`, not even this (where `volatile int& v = *(address);`):

v = 1;                // C: write to v
local = v;            // D: read from v

can't be optimized to this:

v = 1;                // C: write to v
local = 1;            // D: read from v  // but it can be done for std::atomic<>

This can't be done because, between the first and second lines, the value of v may be changed by a hardware device mapped to this memory location (not by another CPU, where cache coherence would keep things in sync, but by a network adapter, GPU, FPGA, etc.). But that only makes sense if v can't be cached in the CPU's L1/L2/L3 caches, because for an ordinary (non-volatile) variable the time between the two lines is so short that the read would almost certainly hit the cache.

Does the `volatile` qualifier guarantee that this memory location is not cached?

ANSWER:

  1. No, `volatile` does not guarantee that this memory location is uncached; neither the C/C++ standards nor compiler manuals say anything about this.
  2. When using a memory-mapped region, memory mapped from device memory into CPU address space is already marked as WC (write combining) instead of WB, which disables caching, so no cache flushing is needed.
  3. Conversely, if CPU memory is mapped into the device's address space, then the PCIe controller, located on the CPU die, snoops the data flowing over DMA from the device and updates (invalidates) the CPU's L3 cache. In that case, if code running on the device uses `volatile` to perform the same two lines, it likewise bypasses the device's cache (e.g. the GPU's L2 cache), so neither the GPU cache nor the CPU cache needs explicit flushing. On the CPU side you might still need std::atomic_thread_fence(std::memory_order_seq_cst); if the L3 cache (LLC) is coherent with DMA over PCIe but L1/L2 are not. And for NVIDIA CUDA we can use: void __threadfence_system();
  4. We do need to flush DMA controller caches when sending unaligned data (WDK: KeFlushIoBuffers(), FlushAdapterBuffers()).
  5. Also, we can mark any memory region as uncacheable (WC) ourselves via the MTRR registers.
  • Memory mapped IO registers used to be very, *very* common back in the days when c was maturing. They are the usual example of what `volatile` is for. Any such optimization of access to that kind of device would break the programmers intent. – dmckee --- ex-moderator kitten Aug 31 '13 at 17:30
  • FYI, stackoverflow.com encourages answering your own question. When you do answer your own question, I am pretty sure it is best to answer in a separate answer. – Trevor Boyd Smith May 06 '15 at 14:57

2 Answers


volatile ensures that the variable won't be "cached" in a CPU register. The CPU cache is transparent to the programmer: if one CPU writes to memory held in another CPU's cache, the second CPU's cache line gets invalidated, so it will reload the value from memory on the next access.

Something about Cache coherence

As for external memory writes (via DMA or another CPU-independent channel), you might need to flush the cache manually (see this SO question).


C Standard §6.7.3 7:

What constitutes an access to an object that has volatile-qualified type is implementation-defined.

  • Thanks, but I mean hardware devices, which can not communicate through cache coherency by using mapped memory. Now I highlighted it in bold. – Alex Aug 31 '13 at 17:26
  • Yes, it wasn't clear from the question before. I'm looking for more information right now. – Erbureth Aug 31 '13 at 17:27
  • Thanks! Only answer is: `volatile` **doesn't cancel caching**, because memory mapped from device memory to CPU-memory is **already marked as WC** (write combining) instead of WB, **that cancels the caching**. And cache need not to be flushed, isn't it? Quote:"For benchmarking purposes, the simplest solution is probably copying a large memory block to a region marked with WC(write combining) instead of WB. The memory mapped region of the graphics card is a good candidate, or you can mark a region as WC by yourself via the MTRR registers." http://stackoverflow.com/a/1757198/1558037 – Alex Aug 31 '13 at 17:53
  • In addition: An opposite, if CPU-memory mapped to the device memory, then incidentally, the controller PCIE, located on crystal of CPU, is snooping for data which going through DMA from this device, and updates(invalidate) CPU-cache L3. In this case, if the executable code on the device using the `volatile` tries to perform the same two lines, it also cancels the cache memory of the device, e.g. in the cache L2 (GPU). http://stackoverflow.com/a/12028433/1558037 – Alex Aug 31 '13 at 18:03
  • @Alex, that's true. volatile usually just prevents the compiler from doing certain optimizations when reading/writing the location. It's even compiler specific what exactly volatile does. See e.g. http://gcc.gnu.org/onlinedocs/gcc/Volatiles.html for gcc – nos Aug 31 '13 at 18:04
  • @Erbureth Thanks for your edited answer, but why did you write "you might need to flush"? According to your link about the memory-mapped region, it is explicitly stated that this is not needed. Or are there cases where it may be necessary for a memory-mapped region? – Alex Aug 31 '13 at 18:13
  • @Alex According to [this](http://stackoverflow.com/a/10140118/624664) SO answer, marking memory un-cacheable is only one approach to maintain cache coherency, so I would not universally count on it. E. g. see [MSDN](http://msdn.microsoft.com/en-us/library/windows/hardware/ff545924%28v=vs.85%29.aspx) edit: I see it is about DMA cache, not CPU – Erbureth Aug 31 '13 at 18:18
  • @nos Thanks for the link; yes, there is nothing about caching for `volatile`. – Alex Aug 31 '13 at 18:18

The semantics of volatile are implementation-defined. If a compiler knew that interrupts would be disabled while a certain piece of code was executed, and knew that on the target platform there would be no means other than interrupt handlers by which operations on certain storage could be observed, it could register-cache volatile-qualified variables within such storage just as it caches ordinary variables, provided it documented that behavior.

Note that which aspects of behavior count as "observable" may be defined, in some measure, by the implementation. If an implementation documents that it is not intended for use on hardware that uses main-RAM accesses to trigger required externally visible actions, then accesses to main RAM would not be "observable" on that implementation. Such an implementation would still be compatible with hardware capable of physically observing those accesses, as long as nothing cared whether the accesses were actually seen. If such accesses were required, however (that is, if they were regarded as "observable"), the compiler would not be claiming compatibility and would thus make no promise about anything.
