This article: http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484?pgno=2
says that the compiler can't perform any optimization on volatile accesses, not even one such as the following (where volatile int& v = *(address);):
v = 1; // C: write to v
local = v; // D: read from v
can't be optimized to this:
v = 1; // C: write to v
local = 1; // D: read from v replaced by a constant; allowed for std::atomic<>, but not for volatile
This can't be done because, between the 1st and 2nd lines, the value of v may be changed (sequentially or concurrently) by a hardware device mapped to this memory location. Such a device is not a CPU, so cache coherence may not work for it: a network adapter, GPU, FPGA, etc. But this only makes sense if v can't be cached in the CPU's L1/L2/L3 cache, because for an ordinary (non-volatile) variable the time between the 1st and 2nd lines is so short that the read will most likely hit the cache.
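A minimal sketch of the difference described above, assuming an ordinary int stands in for the device-mapped location (the function names write_then_read_volatile and write_then_read_atomic are hypothetical, not from the article). With volatile, the compiler has to emit both the store and a real load; with std::atomic<>, it is in principle allowed, though not required, to forward the stored value to the read. Neither version says anything about CPU caches.

```cpp
#include <atomic>

// Hypothetical helper: v refers to a location that a device could modify.
// The compiler must emit the store (C) and then an actual load (D).
int write_then_read_volatile(volatile int& v) {
    v = 1;          // C: write to v
    int local = v;  // D: read from v, may not be replaced by the constant 1
    return local;
}

// Hypothetical helper: the same sequence on std::atomic<int>.
// Here the compiler is permitted (not obliged) to fold the load into 1.
int write_then_read_atomic(std::atomic<int>& a) {
    a.store(1);            // C: write
    int local = a.load();  // D: read, may legally be optimized to local = 1
    return local;
}
```

Comparing the generated assembly of the two functions (for example with -O2) is one way to see whether a particular compiler actually applies the optimization.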
Does the volatile qualifier guarantee no caching for this memory location?
ANSWER:
- No, volatile doesn't guarantee no caching for this memory location, and there is nothing about this in the C/C++ Standards or in the compiler manuals.
- When using a memory-mapped region where device memory is mapped into the CPU's address space, the region is already marked as WC (write combining) instead of WB (write back), which disables caching. No cache flushing is needed.
- In the opposite case, where CPU memory is mapped into the device's address space, the PCIe controller located on the CPU die snoops the data going through DMA from this device and updates (invalidates) the CPU's L3 cache. In this case, if the code executing on the device uses volatile to perform the same two lines, that also bypasses the device's cache (e.g. the GPU's L2 cache). Neither a GPU cache flush nor a CPU cache flush is needed. On the CPU you might also need to use std::atomic_thread_fence(std::memory_order_seq_cst); if the L3 cache (LLC) is coherent with DMA over PCIe but L1/L2 are not. For NVIDIA CUDA we can use: void __threadfence_system();
- We need to flush the DMA controller's cache when sending unaligned data (WDK: KeFlushIoBuffers(), FlushAdapterBuffers()).
- Also, you can mark any memory region as uncached or WC yourself via the MTRR registers.
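The CPU-side fence mentioned in the list above could be used roughly as follows. This is only a sketch under stated assumptions: ordinary memory stands in for a DMA-visible region, and the names Mailbox, publish and consume are hypothetical. Formally, the C++ Standard defines fences only relative to atomic operations; that the fence also orders these volatile accesses relies on how compilers and CPUs behave in practice, and whether it is sufficient on real hardware depends on how the region is mapped (WB, WC, uncached).

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout of a region shared with a device (here: plain memory).
struct Mailbox {
    volatile std::uint32_t payload;  // data the other side will read
    volatile std::uint32_t ready;    // flag polled by the other side
};

// Producer side: write the payload, fence, then raise the flag.
void publish(Mailbox& m, std::uint32_t value) {
    m.payload = value;
    // Full fence, intended to keep the payload store ahead of the flag store
    // (assumption: the platform honors this for the mapped region).
    std::atomic_thread_fence(std::memory_order_seq_cst);
    m.ready = 1;
}

// Consumer side: returns true and fills out once the flag has been raised.
bool consume(Mailbox& m, std::uint32_t& out) {
    if (m.ready == 0) return false;  // volatile read, performed on every call
    std::atomic_thread_fence(std::memory_order_seq_cst);
    out = m.payload;                 // read the payload only after the flag
    m.ready = 0;                     // hand the mailbox back to the producer
    return true;
}
```

On the CUDA side, __threadfence_system() would play the corresponding role for code running on the device.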