How to read stale values on x86

Question

My goal is to read in stale and outdated values of memory without cache-coherence. I have attempted to use prefetchnta to perform a non-temporal load, but it failed to fetch outdated values. I am looking into performing some kind of Streaming Memory-to-Memory Direct-Memory-Access, but am having a little trouble due to the overwhelming amount of background knowledge required to proceed with my current project. Currently I am attempting to mess around with udmabuf but even that is going slowly. It should be noted that ideally I would like to ignore the contents of all CPU caches, including the current CPU.

To explain why: I am developing software that can be used to prove the correctness of programs written for non-volatile memory. Because the CPU's write-back cache is volatile, the arbitrary order in which cache lines get written back to memory needs to be observed.

I would sincerely appreciate it if someone could give me some pointers of how to proceed. I do not mind digging into the Linux kernel, as in fact I am doing that now, nor do I mind modifying it, I just need a little guidance in the right direction.

I'm not sure I understand how the cache is involved here, but if you want to bypass it entirely wouldn't it be easier to disable it altogether? I can't think of a way to bypass a cached value but I'd try with mapping the same page as UC/UC- and try using that. Possibly from another core if the CPU somehow still read the cached line or write it back (and you don't want this). IIRC Intel warns the developer about multiple mapping with different cache types, which may indeed be good in this case. — Margaret Bloom, Nov 04 '18 at 17:29
Where is the stale value supposed to come from? memory or cache? By NVM you mean secondary storage or persistent memory? How does your program work basically? — Hadi Brais, Nov 04 '18 at 22:04
@HadiBrais Stale value would come from the CPU cache, and I mean persistent memory. The tool that I am developing is supposed to simulate power-failure by 'freezing' the process, taking a snapshot of user-specified areas of memory, and then 'unfreeze' the process. The snapshot should contain stale values, which is what would be in main memory after recovering from the power failure. My primary issue is obtaining the 'snapshot' of memory. — Louis Jenkins, Nov 04 '18 at 22:27
@MargaretBloom That does sound like a good idea! Btw what does UC/UC- stand for? Uncached/Uncached-? — Louis Jenkins, Nov 04 '18 at 22:27
Louis, you should carefully read the sections of the Intel SDM on caching and then return here if you still have questions. — prl, Nov 04 '18 at 22:51
If the memory snapshot contains stale value then how can stale values come from the cache? When storing to persistent memory, the most recent values would be always in the cache, and potentially in memory. I don't see how the cache may contain stale values but the memory contains up-to-date values. You are saying that a snapshot of memory would be captured and that may contain stale values (which makes perfect sense), but then stale values would come from the cache (which makes no sense to me). — Hadi Brais, Nov 05 '18 at 01:53
Note that you can just forcibly terminate the process after taking a snapshot and then see if you can recover from that snapshot correctly. — Hadi Brais, Nov 05 '18 at 01:54
@LouisJenkins Yes, they mean UnCached and Strongly UnCached (UC-) where speculation is also disabled. — Margaret Bloom, Nov 05 '18 at 13:23

Peter Cordes · Accepted Answer · 2022-06-09T18:57:51.850

I haven't played around with this, but my understanding from the docs is that for loads (unlike NT stores) nothing can bypass cache or override the strong ordering of memory types like the normal WB (write-back). And even NT stores evict already-cached data, so they can't break coherence for this or another core that has cached data for the line you're writing.

You can do weakly-ordered loads from WC (write-combining) memory regions (with prefetchnta or SSE4.1 movntdqa), but they're probably still coherent at the physical address level.
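To make that concrete, here is a minimal sketch (my own illustration, not something I've benchmarked) of why NT stores and NT loads don't help on ordinary WB memory: the core still sees its own most recent value. The function names and the `slot` buffer are hypothetical, and the code assumes GCC or Clang on x86-64; on other targets it falls back to plain loads and stores.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <immintrin.h>
#define HAVE_X86_NT 1
#endif

/* Hypothetical scratch buffer, 64-byte aligned so the 16-byte movntdqa
 * load below is legal. It lives in ordinary WB memory. */
static int slot[16] __attribute__((aligned(64)));

#ifdef HAVE_X86_NT
/* movntdqa needs SSE4.1; the target attribute avoids requiring -msse4.1. */
__attribute__((target("sse4.1")))
static int nt_load_first_lane(void *aligned16)
{
    __m128i v = _mm_stream_load_si128((__m128i *)aligned16);
    return _mm_cvtsi128_si32(v);   /* lowest 32-bit element */
}
#endif

/* Cache the line with an ordinary store, overwrite it with an NT store
 * (movnti), then read it back with an ordinary load. */
int nt_store_then_load(int oldval, int newval)
{
    volatile int *p = &slot[0];
    *p = oldval;                        /* line is now (likely) in cache */
#ifdef HAVE_X86_NT
    _mm_stream_si32(&slot[0], newval);  /* NT store to the same line */
    _mm_sfence();
#else
    *p = newval;                        /* non-x86 fallback: plain store */
#endif
    return *p;
}

/* Write with ordinary stores, then read back with an NT load (movntdqa).
 * On WB memory this behaves like a normal load. */
int store_then_nt_load(int oldval, int newval)
{
    volatile int *p = &slot[0];
    assert(((uintptr_t)slot & 15) == 0);
    *p = oldval;
    *p = newval;
#ifdef HAVE_X86_NT
    return nt_load_first_lane(slot);
#else
    return *p;                          /* non-x86 fallback: plain load */
#endif
}
```

Calling `nt_store_then_load(1, 2)` or `store_then_nt_load(7, 9)` gives back the new value in both cases, never the stale one, which matches what the asker saw with prefetchnta.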

@MargaretBloom commented:

"IIRC Intel warns the developer about multiple mapping with different cache types, which may indeed be good in this case."

so maybe you could actually bypass cache coherence with multiple virtual mappings of the same physical page.
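For reference, this is roughly what "multiple virtual mappings of the same physical page" looks like from user space. It's only a sketch under assumptions: the function name is made up, it uses a `tmpfile()`-backed page, and both mappings get the default WB memory type, so they stay fully coherent. Getting one alias mapped UC or WC would need kernel-side changes to the page attributes, which this does not attempt.

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno() and ftruncate() */
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one page of an unnamed temporary file at two different virtual
 * addresses, write through the first mapping, and return what the second
 * mapping reads. Returns -1 if any setup step fails. */
int write_via_a_read_via_b(int value)
{
    int result = -1;
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;

    int fd = fileno(f);
    long page = sysconf(_SC_PAGESIZE);
    if (page > 0 && ftruncate(fd, page) == 0) {
        void *ma = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        void *mb = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (ma != MAP_FAILED && mb != MAP_FAILED) {
            volatile int *a = ma;
            volatile int *b = mb;
            assert(ma != mb);   /* two virtual addresses, one physical page */
            *a = value;         /* store through the first alias */
            result = *b;        /* load through the second alias */
        }
        if (ma != MAP_FAILED)
            munmap(ma, (size_t)page);
        if (mb != MAP_FAILED)
            munmap(mb, (size_t)page);
    }
    fclose(f);
    return result;
}
```

With both aliases WB, `write_via_a_read_via_b(42)` just returns 42. The interesting (and unverified) case is what happens once the two aliases have different memory types.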

I don't know if it's possible to do non-coherent DMA with a PCI / PCIe device, but that might be your only hope for getting actual DRAM contents without going through cache.

Normally (always?) DMA on modern x86 systems is cache-coherent, which is good for performance. To maintain backwards compatibility with 386 and earlier CPUs without caches, the first x86 CPUs with caches had cache-coherent DMA, and cache-control instructions were not introduced until later generations, since existing OSes didn't use them. In modern systems, memory controllers are built into the CPU. So on Intel CPUs, the system agent can snoop L3 tags to see if a line is cached anywhere on-chip in parallel with sending the request to the memory controller. Or a Xeon can DMA right into L3 cache without data having to bounce through DRAM, which is good for high-bandwidth NICs.

There's an INVD instruction which invalidates all caches without doing write-back first, but I think that includes the shared L3 cache, and probably the private caches of all other cores. So you can't practically use it on a Linux system where other cores are potentially in the middle of doing stuff; you'd potentially corrupt kernel data structures by using it, as well as simulating power failure on a machine with NVDIMMs for the process you were interested in.

Maybe if you somehow offlined all the other CPU cores, and disabled interrupts on the one core that was still up, you could:

1. wbinvd (write-back + invalidate) to flush all caches
2. then run some code under test
3. then invd and see what made it to DRAM

Then re-enable interrupts. Interrupt handlers could end up with some kernel data cached and some in memory, or get device drivers out of sync with hardware, if any interrupts are handled between the wbinvd and the invd.
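The real instructions are privileged, so I can't show that sequence as runnable user-space code. As a stand-in, here is a toy software model of a write-back cache in front of "DRAM", with the wbinvd and invd semantics described above. Everything in it is hypothetical (`struct model`, `model_flush_line`, and so on), and it deliberately ignores spontaneous evictions, which a real cache does at arbitrary times.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MODEL_WORDS = 8 };

/* Toy model: one "cache line" per word, sitting in front of "DRAM". */
struct model {
    int  dram[MODEL_WORDS];
    int  cache[MODEL_WORDS];
    bool valid[MODEL_WORDS];
    bool dirty[MODEL_WORDS];
};

/* Store hits the cache only; DRAM keeps the stale value until write-back. */
void model_store(struct model *m, size_t i, int v)
{
    assert(i < MODEL_WORDS);
    m->cache[i] = v;
    m->valid[i] = true;
    m->dirty[i] = true;
}

/* Load returns the cached value on a hit, otherwise fills from DRAM. */
int model_load(struct model *m, size_t i)
{
    assert(i < MODEL_WORDS);
    if (!m->valid[i]) {
        m->cache[i] = m->dram[i];
        m->valid[i] = true;
        m->dirty[i] = false;
    }
    return m->cache[i];
}

/* Explicit flush of one line, as persistent-memory code would issue. */
void model_flush_line(struct model *m, size_t i)
{
    assert(i < MODEL_WORDS);
    if (m->valid[i] && m->dirty[i]) {
        m->dram[i] = m->cache[i];
        m->dirty[i] = false;
    }
}

/* wbinvd: write back every dirty line, then invalidate everything. */
void model_wbinvd(struct model *m)
{
    for (size_t i = 0; i < MODEL_WORDS; i++) {
        model_flush_line(m, i);
        m->valid[i] = false;
    }
}

/* invd: invalidate everything WITHOUT writing back; dirty data is lost. */
void model_invd(struct model *m)
{
    for (size_t i = 0; i < MODEL_WORDS; i++) {
        m->valid[i] = false;
        m->dirty[i] = false;
    }
}

/* The three steps: wbinvd, run "code under test" (one store, optionally
 * flushed), invd, then see what made it to DRAM. */
int model_survives(bool flush_before_invd)
{
    struct model m = {0};
    model_wbinvd(&m);
    model_store(&m, 0, 42);
    if (flush_before_invd)
        model_flush_line(&m, 0);
    model_invd(&m);
    return model_load(&m, 0);
}
```

In this model `model_survives(true)` returns 42 and `model_survives(false)` returns 0, i.e. the unflushed store is what a power failure would lose. A snapshot tool that has to run on a live Linux system would presumably have to track stores and flushes in software like this, rather than rely on invd.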

Update: someone did actually attempt this:

How to run "invd" instruction with disabled SMP support?
How to explicitly load a structure into L1d cache? Weird results with INVD with CR0.CD = 1 on isolated core with/without hyperthreading - invd worked so well it nuked some of the stores done by printk in the mis-designed attempt to log something about it.

Nice idea! Linux supports CPU hot plugging. PCIe transactions have a "no snoop" attribute bit, when set cache coherence is not *required*. However they don't say it's *forbidden*. I think the PCIe root complex will honour that bit and route the transaction directly to the memory controller. — Margaret Bloom, Nov 05 '18 at 14:04
Thank you, you didn't just answer my question, but gave me three separate approaches to obtaining a solution. Truly appreciate this as now I have a path forward. — Louis Jenkins, Nov 05 '18 at 14:19
