
Is it possible to issue a prefetch for an address backed by an MMIO region in a PCIe BAR (and mapped via either UC or WC page table entries)? I am currently issuing a load to this address, which causes the hyperthread to stall for quite some time. There is a non-temporal access hint via PREFETCHNTA, so it seems like prefetching such an address may be possible.

If it is possible, do you know where the prefetched value is stored and what would possibly cause it to become invalidated before I am able to issue a load for it? For example, if I issue a synchronizing instruction such as sfence for something unrelated, would this cause the prefetched value to become invalidated?


From the Intel Software Development Manual:

"Prefetches from uncacheable or WC memory are ignored. ... It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types)."


The PCIe BAR containing the MMIO region is marked as prefetchable, so I am not sure whether that means prefetches will work with it, given the language from the manual above.

Jack Humphries
  • If `prefetchnta` isn't ignored on WC memory, it would probably bring data into LFBs on Intel CPUs (like SSE4.1 `movntdqa` loads from WC - not into any level of cache). `sfence` just orders operations; it doesn't forcibly evict anything. (Except maybe on AMD, where it might wait for the store buffer to drain; I think AMD's `sfence` has much stronger ordering semantics than Intel's, even guaranteed on paper.) `sfence` might also evict *dirty* LFBs (or whatever the AMD equivalent is) to make past NT stores visible before later stores, but I wouldn't expect it to touch clean LFBs from loads. – Peter Cordes Nov 12 '22 at 03:04
  • `mfence` or maybe even `lfence` might evict data in LFBs that got there from `movntdqa` loads, to make sure later `movntdqa` loads get "fresh" data. Related: [Non-temporal loads and the hardware prefetcher, do they work together?](https://stackoverflow.com/q/32103968) and the Intel whitepaper it links about using SSE4.1 movntdqa to copy from WC video RAM back to WB DRAM. – Peter Cordes Nov 12 '22 at 03:07
  • @PeterCordes Hi Peter, thanks a lot for your response. I wonder if it is worth marking the part of the region I want to prefetch as WT? I can then use simple software logic to ensure that what I end up reading is not stale (e.g., with a counter)? – Jack Humphries Nov 12 '22 at 03:38
  • Hmm, yes apparently it's theoretically possible to use WT cacheable mappings on MMIO regions, just not WB. [Mapping MMIO region write-back does not work](https://stackoverflow.com/q/53311131) quotes Dr. Bandwidth. Also [how to do mmap for cacheable PCIe BAR](https://stackoverflow.com/q/11254023) has an answer from him, again saying that the hardware does support WT for this. – Peter Cordes Nov 12 '22 at 03:40
  • 1
    @PeterCordes Thanks Peter. I took a look at the second link which describes a scenario quite similar to what I want to do. I commented on John's answer with a question, though please let me know if you have the answer to that question. Basically, how do you invalidate the cache line from the cache when you want to load in the latest cache line value from the PCIe BAR? CLFLUSH seems to write the cache line to the BAR in addition to invalidating it, which is not what I want. – Jack Humphries Nov 12 '22 at 06:53

1 Answer


I'd like to thank Peter Cordes, John D McCalpin, Neel Natu, Christian Ludloff, and David Mazières for their help with figuring this out!

In order to prefetch, you need to be able to store the results of MMIO reads in the CPU cache hierarchy. You cannot do this with UC or WC page table entries, but you can with WT page table entries.

The only caveat is that when you use WT page table entries, previous MMIO reads with stale data can linger in the cache. You must implement a coherence protocol in software to flush the stale cache lines from the cache and read the latest data via an MMIO read. This is alright in my case because I control what happens on the PCIe device, so I know when to flush. You may not know when to flush in all scenarios though, which could make this approach unhelpful to you.

Here is how I set up my system:

  1. Mark the page table entries that map to the PCIe BAR as WT. You can use ioremap_wt() for this (or ioremap_change_attr() if the BAR has already been mapped into the kernel).
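For illustration, here is a minimal kernel-side sketch of such a mapping, assuming pdev and bar come from the driver's probe path (error handling omitted):

#include <linux/pci.h>
#include <linux/io.h>

/* Map a PCIe BAR with a write-through memory type instead of the usual UC/WC. */
static void __iomem *map_bar_write_through(struct pci_dev *pdev, int bar)
{
        resource_size_t start = pci_resource_start(pdev, bar);
        resource_size_t len   = pci_resource_len(pdev, bar);

        return ioremap_wt(start, len);
}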

  2. According to https://sandpile.org/x86/coherent.htm, the PAT type and the MTRR type can conflict: the MTRR type for the PCIe BAR must also be set to WT, otherwise the WT PAT type is ignored. You can do this with the command below. Be sure to update the command with the PCIe BAR address (which you can see with lspci -vv) and the PCIe BAR size. The size is a hexadecimal value in units of bytes.

echo "base=$ADDRESS size=$SIZE type=write-through" >| /proc/mtrr
  3. As a quick check at this point, you may want to issue a large number of MMIO reads in a loop to the same cache line in the BAR. You should see the cost per MMIO read go down substantially after the first MMIO read. The first MMIO read will still be expensive because you need to fetch the value from the PCIe device, but the subsequent reads should be much cheaper because they all read from the cache hierarchy.
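A rough sketch of that check from kernel code, assuming line points into the WT mapping (the rdtsc usage here is deliberately naive; it only needs to show an order-of-magnitude difference between the first and later reads):

#include <linux/io.h>
#include <linux/printk.h>
#include <linux/types.h>
#include <asm/msr.h>        /* rdtsc() */

/* Time repeated reads of one cache line: the first read should be a slow
 * PCIe round trip, the rest should be cheap cache hits. */
static void time_reads(const volatile u32 __iomem *line)
{
        int i;

        for (i = 0; i < 8; i++) {
                u64 start = rdtsc();

                (void)readl(line);
                pr_info("read %d took %llu cycles\n", i, rdtsc() - start);
        }
}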

  4. You can now issue a prefetch to an address in the PCIe BAR and have the prefetched cache line stored in the cache hierarchy. Linux has the prefetch() function to help with issuing a prefetch.
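For example (the pointer name bar_line is made up; prefetch() is the kernel's wrapper around the architecture's prefetch instruction, which on x86 generally emits a prefetchnta):

#include <linux/prefetch.h>

prefetch((const void *)bar_line);   /* hint: pull this WT-mapped line into the cache */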

  5. You must implement a simple coherence protocol in software to ensure that stale cache lines backed by the PCIe BAR are flushed from the cache. You can use clflush to flush a stale cache line. Linux has the clflush() function to help with this.

A note about clflush in this scenario: since the memory type is WT, each store goes both to the cache line in the cache and to the device, so a cached line is never dirty and, from the CPU's perspective, its contents always match the contents of the MMIO. Therefore, clflush will just invalidate the cache line in the cache -- it will not also write the stale cache line back to the device.
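Putting those two points together, a hypothetical flush-and-reread sequence might look like the following; the trigger (knowing the device has rewritten the line, e.g. via an interrupt) and the name line are assumptions of this sketch, and clflush() comes from asm/special_insns.h, mb() from asm/barrier.h:

clflush((void __force *)line);  /* WT lines are never dirty, so this only invalidates */
mb();                           /* order the flush before the re-read */
u32 value = readl(line);        /* misses in the cache and fetches fresh data from the device */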

  6. Note that in my system, I immediately issue a prefetch after the clflush. However, the code below is incorrect:
clflush(address);
prefetch(address);

This code is incorrect because, according to https://c9x.me/x86/html/file_module_x86_id_252.html, the prefetch could be reordered before the clflush. If that happens, the line brought in by the prefetch would presumably be invalidated when the clflush occurs.

To fix this, according to the link, you should issue cpuid in between the clflush and the prefetch:

unsigned int eax, ebx, ecx, edx;

clflush(address);
cpuid(0, &eax, &ebx, &ecx, &edx);  /* serializing: the prefetch cannot start until the clflush is done */
prefetch(address);

Peter Cordes said it is sufficient to issue an lfence instead of cpuid above.
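A sketch of that lighter-weight variant using kernel helpers (clflush() is in asm/special_insns.h, prefetch() in linux/prefetch.h; treating rmb() as an lfence is an x86-64 assumption -- check asm/barrier.h for your kernel):

clflush(address);
rmb();               /* lfence on x86-64: the later prefetch cannot start
                        until the clflush has completed locally */
prefetch(address);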

Jack Humphries
  • Just to expand on that last point: prefetch instructions can't even start doing anything until they actually execute, because they're implemented as instructions, not some other thing processed during decode or something. And `lfence` prevents that from happening until after all previous instructions have retired ("completed locally"). An `lfence` can't make sure that an earlier `prefetch` is finished (because prefetch effects aren't ordered wrt. fences: https://www.felixcloutier.com/x86/prefetchh), but it can stop a later `prefetch` from starting before any earlier stuff. – Peter Cordes Nov 20 '22 at 01:25
  • On AMD CPUs, this depends on the OS having enabled the MSR that makes `lfence` work that way on AMD CPUs. ([Is LFENCE serializing on AMD processors?](https://stackoverflow.com/q/51844886) - it's never a "serializing instruction" like `cpuid` is, or the upcoming `serialize` in Sapphire Rapids, but it will serialize *execution*, not sending later work to execution units until earlier instructions have completed.) A full serializing instruction like `cpuid` also drains the ROB (like `mfence`). – Peter Cordes Nov 20 '22 at 01:27