Software prefetching across page boundary on x86

Question

My understanding is that hardware prefetching will never cross page boundaries. I'm wondering whether a software prefetch has the same restriction, i.e. can I use a software prefetch to avoid a future TLB miss? From searching around, it appears to be possible, but I couldn't find anything definitive in the documentation, so a reference would be good.

I'm specifically interested in Nehalem, Sandy Bridge and Westmere.

Update: IvyBridge does do HW prefetch across page boundaries. https://stackoverflow.com/a/20758769/224132. It's a new feature in IvB, and from other things I've read I think it's accurate to say that SnB and earlier Intel don't prefetch into the next page. Speculative TLB loads are a thing, though, at least when triggered by speculative execution of a load/store instruction. — Peter Cordes, Sep 02 '17 at 00:20

score 2 · Answer 1 · answered Feb 09 '13 at 00:56

According to Intel's Optimization Reference Manual, it depends on the processor. From section 7.4.3:

There are cases where a PREFETCH will not perform the data prefetch. These include:

PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.

An access to the specified address that causes a fault/exception.

Software prefetching may or may not avoid TLB misses, depending on the processor. It will not fetch the data if it would cause a page fault.

If you want to ensure you avoid TLB misses, you could do a dummy read to load the data instead of using a prefetch instruction. This could cause a page fault to swap in a page, which could be either good or bad depending on your use case.
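A minimal sketch of the two options in C, for illustration only. It assumes GCC or Clang, where `__builtin_prefetch` typically compiles to one of the x86 PREFETCH instructions; the helper names `hint_page` and `touch_page` are made up here and are not from Intel's manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Non-faulting hint: asks the CPU to start fetching the line at p.
 * Whether a DTLB miss is resolved or the hint is dropped depends on
 * the microarchitecture. Assumes GCC/Clang. */
static inline void hint_page(const void *p)
{
    __builtin_prefetch(p, 0 /* read */, 3 /* high temporal locality */);
}

/* Dummy read: a real load, so the address translation must complete.
 * It can page-fault, so p must point at memory that is valid to read. */
static inline uint8_t touch_page(const void *p)
{
    assert(p != NULL);
    return *(const volatile uint8_t *)p;
}
```

The `volatile` cast is there so the compiler does not discard the load when the result is unused.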

answered Feb 09 '13 at 00:56

ughoavgfhw

The families of CPU mentioned by OP are not pentium4 class CPUs. – didierc Feb 09 '13 at 02:30
@didierc I wouldn't know, but I copied that directly from Intel's manual, and [wikipedia](http://en.wikipedia.org/wiki/List_of_Intel_Pentium_4_microprocessors) lists all of them except model 0 as Pentium 4. – ughoavgfhw Feb 09 '13 at 03:12
Good pointer. OP's interested in the latest generations of x86-64 CPUs (Intel i5 & i7 class, afaik). I was just trying to help narrow down your answer a little. My bad if it sounded harsh, it wasn't my intent. – didierc Feb 09 '13 at 03:34
Yeah, I couldn't find anything for the later generations I mentioned. I presume that the dummy read is rather more expensive than an async prefetch, though obviously that has some overhead too. – jmetcalfe Feb 09 '13 at 10:26
@jmetcalfe: Yes, the dummy read can't retire until it completes, even if nothing uses the result. Since the ROB (reorder buffer) is only ~168 entries in Sandybridge (http://www.realworldtech.com/sandy-bridge/5/), that's as little as 168/4-per-clock = 42 cycles of latency-hiding before it blocks new instructions from entering the out-of-order part of the core. A page-walk + cache miss can take *much* longer than that. A `prefetch` instruction should be *much* better than a dummy read for triggering page-walks early. – Peter Cordes Sep 02 '17 at 00:27
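The latency-hiding estimate in that last comment can be written out as a small C helper. This is only a back-of-the-envelope sketch: the function name `rob_fill_cycles` is hypothetical, and it assumes the front end keeps issuing at full width behind the stalled load.

```c
#include <assert.h>

/* Cycles until the reorder buffer fills behind a load that cannot
 * retire, assuming instructions keep entering at full issue width. */
static unsigned rob_fill_cycles(unsigned rob_entries, unsigned issue_width)
{
    assert(issue_width > 0);
    return rob_entries / issue_width;
}
```

With the figures quoted for Sandy Bridge, `rob_fill_cycles(168, 4)` returns 42.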

score 2 · Answer 2 · answered Sep 01 '17 at 10:36

On the processors you mention (Nehalem, Westmere and Sandy Bridge), a software prefetch that misses the DTLB is not dropped: it triggers the page walk and the data is then fetched, even across a page boundary.

From section 7.3.3 of Intel's Optimization Reference Manual:

In older microarchitectures, PREFETCH causing a Data Translation Lookaside Buffer (DTLB) miss would be dropped. In processors based on Nehalem, Westmere, Sandy Bridge, and newer microarchitectures, Intel Core 2 processors, and Intel Atom processors, PREFETCH causing a DTLB miss can be fetched across a page boundary.
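As a rough illustration of how one might use this, here is a C sketch that streams through a buffer and issues one software prefetch per page, aimed one page ahead. It assumes GCC or Clang (`__builtin_prefetch`) and 4 KiB pages; `PAGE_BYTES` and `sum_with_page_prefetch` are names invented for this example, and whether it actually helps would need measuring on the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES ((size_t)4096) /* assumption: 4 KiB pages */

/* Sum a buffer; once every PAGE_BYTES bytes, hint the address one
 * page ahead so the page walk can start before the loop gets there. */
uint64_t sum_with_page_prefetch(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        /* buf + i + PAGE_BYTES always lies in the page after the one
         * holding buf + i; the bound check keeps the hint in range. */
        if (i % PAGE_BYTES == 0 && i + PAGE_BYTES < len) {
            __builtin_prefetch(buf + i + PAGE_BYTES, 0 /* read */, 3);
        }
        sum += buf[i];
    }
    return sum;
}
```

One page ahead is an arbitrary distance chosen to keep the sketch simple; a real loop would tune it against the cost of a page walk plus cache miss.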
