7

My understanding is that hardware prefetching will never cross page boundaries. I'm wondering whether a software prefetch has the same restriction, i.e. can I use a software prefetch to avoid a future TLB miss? From searching around, it appears to be possible, but I couldn't find anything definitive in the documentation, so a reference would be good.

I'm specifically interested in Nehalem, Sandy Bridge and Westmere.

jmetcalfe
  • Update: IvyBridge does do HW prefetch across page boundaries. https://stackoverflow.com/a/20758769/224132. It's a new feature in IvB, and from other things I've read I think it's accurate to say that SnB and earlier Intel don't prefetch into the next page. Speculative TLB loads are a thing, though, at least when triggered by speculative execution of a load/store instruction. – Peter Cordes Sep 02 '17 at 00:20

2 Answers

2

According to Intel's Optimization Reference Manual, it depends on the processor. From section 7.4.3:

There are cases where a PREFETCH will not perform the data prefetch. These include:

  • PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.
  • An access to the specified address that causes a fault/exception.

Software prefetching may or may not avoid TLB misses, depending on the processor. It will not fetch the data if it would cause a page fault.

If you want to ensure you avoid TLB misses, you could do a dummy read to load the data instead of using a prefetch instruction. This could cause a page fault to swap in a page, which could be either good or bad depending on your use case.
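To make the difference between the two options concrete, here is a minimal sketch in C. The helper names `touch_byte` and `prefetch_byte` are made up for illustration, and `__builtin_prefetch` is the GCC/Clang builtin (an assumption about your toolchain; on x86 it is normally lowered to one of the PREFETCHh instructions). Whether the hint actually resolves a DTLB miss still depends on the processor, as described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: a dummy read. This is a real load, so it
 * resolves any TLB miss, but it can also take a page fault, and the
 * load has to complete before the instruction can retire. */
unsigned char touch_byte(const void *p)
{
    assert(p != NULL);
    /* volatile keeps the compiler from discarding the unused load */
    return *(const volatile unsigned char *)p;
}

/* Hypothetical helper: a prefetch hint. It never faults; if the
 * access would fault, the hint is simply not performed. Arguments:
 * 0 = prefetch for read, 3 = keep in all cache levels. */
void prefetch_byte(const void *p)
{
    __builtin_prefetch(p, 0, 3);
}
```

You would call one of these on an address in the page you expect to need soon, e.g. `prefetch_byte(buf + 4096)`, well before the real access.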

ughoavgfhw
  • The families of CPU mentioned by OP are not pentium4 class CPUs. – didierc Feb 09 '13 at 02:30
  • @didierc I wouldn't know, but I copied that directly from Intel's manual, and [wikipedia](http://en.wikipedia.org/wiki/List_of_Intel_Pentium_4_microprocessors) lists all of them except model 0 as Pentium 4. – ughoavgfhw Feb 09 '13 at 03:12
  • Good pointer. The OP is interested in recent generations of x86-64 CPUs (Intel i5 & i7 class, afaik). I was just trying to help narrow down your answer a little. My bad if it sounded harsh; that wasn't my intent. – didierc Feb 09 '13 at 03:34
  • Yeh, I couldn't find anything for the later generations I mentioned. I presume that the dummy read is rather more expensive than async prefetch, though obviously that has some overhead too. – jmetcalfe Feb 09 '13 at 10:26
  • @jmetcalfe: Yes, the dummy read can't retire until it completes, even if nothing uses the result. Since the ROB (reorder buffer) is only ~168 entries in Sandybridge (http://www.realworldtech.com/sandy-bridge/5/), that's as little as 168/4-per-clock = 42 cycles of latency-hiding before it blocks new instructions from entering the out-of-order part of the core. A page-walk + cache miss can take *much* longer than that. A `prefetch` instruction should be *much* better than a dummy read for triggering page-walks early. – Peter Cordes Sep 02 '17 at 00:27
2

In modern processors (Nehalem, Westmere and Sandy Bridge), a software prefetch that misses the DTLB is not dropped: it does indeed trigger the translation and can fetch the data.

From the Intel optimization guide (section 7.3.3):

In older microarchitectures, PREFETCH causing a Data Translation Lookaside Buffer (DTLB) miss would be dropped. In processors based on Nehalem, Westmere, Sandy Bridge, and newer microarchitectures, Intel Core 2 processors, and Intel Atom processors, PREFETCH causing a DTLB miss can be fetched across a page boundary.
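As a rough illustration of how you might use this, the sketch below walks a buffer and, once per page-sized stride, issues a software prefetch for the address one page ahead. The function name and the 4 KiB page size are assumptions made for the example, and `__builtin_prefetch` is the GCC/Clang builtin rather than anything from the Intel guide. Treat it as a starting point to measure, not a guaranteed win.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB pages */

/* Hypothetical example: sum a buffer while hinting the next page. */
uint64_t sum_with_page_prefetch(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        /* Every PAGE_SIZE_ASSUMED bytes of progress, prefetch the
         * address one page ahead, but only while that address is
         * still inside the buffer. */
        if (i % PAGE_SIZE_ASSUMED == 0 && len - i > PAGE_SIZE_ASSUMED)
            __builtin_prefetch(buf + i + PAGE_SIZE_ASSUMED, 0, 3);
        sum += buf[i];
    }
    return sum;
}
```

The stride is counted from the start of `buf`, not from a page-aligned address, so the hint lands somewhere in the next page rather than exactly at its first byte. That is enough to trigger the translation for that page on processors where the prefetch is not dropped.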

jleahy
  • 16,149
  • 6
  • 47
  • 66