Why is prefetch speedup not greater in this example?

Question

In section 6.3.2 of this excellent paper, Ulrich Drepper writes about software prefetching. He says this is the "familiar pointer chasing framework", which I gather is the test he gives earlier about traversing randomized pointers. It makes sense in his graph that performance tails off when the working set exceeds the cache size, because then we are going to main memory more and more often.

But why does prefetch help only 8% here? If we are telling the processor exactly what we want to load, and we tell it far enough ahead of time (he does it 160 cycles ahead), why isn't every access satisfied by the cache? He doesn't mention his node size, so there could be some waste due to fetching a full line when only some of the data is needed?

graph of prefetch improvement

I am trying to use _mm_prefetch with a tree and I see no noticeable speed up. I'm doing something like this:

_mm_prefetch((const char *)pNode->m_pLeft, _MM_HINT_T0);
// do some work
traverse(pNode->m_pLeft);
traverse(pNode->m_pRight);

Now that should only help one side of the traversal, but I see no change at all in performance. I did add /arch:SSE to the project settings. I'm using Visual Studio 2012 with an i7 4770. In this thread a few people also talk about getting only 1% speedup with prefetch. Why does prefetch not work wonders for random access of data that's in main memory?

On modern CPUs it's hard to beat the automatic prefetch unless (a) you have an unusual/predictable access pattern, (b) you *really* know what you're doing, (c) you are prepared to tune for specific CPUs and (d) you have memory bandwidth to spare. — Paul R, May 19 '14 at 15:28
Hmmmm, but what about the graph? How does he still get 1000 cycles/element if he's telling the processor exactly what address he is going to read next? It seems like in the steady state he should be back down below 200 cycles/node regardless of the working set size. All the fetching should happen while he's doing the work on each node. I know my mental model must be leaving a lot out, just not sure what. — Philip, May 19 '14 at 16:05
@Philip No, because as your dataset gets larger you will be prefetching from main memory, with a lot fewer cache hits. Small work sets probably reside in cache completely. — Anycorn, May 19 '14 at 16:30
@Anycorn, right, but the whole idea of a prefetch is to initiate a transfer from main memory prior to needing it. In his example he does 160 cycles of work in each node. Isn't a main memory fetch on the order of 200 cycles? So why does it take 1000 cycles if you initiate a 200-cycle read 160 cycles ahead of needing it? What Paul R and Ulrich report is consistent with my experience: it doesn't help much, but it seems like it _should_ help a lot. — Philip, May 19 '14 at 17:44
@Philip There is more to it than just raw memory latency, eg TLB misses are very costly. — Anycorn, May 19 '14 at 20:44
@PaulR, have you seen a case where `_mm_prefetch` helps on modern processors? The only one I know of is one by Mysticial, and it was less than 10%, which is not very significant. The processor was older, and even then there was a lot of debate in the comments on his answer about whether the 10% was even real (I don't remember which answer it was). — Z boson, May 20 '14 at 07:31
In some relatively rare cases I think it can help, but it's fiendishly difficult to get it right, and ideally it needs to be tuned for a given CPU, clock speed, memory subsystem, etc. I think Intel's ICC can generate prefetch hints automatically, but they don't always seem to help, and may even make things worse. I think this kind of "last few percent" tuning is only really worth pursuing if you're writing a "black box" library function which is going to be widely used, e.g. `memcpy` or an FFT routine. — Paul R, May 20 '14 at 07:40
@Zboson, i've shown a 20% example [here](http://stackoverflow.com/questions/20697215/when-should-we-use-prefetch/20758769#20758769) without too much effort, i'm sure with more work you can grow it even further. It's just that prefetching a tree structure like that isn't useful. You'll run ahead across the leftmost branch, and then get stuck with all the right side children and their respective sub branches. This will still form a linked data structure so you'll still be stuck fetching every one of them before you can get the next one. I'm not saying prefetch won't help, it just will be limited — Leeor, May 22 '14 at 21:53
@Leeor, thanks! Mysticial's link which I talked about is there as well. I'll have to look over these carefully now to see what I can learn. — Z boson, May 23 '14 at 07:03

Peter Cordes · Answer 1 · 2015-08-02T19:22:06.473

Prefetch can't increase the throughput of your main memory, it can only help you get closer to using it all.

If your code spends many cycles on computation before even requesting data from the next node in a linked list, it won't keep the memory 100% busy. A prefetch of the next node as soon as the address is known will help, but there's still an upper limit. That limit is approximately what you'd get with no prefetching and no work between loading a node and chasing the pointer to the next, i.e. the memory system working on a fetch 100% of the time.
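As a rough illustration of "prefetch the next node as soon as the address is known", here is a minimal sketch. The `ListNode` layout, the `do_work` stand-in, and the `PREFETCH_NEXT` wrapper macro are all hypothetical names invented for this example; only `_mm_prefetch`, `_MM_HINT_T0`, and `__builtin_prefetch` are real.

```cpp
#include <cassert>
#include <cstdint>

// Portable wrapper (an assumption of this sketch): _mm_prefetch on x86,
// __builtin_prefetch on other GCC/Clang targets, a no-op elsewhere.
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_NEXT(p) _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0)
#elif defined(__GNUC__)
#define PREFETCH_NEXT(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_NEXT(p) ((void)0)
#endif

// Hypothetical node layout; real nodes will differ.
struct ListNode {
    const ListNode* next;
    std::uint64_t payload;
};

// Stand-in for the per-node computation that the fetch can overlap with.
inline std::uint64_t do_work(std::uint64_t x) {
    for (int i = 0; i < 16; ++i) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    }
    return x;
}

// Issue the hint for the next node first, then do this node's work.
std::uint64_t walk_with_prefetch(const ListNode* head) {
    std::uint64_t acc = 0;
    for (const ListNode* n = head; n != nullptr; n = n->next) {
        PREFETCH_NEXT(n->next);  // a hint only; a null address does not fault
        acc ^= do_work(n->payload);
    }
    return acc;
}

// Same traversal without the hint, for comparison.
std::uint64_t walk_plain(const ListNode* head) {
    std::uint64_t acc = 0;
    for (const ListNode* n = head; n != nullptr; n = n->next) {
        acc ^= do_work(n->payload);
    }
    return acc;
}

// Prefetching never changes results, only (possibly) timing.
void check_walks_agree(const ListNode* head) {
    assert(walk_with_prefetch(head) == walk_plain(head));
}
```

Note that the hint only reaches one node ahead, because the address of the node after that is not known until the next node has arrived. That is why the gain is bounded by how much work there is per node.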

Even prefetching before 160 cycles of work isn't far enough ahead for the data to be ready, according to the graph in that paper. Random access latency is apparently really slow, since DRAM has to open a new page, a new row, and a new column.

I didn't read the paper in enough detail to see how he could prefetch multiple steps ahead, or to understand why a prefetch thread helped more than prefetch instructions. This was on a P4, not Core or Sandybridge microarchitecture, and I don't think prefetch threads are still a thing. (Modern CPUs with hyperthreading have enough execution units and big enough caches that running two independent things on the two hardware threads of each core actually makes sense, unlike in P4, where there were fewer spare execution resources normally going unused for hyperthreading to utilize. I-cache especially was a problem in P4, because it just had that small trace cache.)

If your code already loads the next node essentially right away, prefetching can't magically make it faster. Prefetching helps when it can increase the overlap between CPU computation and waiting for memory. Or maybe in your tests, the ->left pointers were mostly sequential from when you allocated the memory, so HW prefetching worked? If trees were shallow enough, prefetching the ->right node (into last-level cache, not L1) before descending the left might be a win.
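The idea of hinting the ->right child before descending the left subtree could look roughly like the sketch below. `TreeNode`, `traverse`, and the `PREFETCH_FAR` macro are hypothetical names for this example. `_MM_HINT_T2` is the real hint that asks for the line in an outer cache level rather than L1; whether it actually pays off depends on the tree depth and the CPU.

```cpp
#include <cstdint>

// Wrapper macro is an assumption of this sketch: _MM_HINT_T2 on x86,
// low-locality __builtin_prefetch on other GCC/Clang targets, else a no-op.
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_FAR(p) _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T2)
#elif defined(__GNUC__)
#define PREFETCH_FAR(p) __builtin_prefetch((p), 0, 1)
#else
#define PREFETCH_FAR(p) ((void)0)
#endif

// Hypothetical node layout; the question's nodes use m_pLeft / m_pRight.
struct TreeNode {
    const TreeNode* left;
    const TreeNode* right;
    std::uint64_t key;
};

// Pre-order traversal that sums keys. The right child is hinted before the
// left subtree is visited, so its fetch overlaps with the whole left descent.
std::uint64_t traverse(const TreeNode* n) {
    if (n == nullptr) {
        return 0;
    }
    PREFETCH_FAR(n->right);       // needed only after the left subtree is done
    std::uint64_t sum = n->key;   // stand-in for "do some work"
    sum += traverse(n->left);
    sum += traverse(n->right);
    return sum;
}
```

If the left subtree is large, the hinted line may be evicted again before the traversal returns to it, which is one reason this is more plausible for shallow trees.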

Software prefetching is only needed when the access pattern is not recognizable by the CPU's hardware prefetchers. (They're quite good: they can spot patterns with a decent-size stride, and track something like 10 forward streams (increasing addresses). Check http://agner.org/optimize/ for details.)

score -1 · Answer 2 · answered Dec 22 '14 at 13:11

How big is the node you want to prefetch? Because the prefetcher can't exceed the 4K page boundary: if your node is bigger, you will pre-load only a part of the data, while the remaining data will be loaded only after a miss event.
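On node size: a single prefetch instruction requests one cache line, so a node that spans several lines needs one hint per line if all of it is wanted. The sketch below assumes 64-byte lines (typical on current x86, but an assumption here); `prefetch_lines`, `kCacheLine`, and the `PREFETCH_LINE` macro are hypothetical names for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Wrapper macro is an assumption of this sketch: _mm_prefetch on x86,
// __builtin_prefetch on other GCC/Clang targets, a no-op elsewhere.
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_LINE(p) _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0)
#elif defined(__GNUC__)
#define PREFETCH_LINE(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_LINE(p) ((void)0)
#endif

// Assumed cache line size; check your target CPU.
constexpr std::uintptr_t kCacheLine = 64;

// Issue one prefetch hint per cache line overlapped by [p, p + bytes).
// Returns how many hints were issued, which makes the coverage easy to check.
inline std::size_t prefetch_lines(const void* p, std::size_t bytes) {
    if (bytes == 0) {
        return 0;
    }
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t first = addr & ~(kCacheLine - 1);  // round down to a line start
    const std::uintptr_t end = addr + bytes;
    std::size_t issued = 0;
    for (std::uintptr_t a = first; a < end; a += kCacheLine) {
        PREFETCH_LINE(reinterpret_cast<const char*>(a));
        ++issued;
    }
    return issued;
}
```

For a 256-byte node aligned to a line boundary this issues four hints. An unaligned 64-byte node straddles two lines and gets two.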

user1466329
