In section 6.3.2 of this excellent paper, Ulrich Drepper writes about software prefetching. He says the test uses the "familiar pointer chasing framework", which I gather is the test he gives earlier about traversing randomized pointers. It makes sense in his graph that performance tails off when the working set exceeds the cache size, because then we go to main memory more and more often.
But why does prefetch help only 8% here? If we are telling the processor exactly what we want to load, and we tell it far enough ahead of time (he does it 160 cycles ahead), why isn't every access satisfied by the cache? He doesn't mention his node size, so perhaps there is some waste from fetching a full cache line when only part of the data is needed.
I am trying to use _mm_prefetch with a tree and I see no noticeable speed up. I'm doing something like this:
_mm_prefetch((const char *)pNode->m_pLeft, _MM_HINT_T0);
// do some work
traverse(pNode->m_pLeft);
traverse(pNode->m_pRight);
Now that should only help one side of the traversal, but I see no change in performance at all. I did add /arch:SSE to the project settings. I'm using Visual Studio 2012 with an i7 4770. In this thread a few people also talk about getting only a 1% speedup with prefetch. Why does prefetch not work wonders for random access to data that's in main memory?
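For concreteness, here is a minimal self-contained sketch of the pattern I described. The Node layout, the m_value field, the PREFETCH macro, and summing values as the "work" are placeholders I made up for illustration, not my real code. The non-SSE fallback to __builtin_prefetch is only there so the sketch builds with GCC or Clang on other targets.

```cpp
#include <cstddef>

#if defined(_MSC_VER) || defined(__SSE__)
#include <xmmintrin.h>
// Request the cache line at p into all cache levels (T0 hint).
#define PREFETCH(p) _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0)
#else
// Assumption: GCC/Clang builtin as a stand-in on non-SSE targets.
#define PREFETCH(p) __builtin_prefetch(p)
#endif

// Hypothetical node type; only the pointer names match the snippet above.
struct Node {
    Node* m_pLeft = nullptr;
    Node* m_pRight = nullptr;
    long  m_value = 0;  // placeholder payload
};

// Pre-order traversal that prefetches the left child before doing the
// per-node work. Prefetching a null pointer is harmless: the hint does
// not fault on an invalid address.
long traverse(const Node* pNode) {
    if (pNode == nullptr) {
        return 0;
    }
    PREFETCH(pNode->m_pLeft);

    long sum = pNode->m_value;  // stand-in for "do some work"

    sum += traverse(pNode->m_pLeft);
    sum += traverse(pNode->m_pRight);
    return sum;
}
```

In this sketch the only thing between the prefetch and the first use of the left child is the per-node work, so that work is all the time the prefetch has to complete.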