
In his paper *What Every Programmer Should Know About Memory*, Ulrich Drepper describes a benchmark whose results I cannot quite wrap my head around.

He is discussing caches and prefetching. He first shows a test that walks an array of 16-byte elements (one pointer and one 64-bit integer; each element holds a pointer to the next one, but that is not really relevant here) and, for each element, increments its value by one.
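To make the access pattern concrete, here is a minimal sketch of how I picture the element layout and this first ("Inc") loop; the struct and names are my own reconstruction, not Drepper's actual code:

```c
/* 16-byte element on a 64-bit machine: one pointer to the next element
   plus one 64-bit integer payload; the elements are laid out
   sequentially in memory. */
struct l {
    struct l *n;   /* pointer to the next element */
    long int  pad; /* 64-bit value that the test modifies */
};

/* "Inc": walk the array and increment each element's own value. */
void test_inc(struct l *head)
{
    for (struct l *p = head; p != NULL; p = p->n)
        p->pad++;
}
```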

He then shows another test that walks the same array, but this time it stores into each element the sum of that element's value and the value of the next element.
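Sketched the same way (reusing the `struct l` above; again my reconstruction, not Drepper's actual code), this second ("Addnext0") loop would look roughly like this:

```c
/* "Addnext0": walk the array; each element additionally reads the value
   of the *next* element and adds it to its own stored value, so every
   iteration touches the following element one step before the walk
   itself reaches it. */
void test_addnext0(struct l *head)
{
    for (struct l *p = head; p->n != NULL; p = p->n)
        p->pad += p->n->pad;
}
```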

He then compares the results of the two tests and shows that, with a working set larger than the total L1d cache but smaller than the total L2 cache, the second test performs better than the first; his explanation is that the read from the next element acts as a "forced prefetch", thus improving performance.

Now, what I don't understand is: how can that read act as a prefetch when we are not merely prefetching that line, but actually reading from it and using the data immediately? Shouldn't that read stall, just as it does when a new element is accessed in the first test? In my mind the second example looks very similar to the first, the only difference being that we store into the previous element rather than the one just accessed (and we sum the two values instead of incrementing).

For a more precise reference to the text: the test is discussed on page 22, third paragraph of the right column, and the corresponding graph is Figure 3.13 on the following page.

Finally, here is the relevant graph, cropped from the paper. The first test corresponds to the blue "Inc" line, the second to the green "Addnext0" line. For reference, the red "Follow" line performs no writes, only sequential reads.

[Figure 3.13 from the paper: the Follow, Inc and Addnext0 curves plotted against working-set size]

  • If the CPU could only fetch one cache line at a time, your reasoning would hold. But the DRAM and the CPU can serve/fetch cache lines in parallel (or interleaved), so accessing this element and the next one takes almost the same time as accessing just this element. That is the whole point of prefetching: doing it in *parallel*, in the background, with other work. This is just a simplified explanation; someone will answer properly soon, I hope :) – Margaret Bloom Mar 22 '18 at 18:11
  • @MargaretBloom: The phrase you're looking for is *pipelined* memory access. castan: modern Intel CPUs have an L1d cache that can track up to 10 cache misses in flight at once, and L2 can have 16 pending transfers with L3 (https://stackoverflow.com/questions/45783251/what-is-the-semantics-for-super-queue-and-line-fill-buffers). See also [my 2017 commentary/update on Ulrich's "What every programmer should know about memory"](https://stackoverflow.com/questions/8126311/what-every-programmer-should-know-about-memory/47714514#47714514) for some more memory-bandwidth links and info. – Peter Cordes Mar 22 '18 at 18:19
  • @MargaretBloom Interesting. But I still have to ask: if "accessing this and the next element takes almost the same time as accessing just this element", why is the second test *faster* than the first? Given your premise, it should be about as fast, not faster. We are still accessing one new element per iteration (the "current" one has already been fetched by the "previous" iteration, but the "next" one has not). – castan Mar 22 '18 at 20:55
  • @PeterCordes Thank you for the very interesting links. As of now I still cannot quite see how your points apply to the test I've posted, but hopefully going through the material will help. – castan Mar 22 '18 at 20:56
  • Agreed, Margaret's comment isn't really an answer. Demand misses trigger a HW prefetcher that keeps making requests to prefetch ahead of future demand misses. See https://stackoverflow.com/questions/20544917/prefetching-data-at-l1-and-l2 for more about the L1d and L2 HW prefetchers in modern Intel CPUs. – Peter Cordes Mar 22 '18 at 22:29
  • Update, the L2 streamer triggers on L2 *access*, not miss. Even L2 hits can trigger prefetch of following lines, so prefetch doesn't even wait for a demand miss when reading up to the end of a cached region. https://stackoverflow.com/questions/47732454/is-prefetching-triggered-by-the-stream-of-exact-addresses-or-by-the-stream-of-ca#comment82833985_47732454 – Peter Cordes Mar 22 '18 at 22:46
  • That makes perfect sense if you stop to think about it: if the prefetcher works well, there wouldn't *be* any demand misses to trigger further prefetches, would there? :) – Leeor Mar 24 '18 at 21:51
  • [The Intel Optimization Manual](https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-optimization-reference-manual.html) has the same idea (page 128). They show 2 redundant loads being used to hint the DCU prefetcher to get the next line. My guess is that reading each item's value twice is trying to do the same thing, i.e. coax the DCU prefetcher into prefetching more aggressively. – Noah Apr 16 '21 at 20:30
