In his paper "What Every Programmer Should Know About Memory", Ulrich Drepper discusses a benchmark that I cannot quite wrap my head around.
He is talking about caches and prefetching. He first shows a test that walks an array of 16-byte elements (one pointer and one 64-bit integer each; every element points to the next one, but that is not really relevant here) and, for each element, increments its value by one.
He then shows a second test that walks the same array, but this time stores into each element the sum of its own value and the next element's value.
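To make sure we are talking about the same access patterns, here is a minimal C sketch of how I understand the two loops. The struct layout, names, and driver are mine, not taken from the paper's source, so treat this as an approximation of what the tests do:

```c
#include <stdlib.h>

/* Element layout as I understand it: a next pointer plus one
   64-bit integer, i.e. 16 bytes per element on a 64-bit machine. */
struct elem {
    struct elem *n;   /* pointer to the next element            */
    long pad;         /* the 64-bit value that gets written     */
};

/* First test ("Inc", as I read it): walk the list and increment
   each element's own value. */
static void test_inc(struct elem *p)
{
    for (; p != NULL; p = p->n)
        p->pad += 1;
}

/* Second test ("Addnext0", as I read it): walk the list and store
   in each element the sum of its value and the next element's value.
   The read of p->n->pad touches the next element's cache line one
   iteration early -- this is the "forced prefetch" Drepper mentions. */
static void test_addnext0(struct elem *p)
{
    for (; p != NULL && p->n != NULL; p = p->n)
        p->pad += p->n->pad;
}

int main(void)
{
    enum { N = 1 << 20 };                 /* arbitrary example size */
    struct elem *a = malloc(N * sizeof *a);
    if (a == NULL)
        return 1;

    /* Link the elements sequentially, as in the sequential-access runs. */
    for (size_t i = 0; i < N; ++i) {
        a[i].n   = (i + 1 < N) ? &a[i + 1] : NULL;
        a[i].pad = (long)i;
    }

    test_inc(a);
    test_addnext0(a);

    free(a);
    return 0;
}
```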
He then compares the data from the two tests and shows that, for working sets larger than the L1d cache but smaller than the L2 cache, the second test performs better than the first. His explanation is that the read from the next element acts as a "forced prefetch", thus improving performance.
Now, what I don't understand is this: how can that read act as a prefetch when we are not just prefetching the line but actually reading from it and using the data immediately afterwards? Shouldn't that read stall, just as it does when a new element is accessed in the first test? In my mind, the second test looks very similar to the first, the only differences being that we store into the previous element rather than the current one (and we sum the two values instead of incrementing).
For a precise reference to the text: the test is discussed on page 22, third paragraph of the right column, and the corresponding graph is Figure 3.13 on the following page.
Finally, here is the relevant graph, cropped from the paper. The first test corresponds to the blue "Inc" line, the second to the green "Addnext0" line. For reference, the red "Follow" line performs no writes, only sequential reads.