
I'm reading about cache blocking on this intel page.

It says:

Blocking is a well-known optimization technique that can help avoid memory bandwidth bottlenecks in a number of applications. The key idea behind blocking is to exploit the inherent data reuse available in the application by ensuring that data remains in cache across multiple uses.

It provides an example:

for (body1 = 0; body1 < NBODIES; body1 ++) {
   for (body2=0; body2 < NBODIES; body2++) {
     OUT[body1] += compute(body1, body2);
   }
}

In this example, data (body2) is streamed from memory. Assuming NBODIES is large, we would have no reuse in cache. This application is memory bandwidth bound. The application will run at the speed of memory rather than the speed of the CPU, which is less than optimal.

which it optimises to:

for (body2 = 0; body2 < NBODIES; body2 += BLOCK) {
   for (body1=0; body1 < NBODIES; body1 ++) {
      for (body22=0; body22 < BLOCK; body22 ++) {
         OUT[body1] += compute(body1, body2 + body22);
      }
   }
}

The blocked code ... is obtained by splitting the body2 loop into an outer loop iterating over bodies in multiples of BLOCK, and an inner body22 loop iterating over elements within the block, and interleaving the body1 and body2 loops. This code reuses a set of BLOCK body2 values across multiple iterations of the body1 loop. If BLOCK is chosen such that this set of values fits in cache, memory traffic is brought down by a factor of BLOCK.

Honestly, I don't understand the accompanying explanations.

I also don't understand how memory traffic is brought down, and why the blocked example would be faster.

Please can someone help explain cache blocking and how it actually speeds up loops?


Wikipedia similarly has an example:

for (i=0; i<N; ++i) {
  ...
}

can be blocked with a block size B by replacing it with

for (j=0; j<N; j+=B) {
  for (i=j; i<min(N, j+B); ++i) {
    ....
  }
}

I have no idea why this would be faster.

Why does this take advantage of the cache?

theonlygusti

1 Answer


That Intel paper is bad because it does not make clear the association between the index body2 and where the data is located in memory, or the association between body1 and data in memory. The idea is that OUT[body1] is going to use multiple elements from the same cache block for several consecutive values of body1. However, those uses of consecutive values of body1 do not occur one after the other. They are spread out, because the entire loop on body2 is executed between them. Presumably, evaluating compute(body1, body2) accesses various memory locations depending on the value of body2.

The notion is that, over the course of the body2 loop, so many different places in memory are accessed that the cache block holding OUT[body1] is evicted from cache. So, when body1 is incremented to the next value, even though OUT[body1] is still in the same cache block, that block is no longer in cache and has to be loaded again. That is wasteful.
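
For concreteness (assuming 64-byte cache lines and 8-byte doubles, which are typical but not stated in the answer), eight consecutive OUT elements share one cache block, so evicting that block between consecutive body1 values forces up to eight reloads of the same line:

    #include <stdio.h>
    #include <stddef.h>

    /* Assuming 64-byte cache lines and 8-byte doubles (typical, but
       hardware-dependent): the byte offset of OUT[i] divided by the
       line size tells you which line of the array the element is in. */
    static size_t line_of(size_t index, size_t elem_size, size_t line_size) {
        return (index * elem_size) / line_size;
    }

    int main(void) {
        for (size_t body1 = 0; body1 < 16; body1++)
            printf("OUT[%2zu] -> line %zu of OUT\n", body1,
                   line_of(body1, sizeof(double), 64));
        return 0;
    }

Running this prints line 0 for OUT[0] through OUT[7] and line 1 for OUT[8] through OUT[15]: the first eight body1 values all touch the same cache block.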

Changing the source code to:

for (body2 = 0; body2 < NBODIES; body2 += BLOCK) {
   for (body1 = 0; body1 < NBODIES; body1++) {
      for (body22 = 0; body22 < BLOCK; body22++) {
         OUT[body1] += compute(body1, body2 + body22);
      }
   }
}

addresses that problem by processing fewer instances of compute(body1, something) before changing body1. If BLOCK is set to be small enough, the cache block associated with OUT[body1] is still in cache and does not need to be reloaded from memory. Thus, during the inner loops:

   for (body1 = 0; body1 < NBODIES; body1++) {
      for (body22 = 0; body22 < BLOCK; body22++) {
         OUT[body1] += compute(body1, body2 + body22);
      }
   }

each cache block for OUT[body1] is only loaded from memory once and written to memory once.

Of course, these inner loops are inside a new outer loop on body2, so the inner loops will be executed NBODIES/BLOCK times (the paper neglects the partial block that occurs if NBODIES is not exactly divisible by BLOCK, and I will too). So the cache blocks of OUT[body1] still have to be reloaded NBODIES/BLOCK times, but that is better than reloading them NBODIES times, as occurs in the original code.
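
To see that the blocked version really does the same work, just in a different order, here is a self-contained sketch of both loop orderings. compute() is a hypothetical stand-in, since the Intel page never defines it; the sizes are illustrative:

    #include <stdio.h>

    enum { NBODIES = 1024, BLOCK = 64 };   /* illustrative sizes */

    static double POS[NBODIES];            /* hypothetical per-body data */
    static double OUT[NBODIES];

    static double compute(int body1, int body2) {
        return POS[body1] * POS[body2];    /* placeholder pairwise term */
    }

    static void plain(void) {
        for (int body1 = 0; body1 < NBODIES; body1++)
            for (int body2 = 0; body2 < NBODIES; body2++)
                OUT[body1] += compute(body1, body2);
    }

    static void blocked(void) {
        /* Only BLOCK body2 values are touched between consecutive
           body1 iterations, so OUT's cache block survives. */
        for (int body2 = 0; body2 < NBODIES; body2 += BLOCK)
            for (int body1 = 0; body1 < NBODIES; body1++)
                for (int body22 = 0; body22 < BLOCK; body22++)
                    OUT[body1] += compute(body1, body2 + body22);
    }

    int main(void) {
        for (int i = 0; i < NBODIES; i++) POS[i] = 1.0;

        plain();
        double a = OUT[NBODIES - 1];

        for (int i = 0; i < NBODIES; i++) OUT[i] = 0.0;
        blocked();
        double b = OUT[NBODIES - 1];

        /* Both orderings perform exactly the same additions. */
        printf("plain: %g  blocked: %g\n", a, b);
        return 0;
    }

Both versions print the same sums; the speed difference on a real machine depends on NBODIES being large relative to cache and on what compute() actually reads.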

Eric Postpischil