I'm reading about cache blocking on this Intel page.
It says:
Blocking is a well-known optimization technique that can help avoid memory bandwidth bottlenecks in a number of applications. The key idea behind blocking is to exploit the inherent data reuse available in the application by ensuring that data remains in cache across multiple uses.
It provides an example:
    for (body1 = 0; body1 < NBODIES; body1++) {
        for (body2 = 0; body2 < NBODIES; body2++) {
            OUT[body1] += compute(body1, body2);
        }
    }
In this example, data (body2) is streamed from memory. Assuming NBODIES is large, we would have no reuse in cache. This application is memory bandwidth bound. The application will run at the speed of the memory-to-CPU transfers, which is less than optimal.
which the page optimises to:
    for (body2 = 0; body2 < NBODIES; body2 += BLOCK) {
        for (body1 = 0; body1 < NBODIES; body1++) {
            for (body22 = 0; body22 < BLOCK; body22++) {
                OUT[body1] += compute(body1, body2 + body22);
            }
        }
    }
The blocked code ... is obtained by splitting the body2 loop into an outer loop iterating over bodies in multiples of BLOCK, and an inner body22 loop iterating over elements within the block, and interleaving the body1 and body2 loops. This code reuses a set of BLOCK body2 values across multiple iterations of the body1 loop. If BLOCK is chosen such that this set of values fits in cache, memory traffic is brought down by a factor of BLOCK.
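To make sure I'm at least reading the code correctly, here is a self-contained version I put together to experiment with. The compute function, the sizes, and the timing harness are all my own stand-ins, not from the Intel page:

    #include <stdio.h>
    #include <time.h>

    #define NBODIES 8192   /* must be a multiple of BLOCK */
    #define BLOCK   1024

    static double POS[NBODIES];
    static double OUT[NBODIES];

    /* Dummy pairwise interaction; the real one would presumably be
       heavier, but this at least reads POS[body2] like the example. */
    static double compute(int body1, int body2)
    {
        return POS[body1] * POS[body2];
    }

    int main(void)
    {
        for (int i = 0; i < NBODIES; i++)
            POS[i] = i * 0.5;

        /* Naive version: body2 streams through all of POS for each body1. */
        clock_t t0 = clock();
        for (int body1 = 0; body1 < NBODIES; body1++)
            for (int body2 = 0; body2 < NBODIES; body2++)
                OUT[body1] += compute(body1, body2);
        clock_t t1 = clock();

        for (int i = 0; i < NBODIES; i++)
            OUT[i] = 0.0;

        /* Blocked version: one BLOCK-sized slice of POS per outer pass. */
        for (int body2 = 0; body2 < NBODIES; body2 += BLOCK)
            for (int body1 = 0; body1 < NBODIES; body1++)
                for (int body22 = 0; body22 < BLOCK; body22++)
                    OUT[body1] += compute(body1, body2 + body22);
        clock_t t2 = clock();

        printf("naive:   %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("blocked: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }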
Honestly, I don't understand the accompanying explanations.
I also don't understand how memory traffic is brought down, and why the blocked example would be faster.
Please can someone help explain cache blocking and how it actually speeds up loops?
Wikipedia has a similar example:
for (i=0; i<N; ++i) { ... }
can be blocked with a block size B by replacing it with
    for (j = 0; j < N; j += B) {
        for (i = j; i < min(N, j + B); ++i) {
            ...
        }
    }
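My best guess is that the ... stands for a body that keeps re-reading the same data, so that blocking lets the later uses hit in cache. Something like this, where A, sum1, and sum2 are names I made up rather than anything from Wikipedia:

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    double sum1 = 0.0, sum2 = 0.0;
    for (int j = 0; j < N; j += B) {
        /* first pass pulls the block A[j .. j+B) into cache */
        for (int i = j; i < MIN(N, j + B); ++i)
            sum1 += A[i];
        /* second pass re-reads the same block while it is
           (I assume) still resident in cache */
        for (int i = j; i < MIN(N, j + B); ++i)
            sum2 += A[i] * A[i];
    }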
Even so, I have no idea why this would be faster.
Why does this take advantage of the cache?