Cache thrashing, general help in understanding

Question

I am trying to understand cache thrashing, is the following text correct?

Taking the code below to be an example.

long max = 1024*1024;
long a(max), b(max), c(max), d(max), e(max); 
for(i = 1; i < max; i++) 
    a(i) = b(i)*c(i) + d(i)*e(i);

The ARM Cortex A9 is four way set associative and each cache line is 32 bytes, total cache is 32kb. In total there are 1024 cache lines. In order to carry out the above calculation one cache line must be displaced. When a(i) is to be calculated, b(i) will be thrown out. Then as the loop iterates, b(i) is needed and so another vector is displaced. In the example above, there is no cache reuse.

To solve this problem, you can introduce padding between the vectors in order to space out their beginning address. Ideally, each padding should be at least the size of a full cache line.

The above problem can be solved as such

long a(max), pad1(256), b(max), pad2(256), c(max), pad3(256), d(max), pad4(256), e(max)

For multidimensional arrays, it is enough to make the leading dimension an odd number.

Any help if the above is true or where I have made an error.

Thanks.

score 3 · Answer 1 · edited May 23 '17 at 12:30

Each vector needs 8MB of memory(1024 * 1024 * 8B, assuming 8B for long). So if these vectors are contiguously allocated, then a(i), b(i), c(i), d(i) and e(i) will map to the same cache set(not same cache line always, as it is 2 way). Nevertheless there can only be two of them at a time in the cache set. So when cache lines containing d(i) and e(i) are brought in cache, cache lines containing b(i) and c(i) will be evicted.

If you are sure that these vectors are contiguously allocated, you can just pad them with one cache line size i.e. 32B. That will do the trick. So a(i), b(i), c(i), d(i) and e(i) will be on contiguous cache sets. And after accessing 4 elements of a vector, each cache line will be evicted. This is because each cache line contains 4 long variables(a(0), a(1), a(2), a(3) will on the same cache line, as will be a(4), a(5), a(6), a(7)).

So you declare your vectors like

long a(max),pad1(32),b(max),pad2(32),c(max),pad3(32),d(max),pad4(32),e(max)

For related discussion, you can follow this link

why-is-one-loop-so-much-slower-than-two-loops

I understand that the equation to find out to which set a memory address goes to is Set Number = (memory address) mod (cache size) So that would mean: a[0] =(1024*4) mod 32Kb = 0. Set 0 b[0] =(2048*4)+(32*4) mod 32Kb = 0. Set 0 And so on.. I guess I need to force them onto a different set? Otherwise there are not enough ways to keep all the data in cache? — user1876942, Feb 17 '14 at 06:44

Cache thrashing, general help in understanding

1 Answers1

Linked