
Details about the X5650 processor are available at https://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20X5650%20-%20AT80614004320AD%20(BX80614X5650).html

Important notes: L3 cache size: 12288 KB; cache line size: 64 bytes.

Consider the following two functions, which each increment the values in an array by 100.

void incrementVector1(INT4* v, int n) {
    for (int k = 0; k < 100; ++k) {
        for (int i = 0; i < n; ++i) {
            v[i] = v[i] + 1;
        }
    }
}

void incrementVector2(INT4* v, int n) {
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < 100; ++k) {
            v[i] = v[i] + 1;
        }
    }
}

The following data, collected using the perf utility, captures runtime information for executing each of these functions on the Intel Xeon X5650 processor for various data sizes. In this data:
  • the program vector1.bin executes the function incrementVector1;
  • the program vector2.bin executes the function incrementVector2;
  • the programs take a command-line argument which sets the value of n (a minimal driver sketch follows this list);
  • both programs begin by allocating an array of size n and initializing all elements to 0;
  • LLC-loads means “last level cache loads”, the number of accesses to L3;
  • LLC-load-misses means “last level cache misses”, the number of L3 cache misses.
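The source of the driver programs is not shown in the post; the sketch below is one minimal way vector1.bin might look (the main function, the INT4 typedef, and the use of calloc/atoi are assumptions for illustration, not the original code):

#include <stdio.h>
#include <stdlib.h>

typedef int INT4;                        /* assumption: INT4 is a 4-byte integer type */

void incrementVector1(INT4* v, int n);   /* as defined above */

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int n = atoi(argv[1]);                   /* command-line argument sets n */
    INT4* v = calloc(n, sizeof(INT4));       /* allocate and zero-initialize the array */
    if (v == NULL) return 1;
    incrementVector1(v, n);                  /* vector2.bin would call incrementVector2 */
    printf("%d\n", v[0]);                    /* consume the result */
    free(v);
    return 0;
}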
Runtime performance of vector1.bin.
Performance counter stats for ’./vector1.bin 1000000’:
230,070      LLC-loads
3,280        LLC-load-misses           #    1.43% of all LL-cache references
0.383542737 seconds time elapsed
Performance counter stats for ’./vector1.bin 3000000’:
669,884      LLC-loads
242,876      LLC-load-misses           #   36.26% of all LL-cache references
1.156663301 seconds time elapsed
Performance counter stats for ’./vector1.bin 5000000’:
1,234,031    LLC-loads
898,577      LLC-load-misses           #   72.82% of all LL-cache references
1.941832434 seconds time elapsed
Performance counter stats for ’./vector1.bin 7000000’:
1,620,026      LLC-loads
1,142,275      LLC-load-misses           #   70.51% of all LL-cache references
2.621428714 seconds time elapsed
Performance counter stats for ’./vector1.bin 9000000’:
2,068,028      LLC-loads
1,422,269      LLC-load-misses           #   68.77% of all LL-cache references
3.308037628 seconds time elapsed
Runtime performance of vector2.bin.
Performance counter stats for ’./vector2.bin 1000000’:
16,464     LLC-loads
1,168      LLC-load-misses            #   7.049% of all LL-cache references
0.319311959 seconds time elapsed
Performance counter stats for ’./vector2.bin 3000000’:
42,052      LLC-loads
17,027      LLC-load-misses           #   40.49% of all LL-cache references
0.954854798 seconds time elapsed
Performance counter stats for ’./vector2.bin 5000000’:
63,991      LLC-loads
38,459      LLC-load-misses           #   60.10% of all LL-cache references
1.593526338 seconds time elapsed
Performance counter stats for ’./vector2.bin 7000000’:
99,773      LLC-loads
56,481      LLC-load-misses           #   56.61% of all LL-cache references
2.198810471 seconds time elapsed
Performance counter stats for ’./vector2.bin 9000000’:
120,456     LLC-loads
76,951      LLC-load-misses           #   63.88% of all LL-cache references
2.772653964 seconds time elapsed

I have two questions:

  1. Consider the cache miss rates for vector1.bin. Between the vector sizes 1000000 and 5000000, the cache miss rate drastically increases. What is the cause of this increase in cache miss rate?
  2. Consider the cache miss rates for both programs. Notice that the miss rate between the two programs is roughly equal for any particular array size. Why is that?
  • next time [edit] your question to fix formatting instead of [re-posting a question](https://stackoverflow.com/q/65951959). – Peter Cordes Jan 29 '21 at 19:31
  • Same functions as [Cache misses of an array and nested loop ordering](https://stackoverflow.com/q/65926332) - see my comments on Craig's answer there. But since you're counting only LLC (last-level cache) misses, 1000000 is ~30% of 12MiB, while 5000000 is 159% of the 12MiB L3 cache size. So your cache miss rate increases when your working set size exceeds cache size. Nothing special there. (but also, for smaller sizes HW prefetch reduces the number of L2 misses, reducing LLC-loads significantly.) – Peter Cordes Jan 29 '21 at 19:37
  • To answer part 2, how did you compile? What exact GCC version and options, assuming you were using GCC? Note that `gcc -O3` will turn the inner loop into `v[i] += 100` for Vector2, unless you make it `volatile`. – Peter Cordes Jan 29 '21 at 19:39
  • Hi, you said the cache miss rate increases when the working set size exceeds the cache size. However, how do you explain the increase when the vector size is between 1000000 and 3000000, where the cache capacity could still hold the whole array? – Adam Jan 30 '21 at 08:52
  • Also, between 1000000 and 3000000 I would expect only compulsory misses in L3, so why does the cache miss rate keep going up? Once the vector size exceeds the capacity of the cache, capacity misses would occur. Could you explain in more detail why the miss rate keeps climbing between 1000000 and 3000000, and also after 3000000? – Adam Jan 30 '21 at 08:58
  • `3000000*4 / (12*1024^2)` is 95% of L3 cache capacity. Other things have some memory footprint, like code, page tables, the OS's interrupt handlers, etc. Remember that L3 cache is shared by all cores, and Intel used inclusive caches from Nehalem up until Broadwell-Xeon. (Your Westmere-EP is 2nd-gen Nehalem). (For anything to be cached in the private caches of another core, it has to be valid in L3 tags.) So it's not too surprising that miss rate starts to climb as you approach the size limit. Any miss is likely to randomly evict some more array data, kind of amplifying any interference. – Peter Cordes Jan 30 '21 at 08:58
  • I mean, considering only the information I gave: you said a miss is likely to randomly evict some more array data; how does that happen? As I understand it, the first time L3 loads blocks from memory those are cold misses. Even as the array approaches the size limit, the cache could still hold the whole array, since the array does not exceed the cache capacity. I just want to know why the miss rate starts to climb as the size approaches the limit, and how you can judge that from the information I provided. – Adam Jan 30 '21 at 09:34
  • Like I said, because the array isn't the only thing L3 has to hold. Any disturbance (from interrupt handlers on this or other cores, for example) gets magnified by pseudo-LRU eviction for the conflict miss. I'm not talking about the mandatory misses during startup. – Peter Cordes Jan 30 '21 at 16:40
  • I have another question: how would you assess the temporal and spatial locality of these two functions? In my view, the temporal and spatial locality of v2 is better than that of v1. What do you think? – Adam Feb 01 '21 at 17:45
  • Yes obviously doing all the work in one pass over the array has much better locality. It can actually be slower, though, if a naive compiler doesn't keep `v[i]` in a register for the whole inner loop. (Or if you force it not to with `volatile`). If you have the bandwidth, then v1 avoids bottlenecking on store-forwarding latency. But if you let the compiler optimize properly, v2 is vastly better, essentially just doing `v[i] += 100` in one pass. Again, see comments on [Cache misses when accessing an array in nested loop](https://stackoverflow.com/posts/comments/116565493) – Peter Cordes Feb 01 '21 at 18:26
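To make the working-set arithmetic from the comments above explicit, here is a small stand-alone check (mine, not part of the original post), assuming 4-byte INT4 elements and the 12288 KB L3 noted at the top:

#include <stdio.h>

/* Array footprint for each tested n, as a fraction of the 12288 KB L3.
   Assumes 4-byte elements; the n values are the ones measured with perf above. */
int main(void) {
    const double l3_bytes = 12288.0 * 1024.0;
    const long sizes[] = {1000000, 3000000, 5000000, 7000000, 9000000};
    for (int i = 0; i < 5; ++i) {
        double bytes = sizes[i] * 4.0;
        printf("n = %ld: %.1f MiB = %.0f%% of L3\n",
               sizes[i], bytes / (1024.0 * 1024.0), 100.0 * bytes / l3_bytes);
    }
    return 0;   /* prints roughly 32%, 95%, 159%, 222%, 286% */
}

So 1000000 elements occupy roughly a third of the L3, 3000000 elements nearly fill it (and the array is not the only thing the shared, inclusive L3 has to hold), and the three larger sizes cannot fit at all.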
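To illustrate the point in the comments about gcc -O3 and volatile, here is a sketch of the transformation being described (not actual compiler output; the function names and the INT4 typedef are assumptions):

typedef int INT4;   /* assumed 4-byte element type, as in the sketch above */

/* With the data marked volatile, the compiler must emit all 100 loads and
   stores per element, so the repeated store/reload of the same address
   (store-forwarding latency) stays on the critical path: */
void incrementVector2_volatile(volatile INT4* v, int n) {
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < 100; ++k)
            v[i] = v[i] + 1;
}

/* Without volatile, an optimizing compiler can legally collapse the inner
   loop, leaving a single pass over the array, one load and one store per
   element: */
void incrementVector2_collapsed(INT4* v, int n) {
    for (int i = 0; i < n; ++i)
        v[i] += 100;
}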

0 Answers