
I am reading "Pro .NET Benchmarking" by Andrey Akinshin and one thing puzzles me (p. 536): the explanation of how cache associativity impacts performance. In a test, the author used three square arrays of ints, 1023x1023, 1024x1024, and 1025x1025, and observed that accessing the first column was slower in the 1024x1024 case.
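For concreteness, here is a minimal C# sketch of such a measurement. This is not the book's actual benchmark code; only the array sizes and the first-column walk come from the description above, and `ColumnWalk`/`TimeFirstColumn` are made-up names:

```csharp
using System;
using System.Diagnostics;

class ColumnWalk
{
    // Time repeated walks down the first column of an N x N int array.
    static long TimeFirstColumn(int n, int repeats)
    {
        int[,] a = new int[n, n];      // rectangular array: contiguous, row-major
        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int rep = 0; rep < repeats; rep++)
            for (int row = 0; row < n; row++)
                sum += a[row, 0];      // consecutive accesses are n * 4 bytes apart
        sw.Stop();
        if (sum == 42) Console.Write(""); // keep `sum` observable so the loop isn't elided
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        foreach (int n in new[] { 1023, 1024, 1025 })
            Console.WriteLine($"{n}x{n}: {TimeFirstColumn(n, 10_000)} ms");
    }
}
```

On the hardware described below, the n=1024 case is the one expected to stand out.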

The author explained (background info: the CPU is an Intel with a 32 KB, 8-way associative L1 cache):

> When N=1024, this difference is exactly 4096 bytes; it equals the critical stride value. This means that all elements from the first column match the same eight cache lines of L1. We don’t really have performance benefits from the cache because we can’t use it efficiently: we have only 512 bytes (8 cache lines * 64-byte cache line size) instead of the original 32 kilobytes. When we iterate the first column in a loop, the corresponding elements pop each other from the cache. When N=1023 and N=1025, we don’t have problems with the critical stride anymore: all elements can be kept in the cache, which is much more efficient.
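Spelling out the arithmetic behind that quote: with the question's figures (32 KB, 8-way, 64-byte lines), the cache has 32768 / (8 * 64) = 64 sets and a critical stride of 32768 / 8 = 4096 bytes. A small sketch (assuming the standard set-indexed mapping; this is not code from the book) that counts how many L1 sets the first column touches:

```csharp
using System;
using System.Collections.Generic;

class CriticalStride
{
    // L1 parameters from the question: 32 KB, 8-way, 64-byte lines => 64 sets.
    const int CacheSize = 32 * 1024;
    const int Ways = 8;
    const int LineSize = 64;
    const int Sets = CacheSize / (Ways * LineSize);   // 64
    const int Stride = CacheSize / Ways;              // 4096 = the critical stride

    // L1 set that the byte offset of element [row, 0] falls into
    // (row-major N x N array of 4-byte ints).
    static int SetOf(int row, int n) => row * n * sizeof(int) / LineSize % Sets;

    static void Main()
    {
        foreach (int n in new[] { 1023, 1024, 1025 })
        {
            var used = new HashSet<int>();
            for (int row = 0; row < n; row++) used.Add(SetOf(row, n));
            Console.WriteLine($"n={n}: first column touches {used.Count} of {Sets} sets");
        }
        // n=1024: row stride = 4096 bytes = Stride, so every row lands in set 0 --
        //         only 8 lines (512 bytes) of the cache are usable for the column.
        // n=1023 and n=1025: relative to 4096, the offset drifts by -4/+4 bytes
        //         per row, so the column spreads across all 64 sets.
    }
}
```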

So it looks like the penalty comes from somehow shrinking the cache, just because the main memory cannot be mapped to the full cache.

It strikes me as odd; after reading the wiki page I would say the performance penalty comes from resolving address conflicts. Since each row can potentially be mapped into the same cache line, it is conflict after conflict, and the CPU has to resolve those -- it takes time.

Thus my question: what is the real nature of the performance problem here? Is the accessible memory size of the cache effectively lower, or is the entire cache available but the CPU spends more time resolving mapping conflicts? Or is there some other reason?

greenoldman
  • All the accessed items map to the same set of the cache. An item can go into any of the 8 cache lines in the set, but since only **one** set, out of 64, is used, only 8 cache lines can hold data from the array (instead of the 64*8 = 512 cache lines). So it is as if the cache were only 512 bytes. This is called a [conflict miss](https://stackoverflow.com/questions/33314115/whats-the-difference-between-conflict-miss-and-capacity-miss). Remember: the cache is *not* fully associative; the same address always maps to the same set. – Margaret Bloom Aug 04 '19 at 16:37
  • Also https://stackoverflow.com/questions/6060985/why-is-there-huge-performance-hit-in-2048x2048-versus-2047x2047-array-multiplica?noredirect=1&lq=1 – Leeor Aug 04 '19 at 20:52
  • See https://stackoverflow.com/questions/56909987/slowdown-when-accessing-data-at-page-boundaries – Rick James Aug 17 '19 at 21:43

1 Answer


Caching is a layer between two other layers. In your case, between the CPU and RAM. At its best, the CPU rarely has to wait for something to be fetched from RAM. At its worst, the CPU usually has to wait.

The 1024 example hits a bad case. For that entire column, all words requested from RAM land in the same cell of the cache (or the same 2 cells if using a 2-way associative cache, etc.).

Meanwhile, the CPU does not care -- it asks the cache for a word from memory; the cache either has it (fast access) or needs to reach into RAM (slow access) to get it. And RAM does not care -- it responds to requests, whenever they come.

Back to 1024. Look at the layout of that array in memory. The cells of a row are in consecutive words of RAM; when one row is finished, the next row starts. With a little bit of thought, you can see that consecutive cells in a column have addresses differing by 1024*B, where B is 4 or 8 (or whatever the size of a cell). That is a power of 2.
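To make that stride concrete, here is the row-major offset calculation the paragraph describes (4-byte cells assumed, as in the question; `OffsetOf` is a made-up helper):

```csharp
using System;

class RowMajorStride
{
    // Byte offset of element [row, col] in a row-major N x N array of 4-byte ints.
    static long OffsetOf(long row, long col, long n) => (row * n + col) * 4;

    static void Main()
    {
        // Consecutive cells of column 0 are exactly 1024 * 4 = 4096 bytes apart:
        Console.WriteLine(OffsetOf(1, 0, 1024) - OffsetOf(0, 0, 1024)); // 4096, a power of 2
    }
}
```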

Now let's look at the relatively trivial architecture of a cache. (It is 'trivial' because it needs to be fast and easy to implement.) It simply takes several bits out of the address to form the address in the cache's "memory".

Because of the power of 2, those bits will always be the same -- hence the same slot is accessed. (I left out a few details, like how many bits are needed, hence the size of the cache, 2-way, etc., etc.)
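A minimal sketch of that "pick a few bits out of the address" indexing, using the question's L1 figures (64-byte lines, 64 sets); the exact bit positions are an assumption derived from those figures:

```csharp
using System;

class SetIndexBits
{
    // Bits 0-5: byte offset within a 64-byte line; bits 6-11: which of the 64 sets;
    // the remaining bits form the tag, compared against the 8 ways of that set.
    static int CacheSet(ulong address) => (int)((address >> 6) & 63);

    static void Main()
    {
        // Addresses a multiple of 4096 (= 64 sets * 64-byte lines) apart share
        // bits 6-11, so they all index the same set:
        for (ulong a = 0; a <= 4 * 4096; a += 4096)
            Console.WriteLine($"address {a,6}: set {CacheSet(a)}");  // always set 0
    }
}
```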

A cache is useful when the process above it (CPU) fetches an item (word) more than once before that item gets bumped out of cache by some other item needing the space.

Note: This is talking about the CPU->RAM cache, not disk controller caching, database caches, web site page caches, etc, etc; they use more sophisticated algorithms (often hashing) instead of "picking a few bits out of an address".

Back to your Question...

> So it looks like the penalty comes from somehow shrinking the cache, just because the main memory cannot be mapped to the full cache.

There are conceptual problems with that quote.

  • Main memory is not "mapped to a cache"; see virtual versus real addresses.
  • The penalty comes when the cache does not have the desired word.
  • "shrinking the cache" -- The cache is a fixed size, based on the hardware involved.

Definition: In this context, a "word" is a consecutive string of bytes from RAM. It is always(?) a power of 2 in size and positioned at some multiple of that size in the real address space. What a "word" is for caching depends on the vintage of the CPU, which level of cache, etc. 4-, 8-, and 16-byte words can probably be found today. Again, the power-of-2 size and positioned-at-multiple placement are simple optimizations.

Back to your 1K*1K array of, say, 4-byte numbers. That adds up to 4 MB, plus or minus (for 1023, 1025). If you have 8 MB of cache, the entire array will eventually get loaded, and further actions on the array will be faster due to being in the cache. But if you have, say, 1 MB of cache, some of the array will get into the cache, then be bumped out -- repeatedly. It might not be much better than if you had no cache.
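For reference, the footprint arithmetic (4-byte ints assumed):

```csharp
using System;

class Footprint
{
    static void Main()
    {
        // Total size of each array of 4-byte ints, versus typical cache sizes.
        foreach (long n in new long[] { 1023, 1024, 1025 })
            Console.WriteLine($"{n} x {n} ints = {n * n * 4 / (1024.0 * 1024.0):F2} MB");
        // All three are ~4 MB: far larger than a 32 KB L1, in the range of L2/L3,
        // so whole-array reuse depends on which level of the hierarchy it fits in.
    }
}
```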

Rick James
  • Thank you for your answer; however (don't get me wrong), since the L1 cache is 32 KB, your answer does not explain the difference: why 1024x1024 in particular is slower than the rest. After all, any of those arrays fits into the cache entirely. So the key point here is the associativity of the cache, because it is the source of that difference in speed. – greenoldman Aug 16 '19 at 16:29
  • @greenoldman - How wide is a "word" in your cache? 16 bytes would include INTs from 4 columns. Is it N-way associative? Any chance the issue comes from the L2 cache? (And other details will help give the answer.) – Rick James Aug 16 '19 at 23:14
  • As for the width of the word, I don't know. L1 in this example is 8-way associative. L2 might be involved, but I doubt it is the reason; luckily, the author presented results for the 1023, 1024, and 1025 sizes, so the slower case sits right in the middle, which rules out speculation about the array being too big, too small, etc. So it is true that this is a "side effect" of aligning to the critical stride of L1. – greenoldman Aug 17 '19 at 10:59