I am reading "Pro .NET Benchmarking" by Andrey Akinshin and one thing puzzles me (p.536) -- explanation how cache associativity impacts performance. In a test author used 3 square arrays 1023x1023, 1024x1024, 1025x1025 of ints
and observed that accessing first column was slower for 1024x1024 case.
Author explained (background info, CPU is Intel with L1 cache with 32KB memory, it is 8-way associative):
When N=1024, this difference is exactly 4096 bytes; it equals the critical stride value. This means that all elements from the first column match the same eight cache lines of L1. We don’t really have performance benefits from the cache because we can’t use it efficiently: we have only 512 bytes (8 cache lines * 64-byte cache line size) instead of the original 32 kilobytes. When we iterate the first column in a loop, the corresponding elements pop each other from the cache. When N=1023 and N=1025, we don’t have problems with the critical stride anymore: all elements can be kept in the cache, which is much more efficient.
So it looks like the penalty comes from somehow shrinking the cache just because the main memory cannot be mapped to full cache.
It strikes me as odd, after reading wiki page I would say the performance penalty comes from resolving address conflicts. Since each row can be potentially mapped into the same cache line, it is conflict after conflict, and CPU has to resolve those -- it takes time.
Thus my question, what is the real nature of performance problem here. Accessible memory size of cache is lower, or entire cache is available but CPU spends more time in resolving conflicts with mapping. Or there is some other reason?