
I'm trying to understand what creates cache misses and, eventually, how much they cost in terms of performance in our application. But with the tests I'm running now, I'm quite confused.

Assuming that my L3 cache is 4MB, and my LineSize is 64 bytes, I would expect that this loop (loop 1):

int8_t aArr[SIZE_L3];
int i;
for ( i = 0; i < (SIZE_L3); ++i )
{
  ++aArr[i];
}

...and this loop (loop 2):

int8_t aArr[SIZE_L3];
int i;
for ( i = 0; i < (SIZE_L3 / 64u); ++i )
{
  ++aArr[i * 64];
}

give roughly the same number of Last Level Cache Misses, but a different number of Inclusive Last Level Cache References.
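To spell out the arithmetic behind that expectation: with SIZE_L3 = 4 MiB and 64-byte lines, both loops walk the same 4 * 1024^2 / 64 = 65,536 distinct cache lines, so I'd expect them to miss about the same number of times; but loop 1 performs 4,194,304 accesses while loop 2 performs only 65,536, so I'd expect loop 1 to generate far more references.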

However the numbers that the profiler of Visual Studio 2013 gives me are unsettling.

With loop 1:

  • Inclusive Last Level Cache References: 53,000
  • Last Level Cache Misses: 17,000

With loop 2:

  • Inclusive Last Level Cache References: 69,000
  • Last Level Cache Misses: 35,000

I have tested this with a dynamically allocated array, and on a CPU with a larger L3 cache (8 MB), and I get a similar pattern in the results.

Why don't I get the same number of cache misses, and why do I have more references in the shorter loop?

Vaillancourt
  • What exact hardware are you testing on? I assume it's some kind of recent Intel, since you mention L3 cache and visual-studio. But some AMD Piledriver chips have L3 cache, and probably have different prefetch algorithms. Even the version of Intel uarch matters, maybe significantly here since you're testing with exactly the L3 size. Intel IvyBridge introduced [an adaptive replacement policy for L3 cache](http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/) to mitigate pollution of a hot working set in code that also loops over a giant data set with low reuse. – Peter Cordes Jul 28 '16 at 08:35

2 Answers


Incrementing every byte of int8_t aArr[SIZE_L3]; separately is slow enough that the hardware prefetchers are probably able to keep up pretty well a lot of the time. Out-of-order execution can keep a lot of read-modify-writes in flight at once to different addresses, but the best case is still one byte of stores per clock. (You bottleneck on store-port uops, assuming this was a single-threaded test on a system without a lot of other demands for memory bandwidth.)
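As a rough back-of-the-envelope (the 3 GHz clock is an assumption, not from the question): at one byte stored per clock, 4 * 1024^2 stores take about 4.2M cycles, i.e. roughly 1.4 ms per pass over the array, which gives the prefetchers plenty of time to stay ahead of the demand accesses.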

Intel CPUs have their main prefetch logic in the L2 cache (as described in Intel's optimization guide; see the tag wiki). So a successful hardware prefetch into L2 before the core issues the load means the demand access hits in L2, and the L3 cache never sees a miss from it.

John McCalpin's answer on this Intel forum thread confirms that L2 hardware prefetches are NOT counted as LLC references or misses by the normal perf events like MEM_LOAD_UOPS_RETIRED.LLC_MISS. Apparently there are OFFCORE_RESPONSE events you can look at.
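(For example, on Linux you could cross-check with the generic events, e.g. `perf stat -e LLC-loads,LLC-load-misses ./test` — those are the portable perf aliases; the exact OFFCORE_RESPONSE encodings vary by microarchitecture, so check `perf list` on your machine.)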

IvyBridge introduced next-page hardware prefetch. Intel microarchitectures before that don't cross page boundaries when prefetching, so you still get misses every 4 KiB. (And maybe TLB misses, if the OS didn't opportunistically put your memory in a 2 MiB hugepage. Speculative page walks as you approach a page boundary can probably avoid much of the delay there, though, and hardware definitely does do speculative page walks.)
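To put a rough number on that (assuming 4 KiB pages): a 4 MiB array spans 4 * 1024^2 / 4096 = 1024 pages, so on a pre-IvyBridge core each pass crosses on the order of a thousand page boundaries where the prefetcher stops and demand misses take over.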


With a stride of 64 bytes, execution can touch memory much faster than the cache / memory hierarchy can keep up. You bottleneck on L3 / main memory. Out-of-order execution can keep about the same number of read/modify/write ops in flight at once, but the same out-of-order window covers 64x more memory.


Explaining the exact numbers in more detail

For array sizes right around L3, IvyBridge's adaptive replacement policy probably makes a significant difference.

Until we know the exact uarch, and more details of the test, I can't say. It's not clear if you only ran that loop once, or if you had an outer repeat loop and those miss / reference numbers are an average per iteration.

If it's from only a single run, that's a tiny, noisy sample. I assume it was somewhat repeatable, but I'm surprised the L3 reference count was so high for the every-byte version: 4 * 1024^2 / 64 = 65536 lines, so with 53,000 references there was still an L3 reference for most of the cache lines you touched.

Of course, if you didn't have a repeat loop, and those counts include everything the code did besides the loop, maybe most of those counts came from startup / cleanup overhead in your program. (i.e. your program with the loop commented out might have 48k L3 references, IDK.)
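A minimal sketch of such a repeat loop (REPEATS, the final checksum, and the 4 MiB constant are illustrative assumptions, not from your test). Dividing the counters by REPEATS averages out the noise, and startup / cleanup overhead becomes negligible:

#include <cstdint>
#include <cstdio>

#define SIZE_L3 (4u * 1024u * 1024u)  // assumed 4 MiB L3
#define REPEATS 100u                  // average the counters over many passes

static int8_t aArr[SIZE_L3];

int main()
{
    for (unsigned r = 0; r < REPEATS; ++r)
    {
        for (unsigned i = 0; i < SIZE_L3; ++i)
        {
            ++aArr[i];  // loop 1 under test; swap in the stride-64 body for loop 2
        }
    }

    // Consume the data so the compiler can't dead-store-eliminate the loops.
    unsigned sum = 0;
    for (unsigned i = 0; i < SIZE_L3; ++i)
        sum += static_cast<uint8_t>(aArr[i]);
    std::printf("checksum: %u\n", sum);
    return 0;
}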


I have tested this with a dynamically allocated array

Totally unsurprising, since it's still contiguous.

and on a CPU that has a larger L3 cache (8MB) and I get a similar pattern in the results.

Did this test use a larger array? Or did you use a 4MiB array on a CPU with an 8MiB L3?

Peter Cordes

Your assumption that "if I skip over more elements in the array, making for fewer iterations of the loop and fewer array accesses, then I should have fewer cache misses" seems to ignore the way data gets fetched into the cache.

When you access memory, more data is kept in the cache than just the specific bytes you accessed. If I access intArray[0], then intArray[1] and intArray[2] are likely to be fetched as well at the same time. This is one of the optimizations that allow the cache to help us work faster: if I access those three memory locations in a row, it's sort of like having only one memory read to wait for.

If you increase the stride, instead accessing intArray[0], then intArray[100] and intArray[200], the data may require three separate reads, because the second and third accesses might not be in cache, resulting in cache misses.
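If you want to see this effect directly, here is a minimal sketch (the timing harness and the 4 MiB working-set size are my assumptions, and on hardware where the prefetchers keep up the gap may be smaller than this naive model predicts):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    const size_t kSize = 4u * 1024u * 1024u;  // assumed 4 MiB working set
    std::vector<int8_t> arr(kSize, 0);

    using clk = std::chrono::steady_clock;

    // Sequential: touch every byte; one fetched line serves 64 accesses.
    auto t0 = clk::now();
    for (size_t i = 0; i < kSize; ++i)
        ++arr[i];
    auto t1 = clk::now();

    // Strided: touch one byte per 64-byte line; every access needs a new line.
    for (size_t i = 0; i < kSize; i += 64)
        ++arr[i];
    auto t2 = clk::now();

    auto us = [](clk::time_point a, clk::time_point b) {
        return (long long)std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    std::printf("sequential: %lld us, strided: %lld us\n", us(t0, t1), us(t1, t2));

    // Consume the data so the compiler can't drop the loops.
    unsigned sum = 0;
    for (size_t i = 0; i < kSize; ++i)
        sum += static_cast<uint8_t>(arr[i]);
    return (int)(sum & 1);
}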

All of the exact details of your specific problem depend on your computer architecture. I would assume you are running an Intel x86-based architecture, but when we are talking about hardware at this low a level I should not assume (I think you can get Visual Studio to run on other architectures, can't you?), and I don't remember all of the specifics for that architecture anyway.

Because you generally don't know exactly what the caching system will be like on the hardware your software runs on, and because it can change over time, it is usually better to read up on caching principles in general and write code that tends to produce fewer misses. Trying to make the code perfect for the specific machine you're developing on is usually a waste of time. The exceptions are certain embedded control systems and other low-level systems whose hardware is not likely to change under you; unless that describes your work, I suggest you just read some good articles or books about CPU caches.

Loduwijk
  • As written in the question, my assumption is that I'll have the same number of cache misses, _but fewer cache references_. I've read plenty of things about cache misses, and they all seem to agree that when you access a location in memory, it will fetch and cache the whole line (64 bytes, in my case). This is why I assume I'll have fewer cache accesses in the shorter loop (because obviously I access the array less), but the same number of misses if I jump by the size of the cache line. – Vaillancourt Jul 13 '16 at 16:36
  • Oh. You might want to change your question title then. The whole time I was reading your question I was thinking of it from a point of view of cache *misses* as stated in the title, and which you did ask as well in your closing sentence. Also, since your cache line is the same size as the stride in your loop, I too think your results are odd. I probably can't be any more help here, so hopefully someone else has more insight. – Loduwijk Jul 13 '16 at 16:45
  • You may have more cache references overall for the first version, but the profiler metric you are looking at is _LLC_ references. I.e., references that missed in the higher cache levels. In your benchmark, I'd expect bytes 2 to 64 to all hit in L1 so they won't show up in that LLC stat. – BeeOnRope Jul 29 '16 at 13:42