
Consider the following programs:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long u64;

int program_1(u64* a, u64* b)
{
  const u64 lim = 50ULL * 1000 * 1000;
  // Reads arrays
  u64 sum = 0;
  for (u64 i = 0; i < lim * 100; ++i) {
    sum += a[i % lim];
    sum += b[i % lim];
  }

  printf("%llu\n", sum);
  return 0;
}


int program_2(u64* a, u64* b)
{
  const u64 lim = 50ULL * 1000 * 1000;
  // Reads arrays
  u64 sum = 0;
  for (u64 i = 0; i < lim * 100; ++i) {
    sum += a[i % lim];
  }
  for (u64 i = 0; i < lim * 100; ++i) {
    sum += b[i % lim];
  }

  printf("%llu\n", sum);
  return 0;
}

Both programs do the same work: they fill two arrays with 1s, then read every element 100 times, adding each into a counter. The only difference is that the first one fuses the two additions into a single loop, while the second one performs two separate passes. Given that the M1 has 64 KB of L1 data cache, my understanding is that the following would happen:

Program 1

sum += a[0] // CACHE MISS. Load a[0..8192] on L1.
sum += b[0] // CACHE MISS. Load b[0..8192] on L1.
sum += a[1] // CACHE MISS. Load a[0..8192] on L1.
sum += b[1] // CACHE MISS. Load b[0..8192] on L1.
sum += a[2] // CACHE MISS. Load a[0..8192] on L1.
sum += b[2] // CACHE MISS. Load b[0..8192] on L1.
(...)

Program 2

sum += a[0] // CACHE MISS. Load a[0..8192] on L1.
sum += a[1] // CACHE HIT!
sum += a[2] // CACHE HIT!
sum += a[3] // CACHE HIT!
sum += a[4] // CACHE HIT!
...
sum += a[8192] // CACHE MISS. Load a[8192..16384] on L1.
sum += a[8193] // CACHE HIT!
sum += a[8194] // CACHE HIT!
sum += a[8195] // CACHE HIT!
sum += a[8196] // CACHE HIT!
...
...
sum += b[0] // CACHE MISS. Load b[0..8192] on L1.
sum += b[1] // CACHE HIT!
sum += b[2] // CACHE HIT!
sum += b[3] // CACHE HIT!
sum += b[4] // CACHE HIT!
...

This would lead me to believe that the first program is slower, since every read is a cache miss, while the second one consists mostly of cache hits. The measured results, though, say otherwise. Running on a MacBook Pro M1 with clang -O2, the first program takes 2.8 s to complete, while the second one takes about 3.8 s.

What is wrong about my mental model of how the L1 cache works?

MaiaVictor
  • Why would you expect the entire L1$ to be replaced on every cache miss? – EOF Dec 02 '21 at 23:35
  • @EOF that's what I want to know. Most resources I've read about the subject imply that is what happens. I'm looking for a more precise explanation. – MaiaVictor Dec 02 '21 at 23:38
  • No, that's not what any sane person claims. Cache is organized into *cachelines* (typically 64 bytes in size), and filling/evicting/replacing cache entries works at *cacheline* granularity. Also, if you look at the generated [assembly](https://godbolt.org/z/3G1jvr6M9), a plausible reason for the slowdown is easy to see: expensive index calculations. In the first code snippet, `i % lim` can be used for both memory accesses, but in the second snippet it must be recomputed. Depending on your optimization settings, a modulo operation can be very slow indeed. – EOF Dec 02 '21 at 23:46
  • I guess a simpler (less fluff) and more representative (matching your compiler and optimization settings) godbolt would be [this](https://godbolt.org/z/ooMoPW1d1). – EOF Dec 02 '21 at 23:53
  • @EOF thanks, I've updated the question. That makes much more sense now, and it answers my question. I was missing the fact that the cache is organized in cachelines. Still wondering if there is any more detailed explanation of how exactly cache works (for optimization purposes), because most resources I find don't really cover these details. – MaiaVictor Dec 03 '21 at 00:17
  • If you remove the `i % lim` from your initialization, your program will run faster. – Thomas Matthews Dec 03 '21 at 00:17
  • You could have a look at [this](https://stackoverflow.com/a/47714514/3185968) answer on this very site, if you want more guidance on where to learn about caches. – EOF Dec 03 '21 at 00:29

1 Answer


I'd expect that:

a) While the CPU is waiting for data to be fetched into L1 for sum += a[i % lim];, it can ask for the data for sum += b[i % lim]; to be fetched as well. Essentially, Program 1 waits for two cache misses in parallel, while Program 2 waits for one cache miss at a time and could therefore be up to twice as slow.

b) The loop overhead (all the work in for (u64 i = 0; i < lim * 100; ++i) {) and the indexing (calculating i % lim) are duplicated in Program 2, causing it to do almost twice as much extra work (work that has nothing to do with cache misses).

c) The compiler is bad at optimizing. I'm surprised the same code wasn't generated for both versions, and shocked that neither Clang nor GCC managed to auto-vectorize (use SIMD). A hypothetical, idealized, perfect compiler could optimize both versions all the way down to write(STDOUT_FILENO, "10000000000\n", 12); return 0;.

What is wrong about my mental model of how the L1 cache works?

It looks like you thought the cache can only cache one thing at a time. For Program 1 it would be more like:

sum += a[0] // CACHE MISS
sum += b[0] // CACHE MISS
sum += a[1] // CACHE HIT (data still in cache)
sum += b[1] // CACHE HIT (data still in cache)
sum += a[2] // CACHE HIT (data still in cache)
sum += b[2] // CACHE HIT (data still in cache)
sum += a[3] // CACHE HIT (data still in cache)
sum += b[3] // CACHE HIT (data still in cache)
sum += a[4] // CACHE HIT (data still in cache)
sum += b[4] // CACHE HIT (data still in cache)
sum += a[5] // CACHE HIT (data still in cache)
sum += b[5] // CACHE HIT (data still in cache)
sum += a[6] // CACHE HIT (data still in cache)
sum += b[6] // CACHE HIT (data still in cache)
sum += a[7] // CACHE HIT (data still in cache)
sum += b[7] // CACHE HIT (data still in cache)

sum += a[8] // CACHE MISS
sum += b[8] // CACHE MISS

For program 2 it's probably (see note) the same number of cache misses in a different order:

sum += a[0] // CACHE MISS
sum += a[1] // CACHE HIT (data still in cache)
sum += a[2] // CACHE HIT (data still in cache)
sum += a[3] // CACHE HIT (data still in cache)
sum += a[4] // CACHE HIT (data still in cache)
sum += a[5] // CACHE HIT (data still in cache)
sum += a[6] // CACHE HIT (data still in cache)
sum += a[7] // CACHE HIT (data still in cache)

sum += a[8] // CACHE MISS

..then:

sum += b[0] // CACHE MISS
sum += b[1] // CACHE HIT (data still in cache)
sum += b[2] // CACHE HIT (data still in cache)
sum += b[3] // CACHE HIT (data still in cache)
sum += b[4] // CACHE HIT (data still in cache)
sum += b[5] // CACHE HIT (data still in cache)
sum += b[6] // CACHE HIT (data still in cache)
sum += b[7] // CACHE HIT (data still in cache)

sum += b[8] // CACHE MISS

NOTE: I assumed each array is larger than the cache. If the cache were large enough to hold one entire array but too small to hold both, then Program 2 would probably be faster than Program 1. That is the only case where Program 2 would be faster.

Brendan
  • Thanks for the explanation on what was wrong about my mental model. That (in retrospect, obvious) difference makes it much more sensible. – MaiaVictor Dec 03 '21 at 01:05