
I was just wondering about how data changes affect the CPU cache.

Let's say I have the following C code:

```c
int main() {
  int arr[16] = ...

  for (int i = 1; i < 16; i++) {
    arr[i] = arr[i] + arr[i-1];
  }

  for (int i = 0; i < 16; i++) {
    arr[i] += arr[i];
  }
}
```

How many times does the CPU have to reload the numbers in cache because of the memory writes in each of the loops?

Adrian Mole
scasci
  • The performance of a block of code as it relates to a cache depends on many factors, but **cache locality** is probably the most important. – Robert Harvey Jan 08 '23 at 16:29
  • Reloading a recent store should hit in cache with a simple access pattern like this (not also accessing other memory in between that might be a multiple of 4k away, or whatever could alias). Or store-to-load forward from the store buffer if it hasn't even committed to L1d cache yet. In any real-world CPU with a cache, it would normally be big enough to cache 16 `int`s, so you'd expect the 2nd loop to get all L1d cache hits when it runs right after the first loop primed the cache. I wouldn't expect any L1d misses except maybe in the stores to the stack from the `= ...`. – Peter Cordes Jan 08 '23 at 16:47
  • you need to read about TLB and TLB hit and miss – Hackaholic Jan 08 '23 at 18:22

1 Answer


The exact answer depends on the machine-specific details of the cache configuration. The only way to know for sure is to measure, using the hardware performance counters and a tool like PAPI.

In general, however, a write from a core updates the copy in its L1 cache, so a subsequent read of the same address returns the updated copy from cache without a miss (assuming the cache line hasn't been evicted in the interval).

For the code you show (a 1-D array of 16 4-byte elements), you're dealing with only 64 bytes, which is one cache line on most modern processors (or two, depending on alignment). It will very likely be loaded into L1 cache when you initialize the elements, and both loops will then operate entirely in-cache (assuming there are no conflicting accesses from other threads).

user28731
  • Thanks, very helpful. One quick question: is there a chance that each time the values of the array are modified the loaded cache line is actually being invalidated? That would require a memory load at each iteration instead of just one in total. – scasci Jan 08 '23 at 17:15
  • @scasci: No, unless you used x86 `_mm_stream_si32` stores (`movnti`, cache-bypassing: https://www.felixcloutier.com/x86/movnti). Compilers of course won't do that on their own because it would be a performance disaster. A cache that does write-invalidate would be a terrible design, nobody does that. Some CPUs have had write-*through*, updating the data in cache while also passing it through to the next outer level of cache. (In x86, as recent as AMD Bulldozer-family, although that was widely considered a poor design.) – Peter Cordes Jan 08 '23 at 17:34
  • See also [Which cache mapping technique is used in intel core i7 processor?](https://stackoverflow.com/q/49092541) – Peter Cordes Jan 08 '23 at 17:34
  • @PeterCordes With **really stupid** compiler autoparallelization distributing the work to different cores/caches at word granularity, write invalidate would effectively be used. Perhaps a compiler for the old (cacheless) Tera might do something like that? –  Jan 09 '23 at 13:02