
Suppose I have a trivial C program like this:

#include <stdio.h>
#include <string.h>

/* trivial definition so the example actually links */
void print_character(char c) { putchar(c); }

int main(int argc, char* argv[]){

  if (argc < 2)          /* argv[1] must exist before we read it */
    return 1;

  char loads_of_text[1024];
  int count = (int)strlen(argv[1]);
  if (count > (int)sizeof(loads_of_text))   /* don't overflow the buffer */
    count = sizeof(loads_of_text);
  memcpy(loads_of_text, argv[1], count);

  for(int i = 0; i < count; ++i)
    print_character(loads_of_text[i]);

  return 0;
}

As I understand it, the cache works like this: when memory is requested, the processor fetches more data than was actually asked for, to hide the latency of going to main memory. And this only pays off if I use memory sequentially, so that the entire cache line gets used before another one has to be fetched.

But I'm having a hard time visualizing exactly when the processor fetches and discards cache lines. Where exactly does that happen in an example like the code above? And when is a cache line discarded?

vlind
  • There are a lot of possible strategies, and many are actually used. So this isn't really answerable in general, except by saying that it depends. – harold Oct 01 '16 at 21:19
  • Please note that the cache isn't a concept, it is physical hardware built into the CPU die. And "more data than necessary" isn't strictly correct either, it always fetches cache lines and your data may or may not be smaller than the amount fetched. – UnholySheep Oct 01 '16 at 21:28

2 Answers


Further reading: see the end of this answer for links.

Cache lines are 64B in size, and aligned on 64B boundaries. (Some CPUs may use different size cache lines, but 64B is a very common choice).

Caches load data for two reasons:

  • demand misses: a load or store instruction accessed a byte that wasn't in any of the currently-hot cache lines.

    Near-future accesses to the same byte (or int or whatever) will hit in cache (temporal locality). Near-future accesses to nearby bytes in the same cache line will also hit (spatial locality). Looping over a multi-dimensional array in the wrong order, or looping over an array of structs while touching only one member, is really bad: the whole cache line has to be loaded, but you only use a small part of it. (See the traversal sketch after this list.)

  • prefetching: after a couple of sequential accesses, the hardware prefetcher notices the pattern and starts loading cache lines that haven't been accessed yet, so hopefully there won't be a cache miss when the program does access them. For a specific example of HW prefetcher behaviour, Intel's optimization manual describes the HW prefetchers in various specific CPUs (see the tag wiki for links), and points out that they only operate when the memory system isn't already flooded with demand misses.

    There's also software prefetching: the program can run an instruction that tells the CPU it will access something soon. It's only a hint, so execution keeps going even if the memory isn't ready yet; the program doesn't get stuck waiting for a cache miss. Modern hardware prefetching is very good, and software prefetching is usually a waste of time. It can be useful for something like a binary search, where you can prefetch both the 1/4 and 3/4 locations before looking at the new middle (see the prefetch sketch after this list).

    Do not add software prefetching without testing that it actually speeds up your real code. Even if HW prefetching doesn't do a good job, SW prefetching may not help, or might hurt. Out-of-order execution in general does a lot to hide cache-miss latency.
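
Here's a minimal sketch of the access-pattern point from the first bullet. Both functions sum the same statically allocated matrix, but the row-major loop walks memory sequentially, while the column-major loop strides a whole 4KiB row between consecutive accesses (the matrix dimensions are made up for illustration):

#include <stddef.h>

#define ROWS 1024
#define COLS 1024

static int matrix[ROWS][COLS];   /* 4MiB: bigger than L1/L2 */

/* Row-major: consecutive iterations touch consecutive ints, so each
 * 64B line loaded on a miss supplies the next 16 accesses. */
long sum_row_major(void)
{
    long sum = 0;
    for (size_t r = 0; r < ROWS; ++r)
        for (size_t c = 0; c < COLS; ++c)
            sum += matrix[r][c];
    return sum;
}

/* Column-major: consecutive iterations are COLS*sizeof(int) = 4KiB
 * apart, so every access touches a different cache line and most of
 * each loaded line goes unused before it's evicted. */
long sum_col_major(void)
{
    long sum = 0;
    for (size_t c = 0; c < COLS; ++c)
        for (size_t r = 0; r < ROWS; ++r)
            sum += matrix[r][c];
    return sum;
}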
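And a sketch of the binary-search prefetch idea, using GCC/Clang's __builtin_prefetch (a compiler-specific intrinsic; the prefetch is only a hint, and the payoff is machine-dependent, so benchmark before keeping it):

#include <stddef.h>

/* Binary search that speculatively prefetches both possible next
 * midpoints: whichever half we descend into, its midpoint is
 * hopefully already in cache by the time we load it. */
ptrdiff_t search(const int *arr, size_t n, int key)
{
    size_t lo = 0, hi = n;               /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        __builtin_prefetch(&arr[lo + (mid - lo) / 2]);   /* ~1/4 point */
        __builtin_prefetch(&arr[mid + (hi - mid) / 2]);  /* ~3/4 point */
        if (arr[mid] < key)
            lo = mid + 1;
        else if (arr[mid] > key)
            hi = mid;
        else
            return (ptrdiff_t)mid;
    }
    return -1;   /* not found */
}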


Usually a cache is "full", and loading a new line requires discarding an old line. Caches usually implement an LRU replacement policy by ordering the tags within a set, so every access to a cache line moves it to the most-recently-used position.

Take an example of an 8-way associative cache:

A 64B chunk of memory can be cached by any one of 8 "ways" in the set that it maps to. Part of the address bits are used as an "index" to select a set of tags. (See this question for an example of splitting an address into tag | index | offset-within-cache-line. The OP is confused about how it works, but has a useful ASCII diagram.)
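
For concreteness, here's a sketch of that split in C, for a hypothetical cache with 64B lines, 64 sets and 8 ways (32KiB total, roughly the shape of a typical L1d; the exact geometry here is an assumption for illustration):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: 64B lines -> 6 offset bits,
 * 64 sets -> 6 index bits, remaining high bits are the tag. */
#define OFFSET_BITS 6
#define INDEX_BITS  6

int main(void)
{
    uint64_t addr = 0x7ffd1234abcdULL;   /* made-up example address */

    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag=%#llx index=%llu offset=%llu\n",
           (unsigned long long)tag,
           (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}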

Hit/miss determination doesn't depend on order. A fast cache (like an L1 cache) will typically check all 8 tags in parallel to find the one (if any) that matches the high bits of the address.

When we need space for a new line, we need to choose one of the 8 current tags to replace (and put the data in its associated 64B storage array). If any are currently in the Invalid state (not caching anything), then the choice is obvious. In the normal case, all 8 tags are already valid.

But we can store additional data with the tags, enough to store an ordering. Every time a cache hit happens, the ordering for the tags within the set is updated to put the line with the hit in the MRU position.

When a new line needs to be allocated, evict the LRU tag, and insert the new line at the MRU position.
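
To make the tag-ordering idea concrete, here's a toy software model of a single 8-way set with true LRU. This is just the bookkeeping, not how real hardware does it: real designs keep a few extra ordering/state bits per set (often only a pseudo-LRU approximation) rather than physically shuffling tags around.

#include <stdint.h>
#include <stdbool.h>

#define WAYS 8
#define INVALID_TAG UINT64_MAX   /* sentinel: this way caches nothing */

/* One set, kept in recency order: ways[0] is the MRU tag,
 * ways[WAYS-1] the LRU tag. Initialize all ways to INVALID_TAG so
 * cold misses fill the empty slots first. */
struct cache_set {
    uint64_t ways[WAYS];
};

/* Returns true on a hit. Either way, `tag` ends up in the MRU slot:
 * on a hit the matching entry moves to the front; on a miss the LRU
 * entry (the last slot) is evicted to make room. */
bool lru_access(struct cache_set *s, uint64_t tag)
{
    int i;
    bool hit = false;
    for (i = 0; i < WAYS; ++i) {
        if (s->ways[i] == tag) {
            hit = true;
            break;
        }
    }
    if (!hit)
        i = WAYS - 1;            /* victim = LRU way (gets overwritten) */
    for (; i > 0; --i)           /* shift entries down one recency slot */
        s->ways[i] = s->ways[i - 1];
    s->ways[0] = tag;            /* install at MRU position */
    return hit;
}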

The standard LRU policy means that looping over an array that's just slightly too large to fit in cache means you will never see any cache hits, because by the time you come back to the same address, it's been evicted. Some CPUs use a sophisticated replacement policy to avoid this: e.g. Intel IvyBridge's large shared L3 cache uses an adaptive replacement policy that decides on the fly when to allocate new lines in the LRU position, so new allocations evict other recently-allocated lines, preserving lines that do have future value. This takes extra logic, so is only done in large / slow caches, not in the faster L2 and L1 caches.

(Why all 8 tags are usually valid, even at the start of your program:

By the time execution reaches the start of your program, the kernel has already run a bunch of code on the same CPU that will be running your program. Typical modern caches are physically indexed and tagged (the VIPT L1 speed trick avoids aliasing, and is really equivalent to index translation being free, not really to using virtual indexing), so caches don't have to be flushed on context switches. i.e. they cache physical memory, regardless of changes in virtual-to-physical translation caused by page tables.)


You should read Ulrich Drepper's What Every Programmer Should Know About Memory article. IIRC, he goes through the basics of what triggers a load.

Some of Ulrich's specific advice is a bit dated by now; prefetching threads were useful on Pentium 4 but generally aren't anymore. HW prefetchers are smarter now, and Hyperthreading is good enough to just run two full threads.

Wikipedia's CPU Cache article also explains some details about caches, including eviction policy (how the CPU chooses which line to discard when loading a new line).

Peter Cordes

I am having a hard time understanding what this does (in your code as originally posted):

int count = strlen(argv);

You should get a compiler warning there: strlen() expects a char*, but argv is a char**. By the way, argv[0] is the name your program was invoked with (typically the path to your executable).

First make sure you understand how your code works; then, and only then, move on to more advanced topics such as the cache.


To give you the intuition:

The cache loads more data than you asked for (the rest of the cache line, e.g. neighbouring array elements), in the hope that your program will request it soon, because fetching from main memory is expensive.

If the cache is full, a victim line is selected (by whatever replacement policy the cache implements, such as LRU) and is replaced by the newly requested data.
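
If you want to see this intuition on real hardware, here's a rough sketch (timings are machine-dependent, and check that the compiler doesn't optimize the loops away): a sequential pass uses every byte of each loaded cache line, while a stride-64 pass does 1/64th of the loads but still pulls in every line, so it runs nowhere near 64x faster.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)     /* 64MiB: far larger than any cache */

int main(void)
{
    unsigned char *buf = calloc(N, 1);
    if (!buf)
        return 1;

    /* Sequential pass: every byte of each loaded 64B line is used. */
    clock_t t0 = clock();
    long sum = 0;
    for (size_t i = 0; i < N; ++i)
        sum += buf[i];
    clock_t t1 = clock();

    /* Stride-64 pass: 1/64th of the loads, but the same memory
     * traffic, since every cache line still has to be brought in. */
    for (size_t i = 0; i < N; i += 64)
        sum += buf[i];
    clock_t t2 = clock();

    printf("sequential: %.3fs  stride-64: %.3fs  (sum=%ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(buf);
    return 0;
}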

gsamaras
  • It was just hastily scribbled-together pseudocode and not an actual working code example. – vlind Oct 01 '16 at 21:24
  • It wasn't mentioned, vlind; anyway, hope my answer helps! :) – gsamaras Oct 01 '16 at 21:30
  • Caches can't know that a load came from an array until you've already done a couple sequential accesses. It's only at that point that HW prefetching kicks in. Although IIRC, the spatial prefetcher in Intel CPUs tries to fetch the other 64B of the aligned 128B containing a cache line that was just loaded (if there's spare bandwidth). – Peter Cordes Oct 03 '16 at 12:10
  • That sounds right @PeterCordes, but I was just giving the intuition, any future user shall read your answer for more info! Thanks for the upvote! Nice answer of yours too! :) – gsamaras Oct 03 '16 at 14:34