
Descriptions of caches in books are always very general. I am a student in the field of computer architecture, and I want to understand cache behavior in more detail.

In C/C++ code, what data gets loaded from memory into the cache? Is data loaded into the cache when it is frequently used? For example, when I write a for loop in C, I often use the variables i, j, and k. Will these be loaded into the cache too? Local variables in C are generally placed in the stack area, and global variables in the data area; will these be loaded into the cache first when they are used? Does data have to go through the cache to reach a register and then the CPU?

The pointer variable p stores the address of some data. If I access a variable through *p, will p be loaded into the cache first, and then *p?

Gerrie
  • If you are reading books about programming languages, such as C++, it's not surprising that caches are described in general terms. Program code can benefit from caching, but typically program code has no control over caching. See, for example, [Which part of the computer manages cache replacement?](https://stackoverflow.com/questions/65366046/which-part-of-the-computer-manages-cache-replacement) If you want to know more details about the CPU cache, try books that address CPU design. – JaMiT Dec 22 '20 at 01:54
  • This is too broad a question. Various processors use various cache strategies. Processors do not always use a least-recently-used selection for eviction. Another method is a pseudo-least-recently-used scheme that is a sort of railroad-switch method. For example, given eight things, imagine there is railroad track with a switch leading to 0 and 1, another track with a switch leading to 2 and 3, and so on, and then a second layer of switches selecting 0-1 versus 2-3 and 4-5 versus 6-7, and a final switch selecting 0-3 versus 4-7. When any item is used, the switches are set to lead to it… – Eric Postpischil Dec 22 '20 at 01:59
  • … When it is time to evict, the complements of the switches are followed, and the item there is evicted. Another issue is that items may be loaded into cache even if they are never used: when a processor observes cache lines 34 and 35 (an oversimplification of addressing) being used in succession, it might load 36 in anticipation of it being used. Programs can also make requests regarding cache, such as asking to load data but marking it as “least recently used” so that it is first to be evicted, because the program does not expect to use it again. – Eric Postpischil Dec 22 '20 at 02:01 [A code sketch of this tree pseudo-LRU scheme follows these comments.]
  • I know LRU and pseudo-LRU (tree-LRU). In fact, what I want to know is which data gets loaded into the cache; you just mentioned cache replacement. Where can I find a detailed description of cache lines? For example, does a cache line have a valid bit and LRU status bits? Where can I get more detailed information? Thanks – Gerrie Dec 22 '20 at 02:03
  • 1
    Different processors implement cache differently. These details are processor-specific. – Eric Postpischil Dec 22 '20 at 02:04
  • For example, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz. Can I check its information directly on my server? – Gerrie Dec 22 '20 at 02:05
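
A minimal sketch of the tree pseudo-LRU ("railroad switch") scheme described in the comments above, for one 8-way set. The seven switch bits form a binary tree over the eight ways; the node numbering and bit convention here are illustrative assumptions, not any particular CPU's implementation:

```cpp
#include <cstdint>

// Toy tree pseudo-LRU for one 8-way set: seven "switch" bits arranged as a
// binary tree (node 0 is the root, nodes 1-2 its children, nodes 3-6 below).
struct TreePlru8 {
    std::uint8_t bits = 0;  // one switch bit per internal tree node (7 used)

    // On an access to way w (0-7), point every switch on the path toward w.
    void touch(unsigned w) {
        unsigned node = 0;  // start at the root
        for (unsigned level = 0; level < 3; ++level) {
            unsigned bit = (w >> (2 - level)) & 1;  // branch taken at this level
            if (bit) bits |=  (1u << node);
            else     bits &= ~(1u << node);
            node = 2 * node + 1 + bit;              // descend to that child
        }
    }

    // To pick an eviction victim, follow the complement of each switch.
    unsigned victim() const {
        unsigned node = 0, w = 0;
        for (unsigned level = 0; level < 3; ++level) {
            unsigned bit = ((bits >> node) & 1) ^ 1;  // go the other way
            w = (w << 1) | bit;
            node = 2 * node + 1 + bit;
        }
        return w;
    }
};
```

`touch()` would run on every hit or fill, and `victim()` whenever a miss needs a way to evict. Note that an access only updates the three switches on its own path, which is what makes this cheaper in hardware than true LRU.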

2 Answers


Normally all the memory your C++ program uses (code and data) is in cacheable memory.

Any access (read or write) to any C++ object¹ will result in the cache line containing it being hot in cache, even if it was previously not hot, assuming a normal CPU cache: set-associative, write-back / write-allocate².

The simplest design is that each level of cache fetches data through the next outer level, so after a load miss, the data is hot in all levels of cache. But you can have outer caches that don't read-allocate and instead act as victim caches, or outer levels that are Exclusive of inner caches, to avoid wasting space caching the same data twice (https://en.wikipedia.org/wiki/Cache_inclusion_policy).

But whatever happens, right after a read or write, at least the inner-most level of cache (closest to that CPU core) will have the data hot, so accessing it again right away (or an adjacent item in the same cache line) will be fast. Different design choices affect the chances of a line still being hot when the next access comes after a bunch of other accesses, and, if it is hot, which level of cache you may find it in.

But the super basics are: any memory that compiler-generated code touches ends up in cache. CPU cache transparently caches physical memory.
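
As a rough illustration (a sketch, not a rigorous benchmark; the buffer sizes and any exact numbers are machine-dependent assumptions), summing a buffer that stays hot in cache costs far less per element than streaming one much larger than the last-level cache:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Time summing a vector `passes` times; returns nanoseconds per element.
static double ns_per_elem(const std::vector<int>& v, int passes) {
    auto t0 = std::chrono::steady_clock::now();
    long long sum = 0;
    for (int p = 0; p < passes; ++p)
        for (int x : v) sum += x;
    auto t1 = std::chrono::steady_clock::now();
    volatile long long sink = sum;  // keep the loops from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / (static_cast<double>(v.size()) * passes);
}

int main() {
    std::vector<int> small(4 * 1024);        // 16 KiB: fits in L1d on most CPUs
    std::vector<int> big(64 * 1024 * 1024);  // 256 MiB: far larger than any L3
    std::printf("cache-hot: %.3f ns/elem, DRAM-bound: %.3f ns/elem\n",
                ns_per_elem(small, 100000), ns_per_elem(big, 1));
}
```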

Many cache lines can be hot at the same time without aliasing each other; i.e. caches have many sets. Some access patterns are pessimal, like multiple pointers all offset from each other by 4 KiB, which makes all accesses alias the same set in L1d cache, and can also incur extra 4k-aliasing penalties in the CPU's memory-disambiguation logic. (This assumes a 4 KiB page size, as on x86.) See e.g. L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes - memory performance effects can get very complicated. Knowing some theory is enough to understand what's generally good, but the exact details can be very complex. (Sometimes even for true experts like Dr. Bandwidth, e.g. this case.)
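
For example, a sketch of that pessimal pattern (assuming 4 KiB pages and a typical 32 KiB, 8-way L1d with 64-byte lines, so addresses 4096 bytes apart index the same set; the function name is hypothetical):

```cpp
// Every access below is 4096 bytes from the previous one, so under the
// assumed cache geometry they all map to the same L1d set. With more than
// 8 such lines in use, an 8-way set can't hold them all: conflict misses,
// even though the total data touched is tiny compared to the cache size.
// `base` must point to at least pages * 4096 bytes.
long long sum_page_strided(const int* base, int pages, int reps) {
    long long sum = 0;
    for (int r = 0; r < reps; ++r)
        for (int i = 0; i < pages; ++i)
            sum += base[i * (4096 / sizeof(int))];  // stride = one 4 KiB page
    return sum;
}
```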

Footnote 1: Loosely, an object is a named variable or a piece of dynamically allocated memory pointed to by a pointer.

Footnote 2: Write-back cache with a write-allocate policy is near-universal for modern CPUs, as is a pseudo-LRU replacement policy; see Wikipedia. A few devices have access patterns that benefit from caches that allocate only on read, not on write, but CPUs benefit from write-allocate. A modern CPU will almost always have a multi-level cache hierarchy, with each level set-associative with some degree of associativity. Some embedded CPUs may have only one level, or even no cache, but you'd know if you were writing code specifically for a system like that.

Modern large L3 caches sometimes use an adaptive replacement policy instead of plain pseudo-LRU.
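
To make "set-associative" concrete, here is how an address splits into tag / set index / line offset for an assumed 32 KiB, 8-way cache with 64-byte lines (real CPUs vary in all of these parameters):

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative parameters only: 32 KiB / 64-byte lines / 8 ways gives
// 32768 / 64 / 8 = 64 sets. Bits [5:0] are the offset within the line,
// bits [11:6] select the set, and everything above that is the tag.
constexpr unsigned LINE_BITS = 6;  // 64-byte lines
constexpr unsigned SET_BITS  = 6;  // 64 sets

void describe(std::uint64_t addr) {
    std::uint64_t offset = addr & ((1ull << LINE_BITS) - 1);
    std::uint64_t set    = (addr >> LINE_BITS) & ((1ull << SET_BITS) - 1);
    std::uint64_t tag    = addr >> (LINE_BITS + SET_BITS);
    std::printf("addr %#llx -> tag %#llx, set %llu, offset %llu\n",
                (unsigned long long)addr, (unsigned long long)tag,
                (unsigned long long)set, (unsigned long long)offset);
}
```

Note that addresses differing by exactly 4096 bytes have identical set-index bits under these parameters, which is the 4k-aliasing effect mentioned above.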


Of course, optimization can mean that some local variables (especially loop counters or array pointers) get optimized into a register and never exist in memory at all. Registers are not part of the CPU cache or of memory; they're a separate storage space. People often say the compiler "caches the value in a register", but don't confuse that with the CPU cache. (Related: https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ and When to use volatile with multi threading?)
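
For example (a minimal sketch; the function name is hypothetical), with optimization enabled a compiler will typically keep both `i` and `sum` in registers for a loop like this, so only the `a[i]` loads touch memory, and therefore cache:

```cpp
// Compiled with e.g. g++ -O2, `i` and `sum` normally live in registers for
// the whole loop: the generated code loads a[i] through the cache, but never
// loads or stores the counter or the accumulator themselves.
int sum_array(const int* a, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```

You can confirm this by compiling with `g++ -O2 -S` (or on Compiler Explorer) and inspecting the asm, as described below.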

If you want to see what the compiler is making the CPU do, look at the compiler's asm output (see How to remove "noise" from GCC/clang assembly output?). Every memory access in the asm is an access in computer-architecture terms, so you can apply what you know about set-associative caches to the access pattern to figure out what will happen.


Peter Cordes
  • A few of the things I said like "right after a write" are simplifications: out-of-order execution and the store buffer mean that right after commit to L1d cache is when the store truly happens (and the line has to be hot in cache). Not when the store executes in the store exec units. – Peter Cordes Dec 22 '20 at 05:51

Generally, the most recently used cache lines are the ones stored in the cache. For short loops, the loop counter is normally kept in a CPU register. For longer loops, the loop counter will probably be stored in the cache, unless one loop iteration runs for such a long time that the counter gets evicted due to the CPU doing other work.

Most variables will generally be cached after the first access (or beforehand, if the cache prefetcher does a good job), irrespective of how often they are used. A higher frequency of use only makes a cache line less likely to be evicted; it doesn't influence whether the line gets cached in the first place. However, some CPU architectures offer so-called non-temporal read and write instructions, which bypass the cache. These are useful if the programmer knows in advance that a memory location will only be accessed once and therefore should not be cached. But generally, these instructions should not be used unless you know exactly what you are doing.
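
For example, on x86 a non-temporal store can be written with the SSE2 `_mm_stream_si128` intrinsic. A hedged sketch; the function name and the requirement that `dst` be 16-byte aligned and `bytes` a multiple of 16 are simplifying assumptions:

```cpp
#include <immintrin.h>  // _mm_set1_epi8, _mm_stream_si128, _mm_sfence
#include <cstddef>
#include <cstdint>

// Fill a large write-once buffer with non-temporal (streaming) stores that
// go to memory without allocating the lines in the cache.
// dst must be 16-byte aligned; bytes is assumed to be a multiple of 16.
void fill_bypassing_cache(std::uint8_t* dst, std::size_t bytes,
                          std::uint8_t value) {
    const __m128i v = _mm_set1_epi8(static_cast<char>(value));
    for (std::size_t i = 0; i < bytes; i += 16)
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    _mm_sfence();  // order the streaming stores before any later stores
}
```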

The CPU cache does not care whether variables are stored on the heap or the stack. Memory is simply cached according to a "most recently used" policy, or, more accurately, cache lines are evicted according to a "least recently used" policy whenever room is needed for a new memory access.
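
As a toy model of that eviction policy (true LRU here; real CPUs usually use the cheaper pseudo-LRU sketched under the question's comments), one 8-way set could track recency like this:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// One 8-way set with true-LRU replacement: the most-recently-used tag sits
// at the front of the array, and the back slot is the next eviction victim.
struct LruSet {
    std::array<std::uint64_t, 8> tags{};  // 0 = "invalid" in this toy model

    // Returns true on a hit. On a miss, the least-recently-used way (the
    // back slot) is evicted and replaced by the new tag.
    bool access(std::uint64_t tag) {
        auto it = std::find(tags.begin(), tags.end(), tag);
        const bool hit = (it != tags.end());
        if (!hit) it = tags.end() - 1;           // victim: the LRU way
        std::rotate(tags.begin(), it, it + 1);   // move that way to the front
        tags.front() = tag;                      // (re)install the tag as MRU
        return hit;
    }
};
```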

In the case of local variables stored on the stack, there is a high chance that the cache line of that variable is already cached due to the program using that stack cache line recently for something else. Therefore, local variables generally have good cache performance. Also, the cache prefetcher works very well with the stack, because the stack grows in a linear fashion.

> The pointer variable p stores the address of some data. If I access a variable through *p, will p be loaded into the cache first, and then *p?

Yes: first, the cache line containing p will be loaded into the cache (unless it is already cached, or p is already held in a CPU register). Then the cache line containing *p will be loaded into the cache.
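
For example (a hypothetical function; the double pointer just forces `p` itself to live in memory rather than in a register):

```cpp
// Two dependent loads: the CPU must first fetch the cache line holding the
// pointer p, and only then can it fetch the (possibly different) cache line
// holding the int that p points to.
int load_through(int* const* pp) {
    int* p = *pp;   // load 1: the pointer value itself
    return *p;      // load 2: the pointed-to data
}
```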

Andreas Wenzel
  • Thank you for your answer. You point out that the most recently used cache lines will be stored in the cache. Does this mean that some data (not frequently accessed) will not be loaded into the cache? Later in your answer, you say that basically any data is loaded into the cache when it is first accessed. Does this mean that data must pass through the cache to reach a register and then the CPU? – Gerrie Dec 22 '20 at 04:44
  • @cyj: "Does this mean that some data (not frequently accessed) will not be loaded into the cache?" -- All memory accesses will be loaded into cache (unless non-temporal instructions are being used, which is rare). The frequency of memory access only affects how long a cache line remains in the cache. A cache line which is rarely used has a higher chance of being evicted, because the cache evicts according to a "least recently used" algorithm. – Andreas Wenzel Dec 22 '20 at 06:00
  • @cyj: "Does this mean that the data must pass through the cache to reach the register and then to the CPU?" -- I don't know how CPUs internally handle cache misses. Theoretically, they could wait for the memory to be loaded into the CPU cache, and then read from the CPU cache. But my guess is that it probably is faster not to wait for the CPU cache in this case. CPUs probably have special circuitry for handling cache misses, so that the CPU can access the memory data without waiting for the CPU cache to update. However, I am just guessing. The answer probably depends on the CPU architecture. – Andreas Wenzel Dec 22 '20 at 06:23
  • @cyj: Actually, in a multi-CPU environment, the answer is a bit more complicated, as [cache coherency](https://en.wikipedia.org/wiki/Cache_coherence) must also be taken into account. It is therefore possible that the main memory's version of a cache line contains stale (out of date) memory data, and that the actual data must be (indirectly) read from the cache of one of the other CPUs. – Andreas Wenzel Dec 22 '20 at 09:34