6

I read the very basics of how the cache works here: How and when to align to cache line size? and here: What is "cache-friendly" code?, but neither of those posts answered my question: is there a way to execute some code entirely within the cache, i.e., without any access to RAM (beyond, perhaps, the initial loading of the program from the HDD)? As far as I understand, the bottleneck in computation nowadays is mostly memory bandwidth, and "as long as you stay within the CPU, you are just fine".

Is there a way to load a program into the cache, and keep it there until it terminates? So let's say I have a 1MB compiled C program, which does some scientific computation with a memory requirement of another 1MB, and runs for 5 days. Is there a way to flag this code so that it does not leave the cache during execution? I am thinking of giving this code a higher priority, or something similar, during execution.

In other words: how much cache does an idling computer use, one which loads its OS (say Ubuntu) and then does nothing? Is there significant cache use during idling? Should I expect my small program to stay in the cache if the OS does nothing besides executing it? Let's say the screensaver starts after 5 minutes. Does this lead to massive cache misses (and hence a drastic reduction in performance), since it now competes with my program for cache space? My experience is that running several non-demanding programs (a screensaver, a simple audio player, a PDF reader, etc.) at the same time does not significantly decrease the performance of my scientific program, even though I would expect it to be going in and out of the cache all the time. The question is: why is its speed not affected? Would it make sense to use an absolutely minimalistic OS (if so, which one?) to improve, or rather maintain, the speed of the computation?

Just for clarity, we can assume that the code is something very simple, say it is a bunch of nested for loops where the innermost part sums up all the increment variables modulo 97. The point is that it is small enough to be put and executed in the cache.

Matsmath

5 Answers

6

There are different types of CPU cache misses: compulsory, conflict, capacity, coherence.

Compulsory misses can't be avoided, as they happen on the first reference to a location in memory. So no, you definitely can't avoid cache misses completely.

Besides that, typical L1 cache sizes today are 32 KB/64 KB per core, and L2 cache sizes are 256 KB per core. So 1 MB of code plus data would also cause either capacity or conflict misses, depending on the cache's associativity.

chrk
  • Interesting link; I have never heard of this classification. I see now that recent Xeons do have 256 KB of L2 cache per core, while older models used to have 1-3 MB. It seems that a smaller L2 but a larger L3 is the "trend" nowadays. Thanks. – Matsmath Sep 23 '14 at 00:32
  • Compulsory misses can be kinda hidden by prefetching, performed either in hardware or software. The instruction(s) that induce the load of data from memory into cache won't stall on the resulting memory bus transaction. This of course gives rise to two questions of whether these are cache misses or not: semantically, for whatever purposes one is discussing cache misses; and mechanically, whether a given CPU microarchitecture would increment the associated performance counters. The CPUs may of course vary, given the very limited degrees of control designers usually offer over counters. – Phil Miller Dec 16 '17 at 15:47
3

No, on most standard architectures, CPU cache is not addressable.*

And even if you could, what kind of performance improvement are you anticipating here? What percentage of your program's execution time do you believe is being spent loading from main memory into (L3) cache? You should profile your program to determine where it's actually spending its time, rather than dreaming up solutions to problems that don't exist!


* I think x86 CPUs might have a hardware configuration which allows them to operate without attached RAM, but that's basically irrelevant.
Oliver Charlesworth
  • then I am missing the point, I am afraid. If you say that loading the program from the main memory to the L3 cache takes insignificant time compared to executing the program itself, then I don't understand how cache misses are an issue in the first place. About your second remark: the program is extremely simple, so I am afraid profiling does not help in this case at all. – Matsmath Sep 22 '14 at 23:44
  • @Matsmath: That's actually the point. In some cases, the load *is* insignificant, because the program spends the majority of time doing something else. In other cases, the program may be thrashing around with a very large working set, in which case the main-memory loads definitely *are* significant. Take a guess at which category your example falls into! (And yes, you can still profile, e.g. with hardware performance counters). – Oliver Charlesworth Sep 22 '14 at 23:48
  • This all makes sense. So the answer I was really looking for is that "as long as the program is simple enough, one may assume that it is (almost) always in the cache, since loading it from RAM is instantaneous. Therefore there is no need for any fancy ways to keep it in the cache permanently." Exactly what I was looking for. I will try to accept this as a (short, but convincing) answer. Thanks! – Matsmath Sep 23 '14 at 00:07
  • What I understand is: when the stack is placed in RAM, the processor fetches the stack data from RAM into the L1 cache and works with it there; but when the stack is placed in TCM, the data always stays in the TCM (there is no fetch from TCM into the L1 cache), so the running program will always see fewer cache misses? Or am I mistaken? – bouqbouq Jul 30 '15 at 08:50
2

Short answer: NO. The cache is managed by the CPU and the OS, and it would be a bad idea to allow programs to force themselves to stay in it. Say you have two programs running at the same time, both trying to force themselves into the cache; chaos would ensue, wouldn't it?

Steve
  • That's a good point, of course. That is why I have suggested "prioritizing" certain parts of the code over others. – Matsmath Sep 22 '14 at 23:53
  • @Matsmath Well, either way, the CPU vendor/OS won't let you do it. So the answer is no, you can't. – Steve Sep 22 '14 at 23:57
2

Newer Intel CPUs have added "Cache Allocation Technology" (CAT) under the general rubric of their Resource Director Technology. This allows software directives to reserve certain cache (and other) resources for particular computational units (application, container, VM, etc). So, if the process in question has enough cache space set aside for it under CAT, it should experience only its initial compulsory misses (to bring its code and data into cache) and self-induced conflict misses, avoiding capacity misses and conflict misses created by other processes.

Phil Miller
1

I am not sure whether this will fully answer your questions.

is there a way to execute some code entirely within the cache, i.e., without using any access to RAM? Is there a way to load a program into the cache, and keep it there until it terminates?

It is possible to use fully associative memories with single-cycle access times, for example tightly coupled memories (TCMs); this is realistic only in very small embedded systems. It is general practice to use TCMs in embedded systems for time-critical code, as they provide predictability.

In the case of set-associative caches, it is possible to lock cache lines or ways (e.g., via CP15 on ARM) so that the eviction algorithm does not consider them as victims for a cache fill.

As a side note, it is also sometimes useful to use Cache-as-RAM, with the caches in debug mode, for bring-up of boards that do not boot. (http://www.asset-intertech.com/Products/Processor-Controlled-Test/PCT-Software/Cache-as-RAM-for-board-bring-up-of-non-boothing-ci)

bare_metal