
After reading the accepted answer in When should we use prefetch? and examples from Prefetching Examples?, I still have a lot of issues with understanding when to actually use prefetch. While those answers provide an example where prefetch is helpful, they do not explain how to discover it in real programs. It looks like random guessing.

In particular, I am interested in the Intel x86 prefetch instructions (prefetchnta, prefetcht2, prefetcht1, prefetcht0, prefetchw) that are accessible from C through GCC's __builtin_prefetch intrinsic. I would like to know:

  • How can I see that software prefetch can help for my specific program? I imagine that I can collect CPU profiling metrics (e.g. the number of cache misses) with Intel VTune or the Linux perf utility. In that case, what metrics (or relations between them) indicate an opportunity to improve performance with software prefetching?
  • How can I locate the loads that suffer from cache misses the most?
  • How can I see the cache level where misses happen, to decide which prefetch (0, 1, 2) to use?
  • Assuming I found a particular load that suffers from misses in a specific cache level, where should I place the prefetch? As an example, assume that the following loop suffers from cache misses:
for (int i = 0; i < n; i++) {
   // some code
   double x = a[i];
   // some code
}

Should I place the prefetch before or after the load a[i]? How far ahead should it point, a[i+m]? Do I need to worry about unrolling the loop to make sure that I am prefetching only on cache line boundaries, or will it be almost free, like a nop, if the data is already in cache? Is it worth using multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?

Peter Cordes
Curious
  • Prefetching sometimes helps when there is a lot of processing of the fetched data that saturates the CPU pipeline, like a bunch of ALU operations. The CPU may not see the upcoming fetch operation in its processing window. Prefetching gives the CPU a hint to load the next cache line in parallel. Don't expect huge gains: something like 10% at most. – tstanisl May 14 '22 at 22:05

2 Answers


How can I see that software prefetch can help for my specific program?

You can check the proportion of cache misses. perf or VTune can be used to get this information thanks to hardware performance counters. You can get the list of counters with perf list, for example. The list depends on the target processor architecture, but there are some generic events, for example L1-dcache-load-misses, LLC-load-misses and LLC-store-misses. Having the number of cache misses is not very useful unless you also get the number of loads/stores. There are generic counters for that too, like L1-dcache-loads, LLC-loads or LLC-stores. AFAIK, for the L2, there are no generic counters (at least on Intel processors) and you need to use specific hardware counters (for example l2_rqsts.miss on Intel Skylake-like processors). To get the overall statistics, you can use perf stat -e an_hardware_counter,another_one your_program. A good documentation can be found here.
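
A hedged sketch of the corresponding commands (event names vary per microarchitecture, so check perf list first; your_program stands for your own binary):

```shell
# List the events exposed by your CPU (names differ per microarchitecture):
perf list
# Whole-run statistics with the generic L1/LLC events mentioned above:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./your_program
# The L2 needs a model-specific event on Intel (Skylake-like shown here):
perf stat -e l2_rqsts.references,l2_rqsts.miss ./your_program
```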

When the proportion of misses is high, you should try to optimize the code, but this is only a hint. In fact, depending on your application, you can have a lot of cache hits overall but many cache misses in a critical part/time of your application. As a result, those cache misses can be lost among all the others. This is especially true for L1 cache references, which are massive in scalar code compared to SIMD code. One solution is to profile only a specific portion of your application and use your knowledge of it so as to investigate in the right direction. Performance counters are not really a tool to automatically search for issues in your program, but a tool to assist you in validating/disproving hypotheses, or to give some hints about what is happening. They give you evidence to solve a mysterious case, but it is up to you, the detective, to do all the work.

How can I locate the loads that suffer from cache misses the most?

Some hardware performance counters are "precise", meaning that the instruction that generated the event can be located. This is very useful since you can tell which instructions are responsible for the most cache misses (though the attribution is not always precise in practice). You can use perf record + perf report to get this information (see the previous tutorial for more details).
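
A hedged sketch of how that looks with perf (the :pp modifier requests precise sampling on CPUs that support it, e.g. via PEBS on Intel; the event name is architecture-dependent):

```shell
# Sample a precise cache-miss event and attribute samples to instructions:
perf record -e L1-dcache-load-misses:pp ./your_program
# Browse the profile down to the assembly of the hottest loads:
perf report
```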

Note that there are many things that can cause cache misses, and only a few cases can be solved by using software prefetching.

How can I see the cache level where misses happen, to decide which prefetch (0, 1, 2) to use?

This is often difficult to choose in practice and very dependent on your application. Theoretically, the number is a hint telling the processor the expected level of locality of the target cache line (e.g. fetch it into the L1, L2 or L3 cache). For example, if you know that the data will be read and reused soon, it is a good idea to put it in the L1. However, if the L1 is under pressure and you do not want to pollute it with data used only once (or rarely), it is better to fetch the data into a lower cache level. In practice it is a bit more complex, since the behaviour may not be the same from one architecture to another... See What are _mm_prefetch() locality hints? for more information.
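
As a minimal sketch (the function, array and distance of 64 elements are made up for illustration), the third argument of __builtin_prefetch selects the locality hint, which GCC maps on x86 to prefetcht0 (3), prefetcht1 (2), prefetcht2 (1) and prefetchnta (0):

```c
#include <stddef.h>

/* Sums an array while prefetching ahead. The hint can only change timing,
 * never the result, and prefetching past the end of the array is safe
 * because prefetch instructions never fault. */
double sum_with_hint(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* 2nd arg: 0 = read, 1 = write; 3rd arg: locality hint.
         * 3 -> prefetcht0, 2 -> prefetcht1, 1 -> prefetcht2, 0 -> prefetchnta */
        __builtin_prefetch(&a[i + 64], 0, 3);
        s += a[i];
    }
    return s;
}
```

Note that both the rw and locality arguments must be compile-time constants.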

An example of usage is this question, where software prefetching was used to avoid a cache-thrashing issue with some specific strides. This is a pathological case where the hardware prefetcher is not very useful.

Assuming I found a particular load that suffers from the miss in a specific cache level, where should I place prefetch?

This is clearly the most tricky part. You should prefetch the cache lines sufficiently early so that the latency is significantly reduced; otherwise the instruction is useless and can actually be detrimental. Indeed, the instruction takes some space in the program, needs to be decoded, and uses load ports that could otherwise execute other (more critical) load instructions, for example. However, if it is issued too early, the cache line can be evicted again and need to be reloaded...

The usual solution is to write code like this:

const size_t magic_distance_guess = 200;
for (int i = 0; i < n; i++) {
   // some code
   __builtin_prefetch(&a[i + magic_distance_guess]);
   double x = a[i];
   // some code
}

where magic_distance_guess is a value generally set based on benchmarks (or on a very deep understanding of the target platform, though practice often shows that even highly skilled developers fail to find the best value).

The thing is, the latency is very dependent on where the data is coming from and on the target platform. In most cases, developers cannot really know exactly when to do the prefetching unless they work on a single given target platform. This makes software prefetching tricky to use and often detrimental when the target platform changes (one has to consider the maintainability of the code and the overhead of the instruction). Not to mention that the built-ins are compiler-dependent, prefetching intrinsics are architecture-dependent, and there is no standard, portable way to use software prefetching.
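
One way to at least keep the code compiling everywhere is to hide the built-in behind a macro that vanishes on compilers without it. A hedged sketch (PREFETCH_READ is an invented name, not a standard API):

```c
#include <stddef.h>

/* The hint disappears on compilers without __builtin_prefetch,
 * so the code stays portable even if the optimization does not. */
#if defined(__GNUC__) || defined(__clang__)
#  define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#  define PREFETCH_READ(p) ((void)0)
#endif

double sum_prefetched(const double *a, size_t n, size_t dist)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        PREFETCH_READ(&a[i + dist]); /* dist must be tuned by benchmarking */
        s += a[i];
    }
    return s;
}
```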

Do I need to worry about unrolling the loop to make sure that I am prefetching only on cache line boundaries, or will it be almost free, like a nop, if the data is already in cache?

Yes, prefetch instructions are not free, so it is better to use only one instruction per cache line (other prefetch instructions targeting the same cache line are useless).
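
A hedged sketch of such unrolling (with 8-byte doubles and 64-byte cache lines, unrolling by 8 issues one prefetch per line instead of one per element; the distance of 64 elements, i.e. 8 lines ahead, is a made-up starting point that must be tuned):

```c
#include <stddef.h>

double sum_unrolled(const double *a, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __builtin_prefetch(&a[i + 64]); /* one prefetch per 64-byte line */
        for (size_t j = 0; j < 8; j++)
            s += a[i + j];
    }
    for (; i < n; i++) /* scalar tail */
        s += a[i];
    return s;
}
```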

Is it worth using multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?

This is very dependent on the target platform. Modern mainstream x86-64 processors execute instructions out of order and in parallel, and they have a pretty huge window of instructions to analyse. They tend to execute loads as soon as possible so as to avoid misses, and they are often very good at this job.

In your example loop, I expect the hardware prefetcher to do a very good job, and using software prefetching should be slower on a (relatively recent) mainstream processor.


Software prefetching was useful a decade ago, when hardware prefetchers were not very smart, but they tend to be very good nowadays. Additionally, it is often better to guide the hardware prefetchers than to use software prefetch instructions, since the former have a lower overhead. This is why software prefetching is discouraged (e.g. by Intel and most developers) unless you really know what you are doing.

Jérôme Richard
  • HW prefetchers *can* be disabled via MSRs (for the whole core). Usually only useful for microbenchmark experiments, although perhaps some very specific workloads benefit from selectively disabling some of them. (e.g. possibly the L2 spatial prefetcher if most accesses don't have spatial locality outside the single cache line. Especially if it's causing destructive interference between independent pairs of threads sharing adjacent cache lines.) – Peter Cordes May 15 '22 at 03:11
  • SW prefetch is usually not terrible if you do it carefully, and can actually be a bit faster if you have the prefetch distance tuned right for the machine you're running on, with its current load conditions. I definitely don't recommend it for looping over a contiguous array, but with an unrolled loop so you're only executing one prefetch per line it's not much overhead. And I've seen a few examples of people claiming a minor speedup like 10% IIRC on modern CPUs like I think Zen, maybe also some Intel. – Peter Cordes May 15 '22 at 03:16
  • But yeah, the key point here is that SW prefetch is almost never the right answer to lots of cache misses looping over an array. It can at best speed things up a few percent by helping the HW prefetchers start in the next page at page boundaries, or not at all. Or at worst, turns many misses into hits by slowing down your loop enough to not ask for data before it arrives. **Much much better is cache-blocking** if you can manage it. See [What Every Programmer Should Know About Memory?](https://stackoverflow.com/q/8126311) – Peter Cordes May 15 '22 at 03:26
  • @PeterCordes Indeed for disabling the prefetchers, though it requires root privileges. Good to know. I removed this part in the answer. For SW prefetch, my experience with it is not great. Most of the time it resulted in no performance change or even a slow-down. The very rare case where it was faster (only with NTA so far) required a very fine tuning of the code which was in the end not portable anymore... Not to mention the performance improvement is generally very small (except in very rare pathological cases like the provided one). – Jérôme Richard May 15 '22 at 14:23
  • Yeah, not something I've tried to play with, since tuning is brittle. But HW prefetchers aren't as magical as we might wish. e.g. they stop at the end of a 4k region of physical addresses. (L2 never even sees virtual addresses, and contiguous virtual might not be contiguous physical). The "next page prefetcher" Intel bragged about being new in IvyBridge is I think just a TLB prefetch, not triggering prefetch of the data. So out-of-order exec of loads early enough is still important to get stuff started across a 4k boundary. Fortunately with AVX doing lots per uop, CPU sees far. – Peter Cordes May 15 '22 at 14:28
  • 1
    @PeterCordes When I saw that Intel developers themselves encourage people not to use SW prefetches but instead to tune the code to work with the HW prefetcher, I mostly stopped trying to use SW prefetching. A good example was for the 2D matrix transposition which indeed benefit a lot from cache-blocking (and register tiling). I also tried to check if the use of SW prefetching in libraries like Numpy and some research projects was justified, but on my (i5-9600KF & older ivybridge) machines I saw no performance decrease in removing them in benchmark made so far. – Jérôme Richard May 15 '22 at 14:32
  • I've seen GCC use SW prefetch with an AMD tuning, I think, but not Intel. Maybe only Bulldozer-family, though, not Zen. Current GCC 12.1 `-march=bdver3` will still use SW `prefetcnta` 400 bytes ahead (https://godbolt.org/z/4475Mf7f7), but I don't know how justified that is. `-march=znver2` doesn't try to SW prefetch. So yeah, maybe just older AMD that didn't have smart enough SW prefetch. – Peter Cordes May 15 '22 at 14:46

How to use software prefetch systematically?

The quick answer is: don't.

As you correctly analyzed, prefetching is a tricky and advanced optimisation technique that is not portable and rarely useful.

You can use profiling to determine which sections of code form a bottleneck, and use specialized tools such as Valgrind's cachegrind to try to identify cache misses that could potentially be avoided using compiler builtins.
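
A hedged example of such a run (cachegrind simulates the cache hierarchy, so it is slow but needs no hardware counters; the output file name contains the process id):

```shell
valgrind --tool=cachegrind ./your_program
cg_annotate cachegrind.out.<pid>   # per-function/per-line miss counts
```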

Don't expect too much from this, but do profile the code to concentrate your optimizing efforts where they can be useful.

Remember also that a better algorithm can beat an optimized implementation of a less efficient one for large datasets.

chqrlie
  • It's not always a better algorithm. I've seen dramatic performance gains by inverting the nesting order of 2D array index loops. – meaning-matters May 14 '22 at 22:14
  • 1
    @meaning-matters: Doing operations in a different order could still be considered an algorithmic change, even if it's motivated by cache access pattern. By contrast, just adding prefetching is purely a micro-optimization that doesn't need any work to prove correctness is maintained. Loop inversion does, even though in some cases it's something an optimizing compiler can do for you. (But don't count on it; most compilers that look for loop inversion only do it to "break" / "defeat" one of the SPECcpu2006 benchmarks, and may not find it in cases that aren't almost identical.) – Peter Cordes May 15 '22 at 03:33
  • @PeterCordes Loop-inversion, among other optimisations, is not what people generally mean when they say that one should choose the most optimal algorithm. Your comment could therefore be seen as a micro-optimisation of my statement ;-) – meaning-matters May 15 '22 at 05:55