13

What is the cost of a late prefetch done with a __builtin_prefetch(..., 1) intrinsic (a prefetch in preparation for a write)? That is, a prefetch whose line does not arrive in the L1 cache before the demand load or write that requires it?

For example

#include <cstdint>  // std::uint8_t
#include <new>      // std::hardware_constructive_interference_size

void foo(std::uint8_t* line) {
    __builtin_prefetch(line + std::hardware_constructive_interference_size, 1);
    auto next_line = calculate_address_of_next_line(line);
    auto result = transform(line);
    write(next_line, result);
}

In this case, if the cost of transform is lower than the cost of the prefetch, will this code end up being less efficient than if there were no prefetch? The Wikipedia article on cache prefetching talks about an optimal stride for a for loop, but does not mention the impact of a sub-optimal prefetch in that scenario (e.g. what would happen if k were too low?).

Does this get pipelined enough that a suboptimal prefetch does not matter? For the purposes of this question I am only considering Intel x86 (processors from around the Broadwell era).
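
For concreteness, here is a minimal sketch of the strided loop the Wikipedia article describes (the `transform_byte` helper and the exact prefetch arguments are illustrative, not from the article):

#include <cstdint>
#include <cstddef>

// Placeholder for the per-element work.
inline std::uint8_t transform_byte(std::uint8_t b) { return b ^ 0xFF; }

// Wikipedia-style strided prefetch loop: k is the prefetch distance in
// bytes. If k is too low, the prefetch arrives late and only part of the
// miss latency is hidden; if k is too high, the line may be evicted again
// before the demand access reaches it.
void process(std::uint8_t* data, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < n; ++i) {
        // Prefetch once per 64-byte line, k bytes ahead, with write intent.
        if (i % 64 == 0 && i + k < n)
            __builtin_prefetch(data + i + k, /*rw=*/1, /*locality=*/3);
        data[i] = transform_byte(data[i]);
    }
}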

Curious
  • 1
    Too low is better than too high. If your prefetch distance is too low, some of your loads will be cache misses that hit the fill buffer allocated for the PF request. Use `perf stat -e load_hit_pre.sw_pf` to count this on Intel CPUs. *Demand load dispatches that hit L1D fill buffer (FB) allocated for software prefetch*. SW PF for the very next line is just going to waste front-end throughput and load-port throughput running PF instructions. Usually HW prefetch does a good job for sequential access, but SW PF can help avoid slowdowns at 4k boundaries. – Peter Cordes Feb 22 '19 at 07:30
  • 2
    See also [What Every Programmer Should Know About Memory?](//stackoverflow.com/a/47714514) for a bit more about (not) doing SW prefetch, and that tuning prefetch distance is "brittle" and HW-dependent. – Peter Cordes Feb 22 '19 at 07:32
  • *but SW PF can help avoid slowdowns at 4k boundaries* - what does this mean? – Curious Feb 22 '19 at 07:38
  • 3
    HW prefetch (some levels of it, at least) typically doesn't cross 4k boundaries, because contiguous virtual pages might not be mapped to contiguous physical pages. There are some example questions and/or answers on SO about getting speedups in microbenchmarks with SW prefetch. – Peter Cordes Feb 22 '19 at 07:39
  • *SW PF for the very next line is just going to waste front-end throughput and load-port throughput running PF instructions* - also this please. And one thing that I did not specify in the question is - the next cacheline is guaranteed to be a miss. Do you see a prefetch helping in this case? – Curious Feb 22 '19 at 07:40
  • `prefetcht1` is an instruction that has to issue and execute, and takes up a spot in the ROB. If the CPU is running that, it's taking execution resources that could be doing real work. So you should remove SW prefetch unless it actually gives a measurable positive benefit in your real full code, with real conditions. And depending on the use-case, measured on your real production server. – Peter Cordes Feb 22 '19 at 07:42
  • @PeterCordes I see, does the fact that the cacheline is guaranteed to be missed make a difference in the recommendation against prefetching? Or is this just a "benchmarks are the only way to tell" case? – Curious Feb 22 '19 at 07:44
  • How is it guaranteed to be missed? Why can't hardware prefetch find the pattern? But you know the address tens to hundreds of cycles ahead of when the data is needed? SW PF a couple instructions before a demand load is completely useless, especially with OoO exec. Interesting use cases for SW PF include binary search, where you can PF the 1/4 and 3/4 elements before fetching the middle element, so you keep ~2 to 3 loads in flight and maybe cut the effective latency in half vs. a pure data-dependency chain (a sketch of this appears after these comments). (But branchy code with correctly-predicted branches can give a similar effect of not waiting.) – Peter Cordes Feb 22 '19 at 07:52
  • Hmmm, what do you mean by a demand load? – Curious Feb 22 '19 at 07:53
  • 1
    I was about to say "google it", but I tried and the results are surprisingly bad, even if you add "computer architecture" to it. A demand load is a load where the CPU needs the result. Not a speculative load or prefetch. A demand load is a normal architecturally-visible load, like `mov eax, [rdi]` or the memory-source part of `imul eax, [rdi], 12345`. – Peter Cordes Feb 22 '19 at 07:56
  • 1
    You should clarify your example. Does `prefetch_next_line` do any prefetching or does it just calculate the address of the next line to prefetch? In the first case, what is the value it is returning? In the second case, which part of the code is doing the prefetching? I presume by a "missed" prefetch you mean a useless prefetch, but I don't see useless prefetches in your example. Also are you asking about the cost of useless software prefetches *in general* or in your example specifically? – Hadi Brais Feb 22 '19 at 21:37
  • @HadiBrais Edited the question. By missed prefetch, what I meant is that we do a prefetch, and it doesn't get committed (as in the line does not get loaded in time) for the subsequent store – Curious Feb 22 '19 at 22:11
  • So you mean that the prefetched line will be used by an instruction that retires, it is just that the demand access will still miss the line because the prefetch request was issued too late, right? Not that the prefetched line is useless (i.e., not used by a retired instruction)? – Hadi Brais Feb 22 '19 at 22:16
  • 1
    @Curious: terminology: "committed" isn't the right word here. There isn't really a single word that unambiguously says what you mean, but it's definitely not "commit". Commit is sometimes a synonym for "retire", and a prefetch instruction can do that as soon as the prefetch request is *sent* into the memory hierarchy. SW prefetch has no effect on the *architectural* state, only the microarchitectural state, so there's nothing to commit. (e.g. a store will commit to L1d when it leaves the store buffer. A load on an in-order CPU commits after reading L1d and writing back to an architectural register.) – Peter Cordes Feb 23 '19 at 07:01
  • @PeterCordes Makes sense, unfortunately I cannot edit the comment anymore... – Curious Feb 23 '19 at 07:03
  • A key point is that SW PF is just a hint, and can be silently dropped at any level of the memory hierarchy if there's too much traffic already, e.g. if all buffers for outstanding requests from L2 are already allocated for demand loads. (I know you can't edit old comments, but anyway, I might have said "complete", but you'd still need to explain that you were talking about data arrival if there's room for ambiguity.) – Peter Cordes Feb 23 '19 at 07:05
  • write prefetch is very different. First there's a store buffer that decouples stores in the out-of-order core from committing them to cache. Second, regular prefetch only gets the line in MESI Shared state, and an RFO is still needed before you can get it in Exclusive state. But recent x86 does have write prefetch: [What is the effect of second argument in \_builtin\_prefetch()?](//stackoverflow.com/q/40513280) – Peter Cordes Feb 23 '19 at 18:19
  • @PeterCordes I don’t fully understand what you said :/ – Curious Feb 23 '19 at 19:15
  • Your edit added a "or write" to the question. Now you're asking about prefetching for write, not just read. I shouldn't have said "very" different, especially if you *are* going to read the line before writing. (i.e. modify in place). That's basically the same as a read prefetch, but a `prefetchw` will RFO in the first place instead of first getting it shared and then having to RFO for write access. – Peter Cordes Feb 23 '19 at 19:21
  • @PeterCordes my question got edited to say that I was asking about a prefetch for read. But in fact i was asking about a prefetch for a write not a read (as the code said) so just clarified that. Don’t mind also asking the same question for reads though. I will not be reading that particular line before writing. Just writing after a read on the previous line. – Curious Feb 23 '19 at 19:29
  • Is this not in a loop? Then I'm not sure you can usefully prefetch. Well maybe; I think some experiments have shown that the store buffer on Skylake probably won't initiate an RFO until after a store instruction has retired, and that can't happen until after the read + processing. But unless there's some bottleneck from having the store buffer wait longer to commit, that could be ok. – Peter Cordes Feb 23 '19 at 19:33
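
To make the binary-search prefetch idea from these comments concrete, here is a sketch (an adaptation of the well-known branchless lower-bound pattern; the function name and exact prefetch targets are illustrative, not from any comment):

#include <cstddef>

// Returns the first element >= x (or the last element if none exists).
// Before probing the current midpoint, prefetch the two possible *next*
// midpoints (the 1/4 and 3/4 points), keeping ~2-3 loads in flight
// instead of a pure serial dependency chain.
int prefetching_lower_bound(const int* base, std::ptrdiff_t n, int x) {
    std::ptrdiff_t len = n;
    while (len > 1) {
        std::ptrdiff_t half = len / 2;
        len -= half;
        // These may point one element below the array on the last
        // iteration; a prefetch of a bad address never faults.
        __builtin_prefetch(&base[len / 2 - 1]);         // next midpoint, left half
        __builtin_prefetch(&base[half + len / 2 - 1]);  // next midpoint, right half
        if (base[half - 1] < x)
            base += half;
    }
    return *base;
}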

1 Answer

12

Let's call the type of prefetch you are referring to a late prefetch: one where the prefetch does not occur sufficiently far before the demand load or store that uses the same cache line to fully hide the latency of the cache miss. This is as opposed to a too-early prefetch, where the prefetch happens so far ahead of the demand access that the line is evicted from at least some levels of the cache before the access occurs.

Compared to not doing the prefetch at all, the cost of such a late prefetch is likely very small, zero or negative.

Let's focus on the negative part: i.e., the scenario where the prefetch helps even though it is late. If I understand your question correctly, you consider a prefetch that doesn't arrive before the load that needs it "missed" or ineffective. That is not the case, however: as soon as the prefetch request starts, the clock starts ticking toward the completion of the memory access, and that work is not lost if the demand load occurs before it completes. For example, if your memory access takes 100 ns, but the demand access occurs only 20 ns after the prefetch, the prefetch is "too late" in the sense that the full 100 ns latency wasn't hidden, but the 20 ns spent on the prefetch is still useful: it reduced the demand access latency to about 80 ns.

That is, late prefetch isn't a binary condition: it ranges from just a little bit late (e.g., a prefetch issued 90 ns before an access with a latency of 100 ns) to very late (almost immediately before the consuming access). In most scenarios even a fairly late prefetch probably helps, assuming memory latency was a bottleneck for your algorithm in the first place.
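
A rough way to model this (an approximation that ignores overlap with surrounding work and queuing effects): if the full miss latency is T_miss and the prefetch is issued T_lead before the demand access, then

observed_latency ≈ max(0, T_miss - T_lead)

The example above is just max(0, 100 ns - 20 ns) = 80 ns.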

Costs

Let's now consider the case of a totally useless prefetch (i.e., one issued immediately before the access, so the access could have been issued in its place had the prefetch not existed) - what is the cost? In most realistic scenarios the costs are probably very small: an extra instruction to handle, some small additional pressure on the AGUs, and perhaps a small amount of wasted effort when matching up the subsequent access with the in-flight prefetch².

Since the assumption is that prefetching is employed because of misses to the outer levels of the cache or to DRAM, and that the work in the transform function is significant enough to hide some of the latency, the relative cost of this one additional instruction is likely to be very small.

Of course, this is all under the assumption that the additional prefetch is a single instruction. In some cases, you may have had to organize your code somewhat to allow prefetching or perform some duplicate calculations to allow prefetching at the appropriate place. In that case, the cost side could be correspondingly higher.

M and E States

Finally, there is an additional behavior with respect to write accesses and prefetch with write intent, which means that in some scenarios even a totally useless prefetch (i.e., one issued immediately before the first access) is useful: namely, when the first access is a read.

If a given line is first read and only later written, the core may get the line in the E(xclusive) coherence state on the read, and then need to make another roundtrip to some level of the cache on the first write to get it into the M(odified) state. Using a prefetch with write intent before the first access avoids this second roundtrip, because the line is brought in with the M state the first time. The effect of this optimization is tough to quantify in general, not least because writes are usually buffered and don't form part of a dependence chain (outside of store forwarding).
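
As a minimal sketch of that read-then-write pattern (the function and its arithmetic are illustrative; whether rw = 1 actually emits PREFETCHW depends on the compilation target, roughly Broadwell and newer on Intel):

#include <cstdint>

// Prefetch with write intent on a line that is read first and written
// later, so ownership is requested up front and the eventual store does
// not need a second coherence roundtrip to upgrade the line for writing.
void read_modify_write(std::uint32_t* p) {
    __builtin_prefetch(p, /*rw=*/1);
    // ... independent work here gives the prefetch time to complete ...
    std::uint32_t v = *p;  // read: the line was requested for ownership
    *p = v * 2 + 1;        // write: no extra roundtrip needed
}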


² I use the deliberately vague term "wasted effort" here because it isn't really clear if this has a performance or power cost, or is just some additional work that doesn't add to the operation latency. One possible cost is that a load that triggers the initial L1 miss has a special status and can receive its result without making another roundtrip to L1. In the scenario of a prefetch followed immediately by a load, the load presumably doesn't get the special status, which may slightly increase the cost. However, this question is about stores, not loads.

BeeOnRope
  • Thanks for the informative answer! I have a few followups - *Compared to not doing the prefetch at all, the cost of such a late prefetch is likely very small, zero or negative.* - if this is the case in general, then what is the harm in *always* doing a prefetch when you know the data access patterns for something like a std::transform algorithm on a container? When the container is something like a std::list, access patterns will likely not be determined by the processor. As far as I am aware, no standard library implementation uses a prefetch here. – Curious Feb 23 '19 at 07:00
  • *However, this question is about loads not stores.* - The question was meant to be about a load to a cacheline followed a store to an adjacent cacheline. The transform function reads data, transforms it and then writes it to the next cacheline. Apologies if that was not clear in the question, I edited the name of the function to "write" to convey the intent better – Curious Feb 23 '19 at 07:01
  • *The effect of this optimization is tough to quantify in general, not least because writes are usually buffered and don't form part of a dependence chain (outside of store forwarding).* - I didn't quite understand most of this. Could you elaborate a little? In my case, I will read an adjacent cacheline, and then write to the one after that without reading from it. – Curious Feb 23 '19 at 07:02
  • 1
    @Curious - regarding your first question ("what's the harm?"), you have to look at the condition: the cost of a *late prefetch* is small, zero or negative: because when you assume it's a late prefetch, you already know you are taking a cache miss (slow) and that the subsequent access hit the incoming line, so you probably got some benefit and the worst case scenario is not bad. However, that is not the only possible outcome of a prefetch instruction! You could also have a too early prefetch, which is much worse since you double the number of accesses to outer levels of the memory hierarchy. – BeeOnRope Feb 23 '19 at 14:37
  • More likely, you could also have a pointless prefetch in the sense that the line is already in the cache: since "caches work" most programs will have L1 hit rates above 90%, and sometimes close to 100%, and in cases where the data is already in L1, the prefetch is pointless. The costs are still small in an absolute sense, but the "denominator" is now also small, because the access is fast, so the actual relative cost may be large. For example, maybe a prefetch instruction costs you 1 cycle due to a front-end effect: in the case the access misses, and you take a 200 cycle cache miss, ... – BeeOnRope Feb 23 '19 at 14:39
  • 1
    ... this is "very small", relatively. In the case where you are in a tight loop which already only takes 2 cycles however, that's a 50% penalty and is no longer "very small" relatively speaking. Other reasons prefetches are not inserted: compilers are not good at predicting that type of stuff, compilers already move loads early if they can meaning there is often no "earlier" place to put a prefetch, out of order processing already moves loads before computational work, not all CPUs support all prefetch instructions and some behave badly when you prefetch invalid locations, etc. – BeeOnRope Feb 23 '19 at 14:42
  • Regarding "However, this question is about loads not stores." - I wrote that backwards, I knew it was about stores not loads, but reversed that sentence. BTW, it's better to provide a more fleshed out example, since now we are left guessing what the not-shown functions to: e.g., you ask about a use of __builtin_prefetch, but the code in the question doesn't even even have a __builtin_prefetch call! – BeeOnRope Feb 23 '19 at 14:44
  • 2
    @Curious - about stores, what I mean is that loads have a register input (the address) and a register output (the loaded value), so in that sense they are like most ALU instructions, which have input(s) and an output. So it is valid to talk about the latency of a load: if you have a series of loads where input of the N+1th load depends on the output of the Nth (aka pointer chasing), you'll measure the load latency because each load cannot start until the previous one completes. Similarly, any other instructions dependent on the load can't start until it is complete. – BeeOnRope Feb 23 '19 at 14:47
  • 2
    So loads that miss slow down all dependent instructions. Stores are not like that: they have two inputs (data to store, address to store), but no register output at all. So no later instructions "depend" on a store in the register dependency sense, and you can't build the equivalent of the pointer chasing test above for stores, and even talking about "store latency" is mostly ill defined. Stores are in a sense: fire and forget: they go into a buffer and can take a long time to get to DRAM (perhaps never if they remain in the cache until you turn off your host), but you don't see that impact. – BeeOnRope Feb 23 '19 at 14:49
  • So I'm saying that PF for stores could avoid a second roundtrip to change the E to M state, but unlike a load it is hard to say how much that helps because stores are sitting in a buffer anyways, so you can't point at a 10 cycle improvement in the time it takes the store to commit and say it will improve this specific part of the program by 10 cycles. – BeeOnRope Feb 23 '19 at 14:51
  • Regarding the last part - stores being fire-and-forget - does that mean it is pointless for me to start a prefetch ahead of time in anticipation of a store? If what I understand is right, it seems like a store can be perfectly parallelized, and will most likely not have an effect on the rest of the program that comes after the store? – Curious Feb 23 '19 at 18:09
  • Assuming that the thread doing the store will never bother to read the stored value – Curious Feb 23 '19 at 18:16
  • 2
    @Curious - it's not pointless, it's just a lot less clear how it works compared to loads. Loads are fairly "simple": if you start a load x cycles early you probably approximately shave x cycles off of the observed load latency which can have a predictable effect on performance (even this is more complicated than it sounds to analyze due to out-of-order and other effects, but still, that's the basic idea). Stores are tougher because the buffer already serves as a prefetch mechanism: the stores enter the store buffer, and "some time later" they finally commit to L1 (this can even be after ... – BeeOnRope Feb 23 '19 at 20:44
  • ... the store instruction has retired). So when there are several stores in the buffer, the CPU knows what lines will be needed in L1 (since the line has to be read into L1 before a store can commit to that line) in the near future, so it can prefetch them - this is the so-called "RFO prefetch" you can read about in very few places. It's different than normal access pattern based prefetching, since it's not making a prediction about what lines are needed, it actually _knows_. Since the store buffer is fairly large and this mechanism exists, SW prefetching for stores maybe doesn't have the ... – BeeOnRope Feb 23 '19 at 20:46
  • ... impact it might otherwise. However, it could still be useful: for one thing, the RFO prefetch may not be as aggressive as you'd want: it may only prefetch say the next 3 upcoming stores, but you may know it is better to prefetch more. You can also do a store prefetch even before the store instruction, to get a head start (sketched in code after these comments) - but whether it helps at all basically depends on a complicated analysis of, say, whether you are ever store-buffer bound: i.e., whether the store buffer fills up because it can't commit to L1 fast enough. – BeeOnRope Feb 23 '19 at 20:48
  • 1
    I have found store prefetches useful in [cases like this](https://stackoverflow.com/questions/47851120/unexpectedly-poor-and-weirdly-bimodal-performance-for-store-loop-on-intel-skylak) where bringing the line into the L1 with a prefetch solves a performance problem whose cause isn't fully understood, but is related to the way stores are committed from the store buffer. – BeeOnRope Feb 23 '19 at 20:49
  • Hmmm, one more followup - what if the thread doing the write is never going to read from that location after doing the write? As in - what if the write is meant for another thread? Is there no point in doing the prefetch then? – Curious Feb 23 '19 at 20:56
  • I don't think it changes the analysis much: in either case the point is to prefetch the stored location to reduce the amount of time the store has to sit in the store buffer. What happens after that (read by another thread, etc) doesn't matter as much. If the writing thread reads from the store, that happens via store-forwarding, at least until the store leaves the store buffer, and that's largely independent of prefetching. *However*, if another thread might read unpredictably from the location, i.e., it is periodically polling the location, then a prefetch might be ill-advised. – BeeOnRope Feb 23 '19 at 21:39
  • What can happen is you prefetch the location, but then before you do the actual write the other thread reads it, stealing the line back or putting it in S state, so when you do the write you have to do the RFO transaction all over again. In the case of contention on a line like that, you want to be very conservative with your PFs because they can have a significant penalty otherwise. This is true of the hardware "RFO prefetch" as well - it may have predictive mechanisms to avoid this if it sees a line often being stolen away before the store commits. – BeeOnRope Feb 23 '19 at 21:41
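
To make the "store prefetch ahead of the stores" idea from the last few comments concrete, here is a sketch (the fill function and the 4-line prefetch distance are made up for illustration and would need tuning; as noted above, this can backfire if another thread polls these lines):

#include <cstdint>
#include <cstddef>

// Prefetch destination lines with write intent a few cache lines ahead
// of the stores, so the store buffer can drain without waiting on RFOs.
void fill(std::uint8_t* dst, std::size_t n, std::uint8_t value) {
    constexpr std::size_t kLine = 64;        // x86 cache line size
    constexpr std::size_t kDist = 4 * kLine; // made-up prefetch distance
    for (std::size_t i = 0; i < n; ++i) {
        if (i % kLine == 0 && i + kDist < n)
            __builtin_prefetch(dst + i + kDist, /*rw=*/1);
        dst[i] = value;
    }
}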