
I looked at the Wikipedia article on the branch target predictor; it's somewhat confusing:

I thought the branch target predictor comes into play when a CPU decides which instruction(s) to fetch next (into the CPU pipeline to execute).

But the article mentions some points like this:

Instruction cache fetches block of instructions

Instructions in block are scanned to identify branches

So, does the instruction cache (== L1i, I imagine) (pre)fetch instructions based on some branch target prediction data?

Or is it just that the article implies something other than x86... or do I misunderstand something?

deshalder
  • Related: [Slow jmp-instruction](https://stackoverflow.com/q/38811901) re: front end effects, and the fact that branch prediction for the fetch stage needs to happen before the decoders have finished (or started) looking at the previous fetch block to see if there were any branches in it. The branch predictor needs to make a prediction every cycle for what block to fetch next, or stall. (A good guess is to predict the block after the current one, i.e. no taken branches, or none outside this block.) – Peter Cordes Jun 15 '22 at 10:29
  • But L1i hardware prefetch from L2 is a separate thing; it might request a line ahead of a code fetch (triggered by branch prediction). L1d cache of course has a hardware prefetcher watching access patterns, but it's possible L1i just waits for a demand miss (from code fetch using addresses generated by branch prediction, so it's speculative and can't fault on a bad access until previous speculation has been confirmed). – Peter Cordes Jun 15 '22 at 10:32

2 Answers


On the Itanium (Intel, but not x86) there was L1i prefetch, and in fact there were L1I_PREFETCH_MISS_RATIO, L1I_PREFETCHES, L2_INST_PREFETCHES, ... performance monitoring events. However, I'm not seeing any L1I prefetch events for Haswell or Skylake; ITLB, yes, but not L1I. If there were L1I prefetching going on, there would be performance monitoring events measuring it for tools like VTune.

You didn't ask about a particular microarchitecture, but I think the lack of performance monitoring events for Haswell and Skylake strongly implies that there is no I-cache prefetching on Intel x86_64 CPUs in general: only what's actually triggered by the fetch stage, using addresses generated by branch prediction.

There is significant buffering between fetch and decode in recent x86 CPUs, and between decode and rename/allocate into the back end (see David Kanter's Haswell write-up and the Skylake article on WikiChip). So the fetch stage, and the front end in general, runs far enough ahead of execution to serve a purpose similar to the L1d HW prefetchers for load/store data, but driven by branch prediction instead of sequential access patterns.

Much of the hardware prefetch logic in Intel CPUs is in the L2 cache, which is unified code/data. (And it does look for sequential access patterns). L2 hit latency is low enough not to be a big deal, given the buffering in the pipeline.
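To make the front end's reliance on branch prediction concrete, the experiment from the linked [Slow jmp-instruction](https://stackoverflow.com/q/38811901) question can be approximated with a sketch like this (illustrative only; it assumes x86-64 Linux and a kernel that allows a writable+executable mapping). `EB 00` encodes `jmp rel8` with offset 0, so every two bytes of code is a taken branch; growing `NJMPS` should eventually raise the time per jump, once the branch count outgrows what the predictor/BTB can track:

```c
#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define NJMPS (1 << 20)   /* arbitrary; vary this and watch the per-jmp time */

int main(void) {
    size_t len = 2 * (size_t)NJMPS + 1;
    /* Writable + executable mapping; hardened systems may refuse this. */
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    for (size_t i = 0; i < 2 * (size_t)NJMPS; i += 2) {
        buf[i]     = 0xEB;   /* jmp rel8 */
        buf[i + 1] = 0x00;   /* offset 0: "jumps" to the next instruction */
    }
    buf[len - 1] = 0xC3;     /* ret */

    void (*fn)(void) = (void (*)(void))buf;
    fn();                    /* warm-up pass */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.3f ns per taken jmp\n", ns / NJMPS);
    return 0;
}
```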

Olsonist
    I ended up making a bigger edit than I was planning at the start. The stuff I added is more of an explanation of why it's not a problem for code-fetch to only do demand loads (driven by branch prediction), making the branch predictor effectively an L1i prefetcher as well as pulling data into the pipeline. (Hmm, I'm not sure how much memory-level parallelism there is, in outstanding requests between L1i and L2. Does that use LFBs, or would it have its own buffers for incoming lines of code?) – Peter Cordes Jun 15 '22 at 20:16

All modern processors are pipelined, meaning that, under ideal circumstances, they can fetch (at least) one instruction per cycle. It also means that for a pipeline of depth N there can be N instructions in flight that haven't finished yet. Thus, when the processor fetches a branch instruction, the preceding instructions may not have finished, so it isn't yet known whether the branch should be taken. Waiting for those instructions to finish ("stalling") would incur a large performance penalty. Instead, the processor guesses, based on data it has collected about that branch, whether it will be taken. It can then begin executing the next instruction immediately without stalling, which saves a lot of time if most of the guesses are right.
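As a rough illustration of how much those guesses matter, here is a sketch (sizes and names are arbitrary; depending on compiler and flags the branch may be turned into a branchless conditional move, hiding the effect, so something like `gcc -O1` is a safer choice than `-O3`):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    (1 << 20)
#define REPS 20

/* The loop whose branch the predictor has to guess. */
static long run(const unsigned char *cond) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        if (cond[i])
            sum += i;
    return sum;
}

static double time_it(const unsigned char *cond, long *sink) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        *sink += run(cond);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    static unsigned char easy[N], hard[N];
    for (int i = 0; i < N; i++) {
        easy[i] = i & 1;        /* alternating pattern: trivially predictable */
        hard[i] = rand() & 1;   /* coin flips: ~50% of guesses will be wrong */
    }
    long sink = 0;
    printf("predictable: %.0f ns\n", time_it(easy, &sink));
    printf("random:      %.0f ns\n", time_it(hard, &sink));
    printf("(checksum %ld)\n", sink);  /* keep the work from being optimized out */
    return 0;
}
```

On typical hardware the random version is several times slower, even though it does exactly the same amount of arithmetic.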

However, if the processor guesses that the branch is taken, it must also know from where the next instruction should be fetched. This information is often encoded in the branch instruction itself, but since the instruction decode stage is one or more cycles after the instruction fetch stage, it is not immediately available when the processor needs it. Therefore, the processor records branch instructions' target addresses in a special data structure called the branch target buffer (BTB).
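A BTB can be pictured as a small cache indexed by the branch's own address: look up the *current* fetch address, and on a hit you get a predicted next fetch address in the same cycle, without waiting for decode. A toy direct-mapped model (a simplification for illustration; the size, index function, and fixed 4-byte fall-through are made up, not any real CPU's design):

```c
#include <stdint.h>
#include <stdio.h>

#define BTB_ENTRIES 512   /* arbitrary; real BTBs are larger and associative */

struct btb_entry {
    uint64_t tag;      /* full branch address, to detect aliasing */
    uint64_t target;   /* predicted next fetch address */
    int      valid;
};

static struct btb_entry btb[BTB_ENTRIES];

/* Fetch-stage lookup: predicted target on a hit, else fall-through
 * (pc + 4, pretending a fixed-width ISA for simplicity). */
static uint64_t btb_predict(uint64_t pc) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    return (e->valid && e->tag == pc) ? e->target : pc + 4;
}

/* Called once the branch actually resolves, to train the BTB. */
static void btb_update(uint64_t pc, uint64_t actual_target) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->tag = pc;
    e->target = actual_target;
    e->valid = 1;
}

int main(void) {
    uint64_t branch_pc = 0x1000, target = 0x2000;   /* made-up addresses */
    printf("cold:    0x%llx\n", (unsigned long long)btb_predict(branch_pc));
    btb_update(branch_pc, target);    /* the branch resolved as taken */
    printf("trained: 0x%llx\n", (unsigned long long)btb_predict(branch_pc));
    return 0;
}
```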

This is called "speculative execution" since the processor guesses (speculates) about which instructions to execute. If the guesses are wrong, the speculated work has to be rolled back. Branch prediction and instruction prefetching are two parts of this system. It complements caching, but is not the same thing.
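The taken/not-taken guess itself is usually introduced with the textbook 2-bit saturating counter (states 0..3, where 2 and above means "predict taken"); real CPUs use far more sophisticated history-based predictors. A minimal sketch with a made-up loop-branch pattern:

```c
#include <stdio.h>

static unsigned counter = 2;   /* start at "weakly taken" */

static int predict_taken(void) { return counter >= 2; }

static void train(int taken) {
    if (taken  && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}

int main(void) {
    /* Model a loop branch: taken 7 times, then not taken once, repeated.
     * The 2-bit counter mispredicts only the final iteration of each loop. */
    int correct = 0, total = 0;
    for (int rep = 0; rep < 100; rep++) {
        for (int i = 0; i < 8; i++) {
            int taken = (i != 7);
            correct += (predict_taken() == taken);
            train(taken);
            total++;
        }
    }
    printf("accuracy: %d/%d\n", correct, total);   /* prints 700/800 */
    return 0;
}
```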

Björn Lindqvist
  • In the cycle after a branch instruction is first fetched, the next (block of) instruction(s) is being fetched in parallel with the branch being decoded. So the CPU doesn't yet know there was a branch at all, let alone a conditional one that might depend on earlier instructions. Even without speculative *execution*, instruction fetch itself is speculative if you don't wait for decode after every fetch, to see if it's a branch. – Peter Cordes May 25 '23 at 17:55
  • Correct branch prediction (for the block address to fetch next, based on the address of this fetch) is necessary to hide bubbles on taken branches. ([Slow jmp-instruction](https://stackoverflow.com/q/38811901) shows the slowdown when you have too many unconditional direct jumps for the BTB). A misprediction this early, before any wrong-path instructions get executed, only needs to "re-steer" the front-end, not roll-back anything in the back-end. – Peter Cordes May 25 '23 at 17:56
  • Both the branch target and prediction buffers are indexed by the branch instruction's address, allowing the next instruction's address to be resolved in the fetch stage even for branch instructions. Had it been otherwise, correctly predicted branches would also have incurred stalls. A mispredict costs about 15 stall cycles on i7 according to Hennessy & Patterson, 5th ed. – Björn Lindqvist May 25 '23 at 23:21