
For branch prediction, the BHT (branch history table) is indexed by the branch's virtual address. Aliasing occurs when two or more branches hash to the same BHT entry, and this conflict hurts prediction accuracy.

The BTB (branch target buffer) is part of the branch history info. So what is the BTB eviction scheme used in modern CPUs (compared to cache eviction)? Does a taken branch always take a BTB entry immediately, evicting an existing entry if there is no free one in the set?

Actually, I did some tests, as described below.

First, I defined two functions, each consisting of N taken forward jmps.

flush_btb:
   0x0000000000001439 <+0>:     endbr64
   0x000000000000143d <+4>:     push   %rbp
   0x000000000000143e <+5>:     mov    %rsp,%rbp
   0x0000000000001441 <+8>:     mov    %edi,-0x4(%rbp)
   0x0000000000001444 <+11>:    jmp    0x1446 <flush_btb+13>
   0x0000000000001446 <+13>:    jmp    0x1448 <flush_btb+15>
   0x0000000000001448 <+15>:    jmp    0x144a <flush_btb+17>
   0x000000000000144a <+17>:    jmp    0x144c <flush_btb+19>
   <repeat 32K times>

test_branches:
   0x00000000000115cb <+0>:     endbr64
   0x00000000000115cf <+4>:     push   %rbp
   0x00000000000115d0 <+5>:     mov    %rsp,%rbp
   0x00000000000115d3 <+8>:     mov    %edi,-0x4(%rbp)
   0x00000000000115d6 <+11>:    jmp    0x115d8 <test_branches+13>
   0x00000000000115d8 <+13>:    jmp    0x115da <test_branches+15>
   0x00000000000115da <+15>:    jmp    0x115dc <test_branches+17>
   0x00000000000115dc <+17>:    jmp    0x115de <test_branches+19>
   <repeat 4K times>

Then I ran the code below on an i7-7700 CPU (Kaby Lake):

test_branches() # Expect the branches to enter the BTB.
flush_btb() # Expect most of the previous BTB entries to be evicted.
test_branches() # Then measure the branch-miss event for this function.

I expected the branch-miss count to be high, but it isn't:

Branch instructions:  4106
Branch misses:        5 (0.12%)

It seems that, on this CPU, predicting these jmp instructions doesn't need BTB info.

I also tried on an i9-12900H CPU (Alder Lake). There, flush_btb does have an effect: the branch-miss count increases to a few hundred compared to the run without flush_btb.

As mentioned by @PeterCordes, BR_MISP_RETIRED.ALL_BRANCHES may not count non-indirect jmps. So I changed the 'jmp' instructions to 'jne'. The result is different: the branch-miss rate varies from 10% to 83%.

Branch instructions:  4107 
Branch misses:        419 (10.20%)

Branch instructions:  4107 
Branch misses:        2404 (58.53%)

Branch instructions:  4107 
Branch misses:        3216 (78.31%)

Branch instructions:  4107 
Branch misses:        471 (11.47%)

Branch instructions:  4107 
Branch misses:        3438 (83.71%)

If I remove the call to flush_btb(), the branch-miss rate still varies, from 6% to 64%.

Branch instructions:  4107 
Branch misses:        278 (6.77%)

Branch instructions:  4107 
Branch misses:        819 (19.94%)

Branch instructions:  4107 
Branch misses:        968 (23.57%)

Branch instructions:  4107 
Branch misses:        2631 (64.06%)

Branch instructions:  4108 
Branch misses:        1005 (24.46%)
  • You don't need branch prediction for an unconditional branch. The prediction is "always taken". – Tim Roberts Jun 26 '23 at 03:37
  • @TimRoberts: The front-end needs to know *where* to fetch next, before the block containing the jump instruction is even decoded. (At least the right block, not necessarily the exact address within a block, although that's only an extra 4 address bits out of 48, or 32 for relative branches; some low-power x86 CPUs are slower if you jump more than +-2GiB away. More usefully(?), the source address can be coarser.) See [Slow jmp-instruction](https://stackoverflow.com/q/38811901) for an example of a sequence of `jmp next_instruction` getting slower when you unroll enough to exceed BTB size. – Peter Cordes Jun 26 '23 at 03:56
  • @PeterCordes My flush_btb() tries to evict all BTB entries using 32K `jmp next_instruction` jumps, which I think should be enough, but the result is not as expected. – Changbin Du Jun 26 '23 at 04:29
  • The `branch-misses` event might only be counting conditional and indirect branches, not just resteers of the front-end after decode. Have a look at `baclears.any` and/or `int_misc.clear_resteer_cycles` – Peter Cordes Jun 26 '23 at 13:06
  • Would the indexing method (e.g., not modulo a power of two) make a difference? –  Jun 26 '23 at 15:01
  • On your Alder Lake, did you pin it to an E-core or a P-core? – Peter Cordes Jun 26 '23 at 15:21
  • @PeterCordes I pinned it to cpu0, which is a P-core. – Changbin Du Jun 26 '23 at 15:50
  • I changed the branch to JCC, and there's some difference. See updated question. – Changbin Du Jun 26 '23 at 16:03
  • Intel since Haswell and AMD since Zen 2 use IT-TAGE predictors. Indexing is based on branch history. (https://inria.hal.science/hal-01100647/document). Intel doesn't publish details about their branch predictors, and IDK if much has been reverse-engineered. For branch *direction* (taken or not), branches can just alias each other. But for target address IDK how that works. – Peter Cordes Jun 26 '23 at 17:07

0 Answers