
I'm trying to show that stalls due to branch misprediction can be reduced by a certain optimization. My intuition is that the optimization reduces the stall cycles from loads that delay the branch outcome.

For this, I was planning to use the Linux perf utility to read the hardware performance counter values. There is a related metric called branch-load-misses; however, no useful description of it is provided.

Can anybody please confirm if this is the right metric to use? If not, please suggest a related metric that could be of help.

Thank you

3 Answers


Linux perf has the branches and branch-misses counters. On Intel x86 these map to BR_INST_RETIRED.ALL_BRANCHES and BR_MISP_RETIRED.ALL_BRANCHES, which count all retired branches and all retired mispredicted branches, respectively.

perf list also includes branch-loads and branch-load-misses, but gives no explanation of what they do. Weirdly, the kernel sources only reference them in the context of PowerPC [1]. On x86, they appear to be simply mapped to branches and branch-misses, as they return identical values:

$ perf stat -e branches,branch-misses,branch-loads,branch-load-misses -- /bin/true

 Performance counter stats for '/bin/true':

           415,881      branches
             8,787      branch-misses             #    2.11% of all branches
           415,881      branch-loads
             8,787      branch-load-misses

Regarding your original question, keep in mind that the impact of branches comes from two components: the number of mispredicted branches, and the branch resolution time (the time to compute the actual branch outcome, which potentially depends on long-latency loads). The former can be measured using the branch-misses event. To quantify the latter, you may be better off with something like TopDown analysis [2].
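For example, with a sufficiently recent perf and an Intel CPU that exposes the TopDown metric groups, something like the following gives a level-1 breakdown (the exact metric group name varies with perf version and CPU, and ./your_app is just a placeholder):

$ perf stat -M TopdownL1 -- ./your_app

The bad-speculation component of that breakdown estimates the fraction of pipeline slots wasted on the wrong path, which captures the misprediction penalty rather than just the misprediction count.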

[1] https://github.com/torvalds/linux/blob/master/arch/powerpc/perf/generic-compat-pmu.c

[2] https://perf.wiki.kernel.org/index.php/Top-Down_Analysis

– Wim

The branch-loads and branch-load-misses events are equivalent to the branch-instructions and branch-misses events, respectively.

They are just two events of type PERF_TYPE_HW_CACHE, which is another abstraction for hardware cache events; the branch prediction unit (BPU) is treated as cache-like.

$ strace -e perf_event_open perf stat -e branch-loads,branch-load-misses  ls
perf_event_open({type=PERF_TYPE_HW_CACHE, size=0x88 /* PERF_ATTR_SIZE_??? */, config=PERF_COUNT_HW_CACHE_BPU|PERF_COUNT_HW_CACHE_OP_READ<<8|PERF_COUNT_HW_CACHE_RESULT_ACCESS<<16, ...}, 2512745, -1, -1, PERF_FLAG_FD_CLOEXEC) = 4
perf_event_open({type=PERF_TYPE_HW_CACHE, size=0x88 /* PERF_ATTR_SIZE_??? */, config=PERF_COUNT_HW_CACHE_BPU|PERF_COUNT_HW_CACHE_OP_READ<<8|PERF_COUNT_HW_CACHE_RESULT_MISS<<16, ...}, 2512745, -1, -1, PERF_FLAG_FD_CLOEXEC) = 5
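For reference, the same two events can be opened directly from C via the perf_event_open(2) syscall, using the same config encoding the strace output shows. This is a minimal sketch, not from the original answer; error handling is omitted for brevity:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* glibc provides no wrapper for this syscall */
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static int open_bpu_event(unsigned long long result)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    /* same encoding as in the strace output: id | (op << 8) | (result << 16) */
    attr.config = PERF_COUNT_HW_CACHE_BPU |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (result << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)sys_perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
}

int main(void)
{
    int loads  = open_bpu_event(PERF_COUNT_HW_CACHE_RESULT_ACCESS);
    int misses = open_bpu_event(PERF_COUNT_HW_CACHE_RESULT_MISS);
    long long n_loads = 0, n_misses = 0;

    ioctl(loads,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(misses, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code you want to measure goes here ... */

    ioctl(loads,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(misses, PERF_EVENT_IOC_DISABLE, 0);
    read(loads,  &n_loads,  sizeof(n_loads));
    read(misses, &n_misses, sizeof(n_misses));
    printf("branch-loads: %lld, branch-load-misses: %lld\n", n_loads, n_misses);
    return 0;
}

Reading each fd after disabling yields the raw counts, matching what perf stat -e branch-loads,branch-load-misses reports for the same region.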

And they are finally mapped to the hardware events BR_INST_RETIRED.ALL_BRANCHES and BR_MISP_RETIRED.ALL_BRANCHES:

static __initconst const u64 skl_hw_cache_event_ids
                [PERF_COUNT_HW_CACHE_MAX]
                [PERF_COUNT_HW_CACHE_OP_MAX]
                [PERF_COUNT_HW_CACHE_RESULT_MAX] =
{
 ...
 [ C(BPU ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = 0xc4,    /* BR_INST_RETIRED.ALL_BRANCHES */
        [ C(RESULT_MISS)   ] = 0xc5,    /* BR_MISP_RETIRED.ALL_BRANCHES */
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = -1,
        [ C(RESULT_MISS)   ] = -1,
    },
    [ C(OP_PREFETCH) ] = {
        [ C(RESULT_ACCESS) ] = -1,
        [ C(RESULT_MISS)   ] = -1,
    },
 },
 ...
}

More mappings on Intel Skylake (you can verify these on your own machine, as shown after the list):

  • cache-misses: LONGEST_LAT_CACHE.MISS
  • cache-references: LONGEST_LAT_CACHE.REFERENCE
  • branch-loads: BR_INST_RETIRED.ALL_BRANCHES
  • branch-load-misses: BR_MISP_RETIRED.ALL_BRANCHES
  • L1-dcache-loads: MEM_INST_RETIRED.ALL_LOADS
  • L1-dcache-load-misses: L1D.REPLACEMENT
  • L1-dcache-stores: MEM_INST_RETIRED.ALL_STORES
  • L1-icache-load-misses: ICACHE_64B.IFTAG_MISS
  • LLC-loads: OFFCORE_RESPONSE
  • LLC-load-misses: OFFCORE_RESPONSE
  • LLC-stores: OFFCORE_RESPONSE
  • LLC-store-misses: OFFCORE_RESPONSE
  • dTLB-loads: MEM_INST_RETIRED.ALL_LOADS
  • dTLB-load-misses: DTLB_LOAD_MISSES.WALK_COMPLETED
  • dTLB-stores: MEM_INST_RETIRED.ALL_STORES
  • dTLB-store-misses: DTLB_STORE_MISSES.WALK_COMPLETED
  • iTLB-loads: ITLB_MISSES.STLB_HIT
  • iTLB-load-misses: ITLB_MISSES.WALK_COMPLETED
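To check how perf resolves one of these generic events on your CPU, you can have perf itself dump the perf_event_attr it sets up by raising the verbosity (the event name here is just an example):

$ perf stat -vv -e branch-loads -- true

In the dump you should see type 3 (PERF_TYPE_HW_CACHE) and the encoded config; the kernel then translates that to the raw hardware event per the table above.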
– Changbin Du

For measuring branch prediction and the branch misprediction rate, you can use Intel VTune Profiler. Download link: https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html#gs.bh5zrq

Just create a custom VTune analysis with 2 events:

  • BR_INST_RETIRED.ALL_BRANCHES

  • BR_MISP_RETIRED.ALL_BRANCHES

You'll need to manually divide one by the other to get the ratio, though.
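For example, using the counts from the perf output in the first answer: 8,787 mispredictions / 415,881 branches ≈ 2.11% misprediction rate.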

BR_INST_RETIRED.ALL_BRANCHES: Counts all (macro) branch instructions retired.

BR_MISP_RETIRED.ALL_BRANCHES: Counts all the retired branch instructions that were mispredicted by the processor.

A branch misprediction occurs when the processor incorrectly predicts the outcome (or destination) of a branch.

When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.
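To see this cost directly, here is a hypothetical microbenchmark (not from the answer): the same loop with a branch that is either perfectly predictable or effectively random. Note that an optimizing compiler may if-convert this branch into a conditional move, so check the generated assembly (or the branch-misses counts) before drawing conclusions:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* run as "./bench random" for an unpredictable branch (~50% miss
     * rate), or with no argument for a perfectly predictable pattern */
    int random_mode = (argc > 1 && strcmp(argv[1], "random") == 0);
    unsigned seed = 12345;
    long long sum = 0;

    for (long i = 0; i < 100000000L; i++) {
        seed = seed * 1103515245u + 12345u;                     /* cheap LCG */
        int taken = random_mode ? (int)((seed >> 16) & 1) : (int)(i & 1);
        if (taken)          /* mispredicted ~50% of the time in random mode */
            sum += i;
        else
            sum -= i;
    }
    printf("%lld\n", sum);
    return 0;
}

Running both variants under perf stat -e branches,branch-misses and comparing wall-clock times shows how much of the slowdown comes purely from mispredictions.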

In case you don't know - custom analysis in VTune can be created by selecting any pre-defined analysis and pressing the 'Customize...' button in the top-right corner.

E.g., you can select Microarchitecture Exploration, uncheck all the checkboxes there, press 'Customize...', scroll down to the table with CPU events, and uncheck the events you don't need or add the ones you do.

Regards

– Parsa
    Ok, yes those are useful HW events to look at, but what HW event does Linux `perf` map its `branch-loads` and `branch-load-misses` to? You're not answering that part, or the querent's attempt to measure the total misprediction *penalties* rather than just the misprediction *rate*. – Peter Cordes Sep 16 '21 at 11:01
  • (Linux `perf stat` already calculates the miss rate if you ask it to measure both `branches` and `branch-misses`, which I think on Intel HW maps to `br_misp_retired.all_branches` or `br_misp_retired.all_branches_pebs`) – Peter Cordes Sep 16 '21 at 11:02
  • And yes the CPU has to discard uops from the wrong path, but the surrounding code and microarchitectural conditions can have an effect on how many that is, and how much overall throughput that costs. [Avoid stalling pipeline by calculating conditional early](https://stackoverflow.com/q/49932119) discusses one way that branch misses can be cheaper, by *not* having the branch condition be the end of a long dep chain that can't be confirmed for a long time. – Peter Cordes Sep 16 '21 at 11:05