
I am trying to understand when branch predictor entries are invalidated.

Here are the experiments I have done:

Code1:

start_measure_branch_mispred()
while(X times):
 if(something something):
  do_useless()
 endif
endwhile
end_measurement()
store_difference()

So, I am running this code a number of times. I can see that after the first run, the misprediction rates go down: the branch predictor learns how to predict correctly. But if I run this experiment again and again (i.e. by typing ./experiment in the terminal), every execution starts from a high misprediction rate. So, at each execution, the branch prediction entries for those conditional branches appear to be invalidated. I am booting with nokaslr and I have disabled ASLR. I also run this experiment on an isolated core. I have repeated the experiment several times to make sure this behavior is real (i.e. not just noise).

My question is: Does the CPU invalidate branch prediction entries after the program stops its execution? Or what else is the cause of this?

The second experiment I have done is:

Code 2:

do:
    start_measure_branch_mispred()
    while(X times):
      if(something something):
        do_useless()
      endif
    endwhile
    end_measurement()
    store_difference()
while(cpu core == 1)

In this experiment, I am running two different processes from two different terminals. The first one is pinned to core 1, so it will run on core 1 and do this experiment until I stop it (by killing it). Then, I run the second process from another terminal and pin it to different cores. As this process is on a different core, it will only execute the do-while loop once. If the second process is pinned to the sibling core of the first one (same physical core), I see that in its first iteration, the second process guesses almost correctly. If I pin the second process to another core which is not the sibling of the first one, then the first iteration of the second process makes many more mispredictions. This is the expected result, because virtual cores on the same physical core share the same branch prediction units (that is my assumption). So, the second process benefits from the trained branch prediction units, as the branches have the same virtual addresses and map to the same branch prediction unit entries.

As far as I understand, since the CPU is not done with the first process (the core-1 process that does the busy loop), the branch prediction entries are still there and the second process can benefit from them. But in the first experiment, from run to run, I get high mispredictions.

EDIT: As the other user asked for the code, here it is. You need to download the performance-events header code from here.

To compile: $(CXX) -std=c++11 -O0 main.cpp -lpthread -o experiment

The code:

#include "linux-perf-events.h"

#include <algorithm>
#include <climits>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// some array
int arr8[8] = {1,1,0,0,0,1,0,1};

int pin_thread_to_core(int core_id){
    int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (core_id < 0 || core_id >= num_cores)
        return EINVAL; // reject invalid core ids instead of silently continuing
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

void measurement(int cpuid, uint64_t howmany, int* branch_misses){

    int retval = pin_thread_to_core(cpuid);
    if(retval){
        printf("Affinity error: %s\n", strerror(errno));
        return;
    }

    std::vector<int> evts;
    evts.push_back(PERF_COUNT_HW_BRANCH_MISSES); // You might have a different performance event!

    LinuxEvents<PERF_TYPE_HARDWARE> unified(evts, cpuid); // You need to change the constructor in the performance counter so that it will count the events in the given cpuid

    uint64_t *buffer = new uint64_t[howmany + 1];
    uint64_t *buffer_org; // for restoring
    buffer_org = buffer;
    uint64_t howmany_org = howmany; // for restoring

    std::vector<unsigned long long> results;
    results.resize(evts.size());

    do{
        for(size_t trial = 0; trial < 10; trial++) {

            unified.start();
            // the while loop will be executed howmany times
            int res;
            while(howmany){
                res = arr8[howmany & 0x7]; // do the sequence howmany/8 times
                if(res){
                    *buffer++ = res;
                }       
                howmany--;
            }
            unified.end(results);
            // store misses
            branch_misses[trial] = results[0];
            // restore for next iteration
            buffer = buffer_org;
            howmany = howmany_org;
        }
    }while(cpuid == 5); // the core that does busy loop

    // get rid of optimization
    howmany = (howmany + 1) * buffer[3];
    branch_misses[10] = howmany; // last entry is reserved for this dummy operation

    delete[] buffer;

}
void usage(){
    printf("Run with ./experiment X \t where X is the core number\n");
}
int main(int argc, char *argv[]) {
    // as I have 11th core isolated, set affinity to that
    if(argc == 1){
        usage();
        return 1;
    }

    int exp = 16; // howmany

    int results[11];
    int cpuid = atoi(argv[1]); 

    measurement(cpuid, exp, results);

    printf("%d measurements\n", exp);

    printf("Trial\t\t\tBranchMiss\n");
    for (size_t trial = 0; trial < 10; trial++)
    {
        printf("%zu\t\t\t%d\n", trial, results[trial]);
    }
    return 0;
}

If you want to try the first code, just run ./experiment 1 twice. It behaves the same as the first code.

If you want to try the second code, open two terminals, run ./experiment X in the first one, and run ./experiment Y in the second one, where X and Y are cpuids.

Note that you might not have the same performance event counter. Also note that you might need to change the cpuid in the busy loop.

yzb74714
    Well, then write C. We can't test the branch-predictor on pseudocode. – S.S. Anne Dec 02 '19 at 17:24
    @JL2210 I have added the C code. You need to download the performance event counter. You might also need to modify a line in the performance event counter so that it will only measure that event in the assigned core (line 31 : `const int cpu = -1; ` to a different core) – yzb74714 Dec 02 '19 at 17:46
    That's fine. Thank you for adding the code. – S.S. Anne Dec 02 '19 at 18:06

3 Answers


So, I have conducted more experiments to reduce the effect of noise (either from the code that runs between _start and main(), or from syscalls and interrupts that can happen between two program executions and can corrupt the branch predictors).

Here is the pseudo-code of the modified experiment:

int main(int arg){ // arg is the iteration
   pin_thread_to_isolated_core()
   for i=0 to arg:
     measurement()
     std::this_thread::sleep_for(std::chrono::milliseconds(1)); // I put this as it is
   endfor
   printresults() // print after all measurements are completed
}

void measurement(){
   initialization()
   for i=0 to 10:
      start_measurement()
      while(X times) // for the results below, X is 32
        a = arr8[an element] //sequence of 8,
        if(a is odd)
           do_sth()
        endif
      endwhile
      end_measurement()
      store_difference()
   endfor
}
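
In actual C++, the driver above looks roughly like the sketch below (this is not the exact code I ran; it links against the pin_thread_to_core() helper from the code in the question and assumes a measurement() that fills one set of 10 trial counts, so the signatures are illustrative only):

#include <array>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

int pin_thread_to_core(int core_id);   // as defined in the code in the question
void measurement(int* branch_misses);  // assumed helper: fills one set of 10 trial counts

int main(int argc, char* argv[]) {
    if (argc < 3) return 1;
    int cpuid = atoi(argv[1]);       // isolated core to pin to
    int iterations = atoi(argv[2]);  // number of measurement sets

    pin_thread_to_core(cpuid);

    std::vector<std::array<int, 10>> all_results(iterations);
    for (int i = 0; i < iterations; ++i) {
        measurement(all_results[i].data());  // one set of 10 measurements
        std::this_thread::sleep_for(std::chrono::milliseconds(1));  // I put this as it is
    }

    // print only after all measurements are completed
    for (int i = 0; i < iterations; ++i) {
        printf("RUN:%d\nTrial\t\tBranchMiss\n", i + 1);
        for (int t = 0; t < 10; ++t)
            printf("%d\t\t%d\n", t, all_results[i][t]);
    }
    return 0;
}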

And, these are the results:

For example, I give the iteration count as 3:

Trial           BranchMiss
RUN:1
    0           16
    1           28
    2           3
    3           1
    ....  continues as 1
RUN:2
    0           16   // CPU forgets the sequence
    1           30
    2           2
    3           1
    ....  continues as 1
RUN:3
    0           16
    1           27
    2           4
    3           1
    ....  continues as 1

So, even a millisecond sleep can disturb the branch prediction units. Why is that the case? If I don't put a sleep between those measurements, the CPU guesses correctly, i.e. RUN:2 and RUN:3 will look like below:

RUN:2
    0           1   
    1           1
    ....  continues as 1
RUN:3
    0           1
    1           1
    ....  continues as 1

I believe I have minimized the branches executed between _start and the measurement point. Still, the CPU forgets what it was trained on.

yzb74714
  • @HadiBrais I need to do research about this. I have no clue about C-states. I will try to update when I obtain a better background. – yzb74714 Dec 03 '19 at 21:56
    @HadiBrais Just one extra information. I tried to execute this code on a non-isolated core. In an isolated core, when I try to sleep with `usleep(100)`, it works (CPU remembers previous iterations). When I try to `usleep(500)`, CPU forgets, probably it is because of the behavior you mentioned above. However, if I execute this code on a non-isolated core, the `usleep(500)` will also remembers the previous training. So, somehow, CPU forgets states quickly if it is an isolated core. I am not sure, I need to run experiments multiple times to reduce the noise and have a better understanding. – yzb74714 Dec 03 '19 at 22:00
  • @HadiBrais I have added a parameter to the kernel. `GRUB_CMDLINE_LINUX="isolcpus=6,7"` is my parameter and cores 6 and 7 are siblings (in the same physical core). They are isolated from other user processes (other user processes cannot be scheduled on these cores) but they(cores) are not isolated from kernel/OS, I know that. I am using `5.0.0-36-generic` kernel. So, when I am running an experiment on an isolated core, the hyperthreading is also isolated, nothing happens. When I run the code on a non-isolated core, both that core and its sibling is active-> less likely to go deeper C-states. – yzb74714 Dec 03 '19 at 22:25
  • I've collected my comments into an answer. – Hadi Brais Dec 24 '19 at 14:40

Does CPU invalidate branch prediction units after the program stops its execution?

No, the CPU has no idea if/when a program stops execution.

The branch prediction data only makes sense for one virtual address space, so when you switch to a different virtual address space (or when the kernel switches to a different address space, rips the old virtual address space apart and converts its page tables, etc. back into free RAM, then constructs an entirely new virtual address space when you start the program again) all of the old branch predictor data is no longer valid for the new (completely different and unrelated, even if the contents happen to be the same) virtual address space.

If the second process is pinned to the sibling core of the first one (same physical core), I see that in the first iteration, the second process guess almost correctly.

This is expected results because virtual cores on the same physical core share the same branch prediction units (that is my assumption).

In a perfect world, a glaring security vulnerability (branch predictor state, which can be used to infer information about the data that caused it, being leaked from a victim's process on one logical processor to an attacker's process on a different logical processor in the same core) is not what I'd expect.

The world is somewhat less than perfect. More specifically, in a perfect world branch predictor entries would have "tags" (meta-data) containing which virtual address space and the full virtual address (and which CPU mode) the entry is valid for, and all of this information would be checked by the CPU before using the entry to predict a branch; however that's more expensive and slower than having smaller tags with less information, accidentally using branch predictor entries that are not appropriate, and ending up with "spectre-like" security vulnerabilities.
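
To make that concrete, here is a purely hypothetical sketch of such "small tag" indexing (the table size, index bits and tag bits are invented for illustration and are not any real CPU's design):

#include <cstdint>

// Hypothetical, simplified predictor indexing - not Intel's or AMD's actual scheme.
uint32_t btb_index(uint64_t branch_va)       { return branch_va & 0xFFF; }          // low 12 bits pick 1 of 4096 entries
uint32_t btb_partial_tag(uint64_t branch_va) { return (branch_va >> 12) & 0xFF; }   // only 8 more address bits are checked

// With ASLR disabled, two different processes place their branches at the same
// virtual addresses, so both the index and the small partial tag match and the
// stale entry is reused. Tagging each entry with the full address and the
// address space would prevent this, but that is the expensive option described above.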

Note that this is a known vulnerability that the OS you're using failed to mitigate, most likely because you disabled the first line of defense against this kind of vulnerability (ASLR).

Brendan
  • `so when you switch to a different virtual address space(....) all of the old branch predictor data is no longer valid for the new virtual address space.` Well, I know that. That is why I am using ASLR disabled to make sure that the 2 processes have the same virtual address space when being executed. I have inspected using `gdb` and see that the functions have the same VA. I just don't understand why even though ASLR is disabled, the two consecutive execution cannot use the same branch prediction entries. – yzb74714 Dec 02 '19 at 20:10
  • @yzb74714: They're not the same virtual address space, they're completely different virtual address spaces (that happen to have the same contents). Think of it like web sites, where the HTTP server at IP address 1.2.3.4 happens to provide the same "index.html" as a completely different HTTP server at a completely different IP address. Note that even the contents of the virtual address spaces aren't 100% identical - e.g. the memory used for stacks will be different (and they'd crash if the stacks were the same). – Brendan Dec 02 '19 at 20:16
  • if they are completely different address spaces, then how can I find a low misprediction rate in the second execution? i.e. if I run the code in a busy loop in core X and the other one in the core Y (X and Y are sibling), the Core Y will give low mispredictions. – yzb74714 Dec 02 '19 at 20:21
  • @yzb74714: You find a lower misprediction rate because of a security vulnerability. You would find that even if the processes have nothing in common except for having branches at the same virtual address, one will interfere with the other's branch mispredictions (e.g. a process that consists of nothing more than "taken" branches at the right virtual addresses would also interfere with the mispredictions seen by your process). – Brendan Dec 02 '19 at 20:30
  • I know that. If two branches alias the same virtual address, one can influence the other one. I can understand that if there are two processes running and if they are aliasing, the second one can benefit from that(lower misprediction in my code). But, if I execute them one after another, they still have the same virtual address, but the first execution cannot influence the second one. Why is that the case? – yzb74714 Dec 02 '19 at 20:33
  • @yzb74714: How many entries does the branch predictor have; and how many branches does the kernel do between terminating one process and starting another? I would expect that the stale branch information from the first process would influence then second process if the kernel's own branches didn't cause the stale information to be replaced with more recent information, and if the kernel doesn't intentionally flush branch predictor data before/when returning from kernel to user-space. However, both pollution and intentional explicit flushing are likely. – Brendan Dec 02 '19 at 20:53
  • that is what I am guessing. I believe that during a context switch the branch address collision is less likely than explicit/intentional flush to the branch entries. Maybe both AMD and Intel have a bit of information in branch prediction units, which says that whether the current process is running or finished. I guess I need to check for documentation, manuals and possibly patents for that information. – yzb74714 Dec 02 '19 at 20:58
  • Is ASLR disabled by default on Linux? Can you enable it without recompiling the kernel? – S.S. Anne Dec 02 '19 at 23:57
    @JL2210 user-space ASLR is enabled by default; there's a sysctl for it. https://linux-audit.com/linux-aslr-and-kernelrandomize_va_space-setting/. Or you can disable it on a per-process basis; GDB does that by default. Only PIE executables can be ASLRed, but most distros build GCC with `--enable-default-pie`. (Shared libs have to be PIC so can always be ASLRed, but non-PIE executables can have absolute addresses hard-coded sometimes without fixup relocations.) – Peter Cordes Dec 03 '19 at 03:24
  • @yzb74714: Instead of having one BTB entry per branch, modern predictor designs like ITTAGE use global branch *history* as well as the address to index into a table of predictions. https://comparch.net/2013/06/30/why-tage-is-the-best/ So the predictions for a single indirect branch that's reached from different places (like [the main dispatch loop in an interpreter](https://hal.inria.fr/hal-01100647/document)) has its prediction info scattered through the whole BTB. Different branches can kind of "compete" for entries when they alias, but usually that's a minor effect. – Peter Cordes Dec 03 '19 at 03:28
  • A short context switch might not re-train everything so there's still useful stuff when switching back; flushing would kill that. (Especially just for a kernel interrupt handler). Unfortunately Spectre makes that problematic; the idea that changes in microarchitectural state (e.g. what's hot in L1d cache) can be sensitive / read out by unprivileged userspace means that mispredictions become an attack vector. – Peter Cordes Dec 03 '19 at 03:30
  • @PeterCordes Oh, great. Well, I might be able to convince my packager to enable this. – S.S. Anne Dec 03 '19 at 11:59
    @JL2210: Oh, if you want to enable PIE when it's not the default, use `gcc -pie -fPIE` (plus the usual `-O3 -march=native`.) [32-bit absolute addresses no longer allowed in x86-64 Linux?](//stackoverflow.com/q/43367427) explains how to disable it when it *is* the default and says more about it. – Peter Cordes Dec 03 '19 at 12:04
  • @PeterCordes I know what TAGE does. I have read TAGE, PPM like predictors, TAGE-SC-L, O-Gehl all kinds of stuff. I cannot find information about what Intel uses. Only I can find is `Branchscope` work, which explains that Intel uses simple 2-bit counters for local predictor. AMD made a presentation with its new Ryzen series(3xxx) that they are using TAGE, but which kind (look above for different kinds). I have disabled `ASLR` from the terminal but it is not persistent. After every boot, I need to re-disable it. – yzb74714 Dec 03 '19 at 13:11
  • @yzb74714: https://hal.inria.fr/hal-01100647/document is pretty good evidence that Haswell uses IT-TAGE ; actual results are close to a simulated IT TAGE. https://danluu.com/branch-prediction/ links that without endorsing the claim. I've been assuming it's true; IIRC Intel has some patents and/or an academic paper on TAGE so it's assumed they use something like that in their HW. possibly related: Matt Godbolt did some branch-prediction testing: https://xania.org/201602/bpu-part-two – Peter Cordes Dec 03 '19 at 13:23
  • @yzb74714: *After every boot, I need to re-disable it.* - of course; that's how sysctl works. You can edit `/etc/sysctl.d/99-local.conf` or a similar config file to have stuff applied at boot. – Peter Cordes Dec 03 '19 at 13:28
  • @yzb74714: anyway, my point was that with TAGE, you can't just "clear the entries for that process" because entries don't have a specific owner. Also, it's not unheard of for a computer to re-run the same program again when it finishes (e.g. `for f in *.mp4; do ffmpeg -i $f ...;done`), so your test case is not as artificial as you might imagine. Ignoring Spectre, flushing the BTB after a process terminates is pure downside; it lowers performance if it reruns right away. – Peter Cordes Dec 03 '19 at 13:33
  • @PeterCordes I guess I cannot explain myself clearly. I am expecting BTB not to be flushed after the program terminates. I also don't expect that all branch entries (actually the ones that I am measuring) are invalidated when the program terminates, or when an interrupt happens in an isolated core. – yzb74714 Dec 03 '19 at 14:11
  • @PeterCordes What I don't understand is, when the first code runs in a busy loop, it fills the global history with the pattern (8 in my case excluding the outer for loop and while loop conditions). So, when the second code runs at the same time, by the time it comes to the branch that I want to measure, it already executes other branches and pollutes the global branch history. But, it somehow benefits from the branch entries in TAGE which are currently being run on the sibling core in a busy loop. – yzb74714 Dec 03 '19 at 14:13
  • @PeterCordes But, when I run 2 consecutive execution, the CPU somehow forgets the history of the previous execution. The case 1 and case 2 are similar with respect to polluting the global history (probably running 2 consecutive execution pollute more but it should not be that marginal). I am experimenting 3-4 misprediction if I run 2 codes simultaneously, but 20-30 mispredictions if I run them one after another. – yzb74714 Dec 03 '19 at 14:16
  • @PeterCordes Btw, I have disabled almost all kernel options related to Spectre and Meltdown to have a better understanding of it. `GRUB_CMDLINE_LINUX_DEFAULT="quiet splash noibrs noibpb nopti nokaslr nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off mitigations=off dis_ucode_ldr"` – yzb74714 Dec 03 '19 at 14:19
  • @yzb74714: ok, sorry I didn't re-read the question before commenting just now; I forgot what it said if I'd ever read it carefully. Yes, Intel CPUs at least share a single physical BTB between logical cores. I'd assume that startup overhead creates too much pollution; IDK if you've ever looked at how much dynamic-linker code runs before you even reach `main` but it's a *lot*. `strace ./a.out` shows lots of system calls, each of them making kernel calls. I had been guessing that you might get some BPU benefit across sequential executions, but I'm not surprised that there's no benefit. – Peter Cordes Dec 03 '19 at 14:21
  • @yzb74714: You might have better luck if you write your own minimal `_start`, or just call your function `_start` and exit with an inline asm system call. And statically link it with `gcc -nostdlib -static -fno-pie -no-pie`. You still have kernel and bash code running between executions, but not much else in user-space for those processes. IDK if it would help to have one execve the other directly, avoiding returning to `bash`. Probably also use `-Og` or `-O1` so C++ library functions can inline; less branching for constructors in your code is going to reduce pollution. – Peter Cordes Dec 03 '19 at 14:23
  • @PeterCordes Ok, I'll try to minimize those calls, etc. Still, to measure those performance counters, I need to read a file descriptor and probably do a bunch of system-calls, but I am doing this anyway at each run of the loop. I should also look for `rdpmr` instruction for less pollution but all I can find is a little information in Intel's forum. – yzb74714 Dec 03 '19 at 14:31
  • @PeterCordes I think I have found something interesting. I have posted an experiment below. – yzb74714 Dec 03 '19 at 19:56

TL:DR: power-saving deep sleep states clear branch-predictor history. Limiting sleep level to C3 preserves it on Broadwell. Broadly speaking, all branch prediction state including the BTB and RSB is preserved in C3 and shallower.

For branch history to be useful across runs, it also helps to disable ASLR (so virtual addresses are the same), for example with a non-PIE executable.

Also, isolate the process on a single core because branch predictor entries are local to a physical core on Intel CPUs. Core isolation is not strictly necessary, though: if you run the program many times consecutively on a mostly idle system, you'll find that it sometimes works, but not always. Basically, any task that happens to run on the same core, even for a short time, can pollute the branch predictor state. So running on an isolated core helps get more stable results, especially on a busy system.


There are several factors that impact the measured number of branch mispredictions, but it's possible to isolate them from one another to determine what is causing these mispredictions. I need to introduce some terminology and my experimental setup first before discussing the details.

I'll use the version of the code from the answer you've posted, which is more general than the one shown in the question. The following code shows the most important parts:

void measurement(int cpuid, uint64_t howmany, int* branch_misses) {
    ...
        for(size_t trial = 0; trial < 4; trial++) {

            unified.start();
            int res;
            for(uint64_t tmp = howmany; tmp; tmp--) {
                res = arr8[tmp & 0x7];
                if(res){
                    *buffer++ = res;
                }
            }
            unified.end(results);
            ...
        }
    ...
}

int main(int argc, char *argv[]) {
    ...
    for(int i = 0; i < 3; ++i) {
        measurement(cpuid, exp, results);
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    ...
}

A single execution of this program performs multiple sets of measurements of the number of branch mispredictions (the event BR_MISP_RETIRED.ALL_BRANCHES on Intel processors) of the while loop in the measurement function. Each set of measurements is followed by a call to sleep_for() to sleep for 1ms. Measurements within the same set are only separated by calls to unified.start() and unified.end(), which internally perform transitions to kernel mode and back to user mode. I've experimentally determined that it's sufficient for the number of measurements within a set to be 4 and the number of sets to be 3 because the number of branch mispredictions doesn't change beyond that. In addition, the exact location of the call to pin_thread_to_core in the code doesn't seem to be important, which indicates that there is no pollution from the code that surrounds the region of interest.

In all my experiments, I've compiled the code using gcc 7.4.0 -O0 and run it natively on a system with Linux 4.15.0 and an Intel Broadwell processor with hyperthreading disabled. As I'll discuss later, it's important to see what kinds of branches there are in the region of interest (i.e., the code for which the number of branch mispredictions is being measured). Since you've limited the event count to only user-mode events (by setting perf_event_attr.exclude_kernel to 1), you only need to consider the user-mode code. But using the -O0 optimization level and C++ makes the native code a little ugly.

The unified.start() function contains two calls to ioctl(), but user-mode events are measured only after returning from the second call. Starting from that location in unified.start(), there are a bunch of calls to PLTs (which contain only unconditional direct jumps), a few direct jumps, and a ret at the end. The while loop is implemented as a couple of conditional and unconditional direct jumps. Then there is a call to unified.end(), which calls ioctl to transition to kernel mode and disable event counting. In the whole region of interest, there are no indirect branches other than a single ret. Any ret or conditional jump instruction may generate a branch misprediction event. Indirect jumps and calls could also generate misprediction events if they existed. It's important to know this because an active Spectre v2 mitigation can change the state of the buffer used for predicting indirect branches other than rets (called the BTB). According to the kernel log, the following Spectre mitigations are used on the system:

Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Spectre V2 : Mitigation: Full generic retpoline
Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
Spectre V2 : Enabling Restricted Speculation for firmware calls
Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier

The experimental setup described above is the baseline setup. Some of the experiments discussed below use additional compilation options or kernel parameters. First, I've used intel_idle.max_cstate to limit the deepest Core C-state that the kernel can use. Broadwell supports the following Core C-states: C0, C1, C1E, C3, C6, and C7. I only needed to use two max_cstate values, namely 3 and 6, so that the kernel doesn't use Core C-states below C3 and C6, respectively. Some experiments were run on a core isolated with the isolcpus kernel parameter. Finally, some experiments use code compiled with the -no-pie option, which disables PIE. All other kernel parameters have the default values. In particular, CPU vulnerability mitigations are always enabled.

The following figure shows the number of mispredictions measured in different configurations. I've followed the following experimental methodology:

  • Configure the system as required for the experiment to be conducted. Then the system is restarted so that the state of the branch prediction buffers is the same as the one used for other experiments.
  • The program is run ten consecutive times on the terminal. If isolcpus is used in the configuration, the program is always run on the isolated core.
  • There are three sets of four measurements in each of the ten runs. The four measurements of the first set of the first run are not shown in the figure because the numbers are practically the same in all configurations. They are basically 15, 6, 3, and 2 mispredictions. These are the training runs for the branch predictor, so it's expected that the number of mispredictions will be high for the first measurement and that it will decrease in later measurements as the branch predictor learns. Increasing the number of measurements in the same set doesn't reduce the number of mispredictions any further. The rest of the measurements are plotted in the figure. The 12 bars of each configuration correspond to the 12 measurements performed in a single run in the same order. The numbers are averaged over the ten runs (except that the numbers of the first set of the first run are not included in the average in the first four bars). The label sXmY in the figure refers to the average number of mispredictions over the ten runs for measurement Y of set X.

[Figure: average number of branch mispredictions per measurement (bars labeled sXmY) for each of the ten configurations]

The first configuration is essentially equivalent to the default. The first measurement of the first set indicates whether the branch predictor has retained what it has learned in the previous run of the experiment. The first measurements of the other two sets indicate whether the branch predictor has retained what it has learned in the previous set of measurements in the same run despite the call to sleep_for. It's clear that the branch predictor has failed to retain this information in both cases in the first configuration. This is also the case in the next three configurations. In all of these configurations, intel_idle.max_cstate is set to 6, meaning that the cpuidle subsystem can choose to put a core into C6 when it has an empty runqueue. This is expected because C6 is a power-gating state.

In the fifth configuration, intel_idle.max_cstate is set to 3, meaning that the deepest C-state the kernel is allowed to use is C3, which is a clock-gating state. The results indicate that the branch predictor can now retain its information across calls to sleep_for. Using a tool like strace, you can confirm that sleep_for always invokes the nanosleep system call irrespective of intel_idle.max_cstate. This means that user-kernel transitions cannot be the reason for polluting the branch prediction history in the previous configurations and that the C-state must be the influencing factor here.
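
One way to separate the idle-state effect from the nanosleep system call itself, without changing intel_idle.max_cstate (a sketch only; I haven't measured this variant), is to replace sleep_for with a busy-wait of the same duration so that the core never becomes idle:

#include <chrono>

// Spin for approximately the same time as sleep_for(1ms), but without letting
// the core go idle: the thread keeps running, so the core stays in C0 and the
// cpuidle subsystem never gets a chance to request C3 or C6.
void busy_wait_ms(int ms) {
    auto deadline = std::chrono::steady_clock::now() + std::chrono::milliseconds(ms);
    while (std::chrono::steady_clock::now() < deadline) {
        // optionally execute a pause instruction here to be friendly to a sibling hyperthread
    }
}

The spin loop executes its own (easily predicted) branches, so it is not a perfectly clean substitute, but it keeps the core active and takes the C-state out of the picture.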

Broadwell supports automatic promotion and demotion of C-states, which means that the hardware itself can change the C-state to something different than what the kernel has requested. The results may be a little perturbed if these features are not disabled, but I didn't find this to be an issue. I've observed that the number of cycles spent in C3 or C6 (depending on intel_idle.max_cstate) increases with the number of sets of measurements.

In the fifth configuration, the first bar is as high as in the previous configurations though. So the branch predictor is still not able to remember what it has learned in the first run. The sixth and seventh configurations are similar.

In the eighth configuration, the first bar is significantly lower than in the earlier configurations, which indicates that the branch predictor can now benefit from what it has learned in a previous run of the same program. This is achieved by using two configuration options in addition to setting intel_idle.max_cstate to 3: disabling PIE and running on an isolated core. Although it's not clear from the graph, both options are required. The kernel can randomize the base address of PIE binaries, which changes the addresses of all branch instructions. This makes it more likely that the same static branch instructions map to different branch buffer entries than in the previous run. So what the branch predictor has learned in the previous run is still there in its buffers, but it cannot utilize this information anymore because the linear addresses of the branches have changed. The fact that running on an isolated core is necessary indicates that it's common for the kernel to run short tasks on idle cores, which pollute the branch predictor state.
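
To check whether the branch addresses really are stable across runs, a small standalone sketch (not part of the measured program; the function here is just a stand-in) can print the address of a function and compare it between consecutive executions:

#include <cstdio>

void some_branchy_function() {}  // stand-in for a function in the region of interest

int main() {
    // With -no-pie (and/or ASLR disabled) this address is identical on every run,
    // so the branch addresses are stable across runs. With PIE and ASLR enabled it
    // changes from run to run, and so do the addresses of all branches in the binary.
    printf("some_branchy_function is at %p\n", (void*)&some_branchy_function);
    return 0;
}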

The first four bars of the eighth configuration show that the branch predictor is still learning about one or two branch instructions that are in the region of interest. Actually, all of the remaining branch mispredictions are not for branches in the while loop. To show this, the experiments can be repeated on the same code but without the while loop (i.e., there is nothing between unified.start() and unified.end()). This is the ninth configuration. Observe how the number of mispredictions is about the same.

The first bar is still a little higher than the others. Also, it seems that there are branches that the branch predictor is having a hard time predicting. The tenth configuration takes -no-pie one step further and disables ASLR completely. This makes the first bar about equal to the others, but doesn't get rid of the two mispredictions. perf record -e cpu/branch-misses/uppp -c 1 can be used to find out which branches are being mispredicted. It tells me that the only branch in the region of interest that is being mispredicted is a branch instruction in the PLT of ioctl. I'm not sure which two branches are being mispredicted and why.

Regarding sharing branch prediction entries between hyperthreads, we know that some of the buffers are shared. For example, we know from the Spectre attack that the BTB is shared between hyperthreads on at least some Intel processors. According to Intel:

As noted in descriptions of "Indirect Branch Prediction and Intel® Hyper-Threading Technology (Intel® HT Technology)", logical processors sharing a core may share indirect branch predictors, allowing one logical processor to control the predicted targets of indirect branches by another logical processor of the same core. . . .
Recall that indirect branch predictors are never shared across cores.

Your results also suggest that the BHT is shared. We also know that the RSB is not shared. In general, this is a design choice. These structures don't have to be like that.

Hadi Brais
  • `while(howmany){ ...; howmany--; }` is inside a repeat-loop. Does something reset `howmany` for later iterations, or is that intentionally creating mispredicts by running zero inner iterations for `trial=1..3`? Oh, I see the OP's code uses an extra variable to save `howmany_org` instead of using a tmp for the loop counter. I assume you did the same thing; would be clearer to express with a `for(tmp = howmany; tmp; tmp--){}` loop. – Peter Cordes Dec 24 '19 at 19:21
  • I added a TL:DR. I hope it's accurate; the answer is long and is good evidence to back up your conclusions, but having the take-away up front is what most future readers probably want. – Peter Cordes Dec 24 '19 at 19:36
  • Interesting that branch history survives a user->kernel transition. I haven't kept up with whether Linux always / never / sometimes uses the microcode-provided MSR write to make later indirect(?) branch prediction independent of previous lower-privileged ones; I know that's slow and I thought it wiped out the whole branch prediction state. Your kernel log messages might show which strategy it's using for Spectre mitigation; that would be a useful addition to your answer. – Peter Cordes Dec 24 '19 at 19:44
    @PeterCordes Thanks for the edit and suggestions. Yes, it appears that the IBPB mitigation only flushes the BTB. Note that there are no indirect jumps in this case. – Hadi Brais Dec 25 '19 at 03:57
  • Another branch-prediction details question if you're interested: [Changing irrelevant part of the function changes papi measurement of branch prediction](//stackoverflow.com/q/59506481). No obvious answer occurred to me from reading it; you might be interested in digging deep into the mystery. – Peter Cordes Dec 28 '19 at 00:56