
I am running a test to measure the cost of a single function call through a pointer; here is my code. However, I found something very weird and am looking for your help.

The code is compiled in Release mode of VS2017 with the default configuration.

There are 4 testbeds, all running Windows 10. Here is some detailed information:

  • M1: CPU: i7-7700, Microarchitecture: Kaby Lake
  • M2: CPU: i7-7700, Microarchitecture: Kaby Lake
  • M3: CPU: i7-4790, Microarchitecture: Haswell
  • M4: CPU: E5-2698 v3, Microarchitecture: Haswell

In the figures below, the legends have the form machine parameter_order alias. machine is one of the four machines listed above. parameter_order describes the order in which the LOOP values were passed to the program during a single run. alias indicates which part the time belongs to: no-exec means the part without the extra function call, aka Line 98—108 (the first timing loop in the code below); exec means the part with the extra call, aka Line 115—125 (the second timing loop); per-exec is the cost of one function call, computed as (exec - no-exec) / (RULE * ITER). All times are in milliseconds. per-exec refers to the left y-axis, while the others refer to the right y-axis.

Comparing Fig. 1—Fig. 4, you can see that the shape of the graph seems to relate to the CPU's microarchitecture (M1's and M2's are similar, and M3's and M4's are similar).

My questions:

  1. Why do all machines have two stages (LOOP < 25 and LOOP > 100)?
  2. Why do all no-exec times have a weird peak when 32 <= LOOP <= 41?
  3. Why do the no-exec and exec times of the Kaby Lake machines (M1 and M2) have a discontinuous interval when 72 <= LOOP <= 94?
  4. Why does M4 (a server processor) have a larger variance than M3 (a desktop processor)?

Here are my test results:

Fig. 1

Fig. 2

Fig. 3

Fig. 4

For convenience, I also paste code here:

#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cassert>
#include <algorithm>

#include <windows.h>

using namespace std;

const int PMAX = 11000000, ITER = 60000, RULE = 10000;
//const int LOOP = 10;

int func1(int a, int b, int c, int d, int e)
{
    return 0;
}

int func2(int a, int b, int c, int d, int e)
{
    return 0;
}

int func3(int a, int b, int c, int d, int e)
{
    return 0;
}

int func4(int a, int b, int c, int d, int e)
{
    return 0;
}

int func5(int a, int b, int c, int d, int e)
{
    return 0;
}

int func6(int a, int b, int c, int d, int e)
{
    return 0;
}

int (*init[6])(int, int, int, int, int) = {
    func1,
    func2,
    func3,
    func4,
    func5,
    func6
};
int (*pool[PMAX])(int, int, int, int, int);

LARGE_INTEGER freq;

void getTime(LARGE_INTEGER *res)
{
    QueryPerformanceCounter(res);
}

double delta(LARGE_INTEGER begin_time, LARGE_INTEGER end_time)
{
    return (end_time.QuadPart - begin_time.QuadPart) * 1000.0 / freq.QuadPart;
}

int main()
{
    char path[100], tmp[100];

    FILE *fin, *fout;

    int cnt = 0;

    int i, j, t, r;
    int ans;

    int LOOP;

    LARGE_INTEGER begin_time, end_time;
    double d1, d2, res;

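    // fill the pool round-robin with pointers to the six identical dummy functions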
    for(i = 0;i < PMAX;i += 1)
        pool[i] = init[i % 6];

    QueryPerformanceFrequency(&freq);

    printf("file path:");
    scanf("%99s", path); // bounded read so a long path can't overflow path[100]

    fin = fopen(path, "r");

start:
    if (fscanf(fin, "%d", &LOOP) == EOF)
        goto end;

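    // "no-exec" part: the fixed call through pool[0] plus the dummy XOR loop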
    ans = 0;
    getTime(&begin_time);
    for(r = 0;r < RULE;r += 1)
    {
        for(t = 0;t < ITER;t += 1)
        {
            //ans ^= (pool[t])(0, 0, 0, 0, 0);
            ans ^= pool[0](0, 0, 0, 0, 0);
            ans = 0;
            for(j = 0;j < LOOP;j += 1)
                ans ^= j;
        }
    }
    getTime(&end_time);
    printf("%.10f\n", d1 = delta(begin_time, end_time));
    printf("ans:%d\n", ans);

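    // "exec" part: same as above plus one extra indirect call through the varying pool[t]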
    ans = 0;
    getTime(&begin_time);
    for(r = 0;r < RULE;r += 1)
    {
        for(t = 0;t < ITER;t += 1)
        {
            ans ^= (pool[t])(0, 0, 0, 0, 0);
            ans ^= pool[0](0, 0, 0, 0, 0);
            ans = 0;
            for(j = 0;j < LOOP;j += 1)
                ans ^= j;
        }
    }
    getTime(&end_time);
    printf("%.10f\n", d2 = delta(begin_time, end_time));
    printf("ans:%d\n", ans);

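    // per-exec: average extra cost of the one additional indirect call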
    printf("%.10f\n", res = (d2 - d1) / (1.0 * RULE * ITER));

    sprintf(tmp, "%d.txt", cnt++);

    fout = fopen(tmp, "w");
    fprintf(fout, "%d,%.10f,%.10f,%.10f\n", LOOP, d1, d2, res);
    fclose(fout);

    goto start;
end:

    fclose(fin);

    system("pause");

    exit(0);
}
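(For clarity on running it: the program reads the input file path from stdin, then reads whitespace-separated LOOP values from that file until EOF, writing one numbered result file, 0.txt, 1.txt, ..., per value. A hypothetical input file for a sweep might simply contain: 1 5 10 25 50 75 100 150.)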
Peter Cordes
Leo Jacob
    You said you want to measure the cost of one function call via pointer, but I don't understand how you're doing that. What do you mean by `no function call part` and `function call part`? Both seem to contain function calls. What does this part `(d2 - d1) / (1.0 * RULE * ITER)` calculate? How can the left y axis be negative? What is the point of `LOOP`? What's the difference between M1 and M2? – Hadi Brais May 23 '18 at 16:35
  • `LOOP < 25` is where modern Intel CPUs can correctly predict the loop-exit branch for a short loop. Taken 20 times then not-taken once is an easier pattern than taken 24 times and not-taken once. In my testing with simple nested loops, I was finding something like 22 to 23 iterations for the inner (with millions for the outer) was the cutoff for perfect branch prediction. – Peter Cordes May 23 '18 at 16:46
  • Your prose doesn't seem aligned with the code. You say "no-exec means no function call part, aka. Line 98—108. exec means calling function part, aka. Line 115—125.", but as far as I can see the two timing loops involve either one function call (first loop) to the fixed location `pool[0]`, or two function calls, one to `pool[0]` and the other to `pool[t]` for t = 1, 2, 3, ... etc. You say "per-exec" is the cost of a function call, so it is the LOOP-normalized value of which of the other two tests? – BeeOnRope May 23 '18 at 18:22
  • The M4 graph is confusing. If I understand it, `per-exec` is just one of the other two series divided by `LOOP`, but `per-exec` shows both positive and negative deviation from the average time, but the other two series show a strong lower bound right at the average, with only positive deviations. – BeeOnRope May 23 '18 at 18:24
  • You should be clearer what your tests are doing. You are testing an indirect function (or possibly 2), running in parallel with a simple xor loop with variable count given by `LOOP`. So increasing `LOOP` values on your chart don't reflect increasing function calls, but only increasing trip counts for your "dummy loop". – BeeOnRope May 23 '18 at 18:27
  • @BeeOnRope: Note the `ans = 0;` before the `j` loop. – Peter Cordes May 23 '18 at 20:55
  • @Peter `LOOP` is not a compile-time constant despite its uppercase naming. Commenting out `ans = 0` doesn't change much, despite apparently coupling the iterations together. (I think that's the more important effect: the call of the function is mostly independent anyway, with only the return value contributing one xor to the `ans` dependency chain. With the `ans = 0` the chain isn't loop-carried, so you'd expect the xor loop to be able to run in parallel with itself, but I don't see much effect. I'll try a deeper test later.) – BeeOnRope May 23 '18 at 21:19
  • BTW my impression of what the OP is trying to do is to measure the time of a (possibly randomized) indirect call, parameterized by some other work that runs interleaved with the call. I think that, using the associative and commutative properties of xor, the compiler could actually move the entire xor loop out of all the loops and calculate it once, if it knew that LOOP didn't escape the function. In my test it doesn't escape, but in the OP's it does due to scanf. The compiler wasn't that smart, though. – BeeOnRope May 23 '18 at 21:24
  • @BeeOnRope: Oh right, not a constant. But [there is a closed-form solution for `xor(1..n)`](https://stackoverflow.com/questions/3932346/direct-formula-for-summing-xor) (sketched just after these comments). It's not as simple as `n*(n-1)/2` for addition, but it's only ~9 instructions long (https://godbolt.org/g/Lg8Ff6) and would be worth using if compilers spent time looking for that very rare pattern. (Which they don't and probably shouldn't!) – Peter Cordes May 23 '18 at 21:51
  • @BeeOnRope: Your impression is right. The only difference between the two loops is `ans ^= (pool[t])(0, 0, 0, 0, 0)`, and that is the part I want to measure the time of. More specifically, `per-exec = (exec - no-exec) / (times of exec)`, aka `(d2 - d1) / (1.0 * RULE * ITER)`. – Leo Jacob May 24 '18 at 09:38
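
To make Peter Cordes's closed-form comment above concrete, here is a minimal sketch of xor(0 .. n-1), which is what the dummy loop leaves in `ans` (the helper name `xor_upto` is made up here; the case table is the standard n mod 4 pattern):

int xor_upto(int n)          // xor of 0, 1, ..., n-1
{
    int m = n - 1;           // reduce to xor of 0..m
    switch (m & 3)           // xor(0..m) repeats with period 4 in m
    {
    case 0:  return m;
    case 1:  return 1;
    case 2:  return m + 1;
    default: return 0;       // m % 4 == 3 (also covers n == 0)
    }
}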

1 Answer


Why do all machines have two stages (LOOP < 25 and LOOP > 100)?

The first discontinuity occurs when the innermost loop, `for(j = 0;j < LOOP;j += 1)`, stops predicting its exit correctly. On my machine the transition happens when LOOP hits 24 iterations.
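
If you want to see this cliff in isolation, here is a minimal standalone sketch of the kind of test I mean (my own harness, not the OP's code; the trip-count range and repeat count are arbitrary choices):

#include <cstdio>
#include <chrono>

int main()
{
    const long OUTER = 10000000;               // repeats per trip count, to average out noise
    for (int loop = 16; loop <= 32; ++loop)
    {
        volatile int sink = 0;                 // keeps the XOR loop from being optimized away
        auto t0 = std::chrono::steady_clock::now();
        for (long r = 0; r < OUTER; ++r)
        {
            int ans = 0;
            for (int j = 0; j < loop; ++j)     // the same dummy loop as in the question
                ans ^= j;
            sink = ans;
        }
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        printf("LOOP=%2d  %.3f ns/inner-iteration\n", loop, ns / ((double)OUTER * loop));
    }
    return 0;
}

The jump in time per inner iteration near 24-25 trips is the loop-exit branch starting to mispredict.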

You can see this pretty clearly with `perf stat -I3000` to interleave the benchmark output with perf stats:

BenchWithFixture/RandomTarget/21     727779 ns     727224 ns       3851   78.6834M items/s
    45.003283831        2998.636997      task-clock (msec)                                           
    45.003283831                118      context-switches          #    0.039 K/sec                  
    45.003283831                  0      cpu-migrations            #    0.000 K/sec                  
    45.003283831                  0      page-faults               #    0.000 K/sec                  
    45.003283831      7,777,209,518      cycles                    #    2.595 GHz                    
    45.003283831     26,846,680,371      instructions              #    3.45  insn per cycle         
    45.003283831      6,711,087,432      branches                  # 2238.882 M/sec                  
    45.003283831          1,962,643      branch-misses             #    0.03% of all branches        
BenchWithFixture/RandomTarget/22     751421 ns     750758 ns       3731   76.2169M items/s
    48.003487573        2998.943341      task-clock (msec)                                           
    48.003487573                111      context-switches          #    0.037 K/sec                  
    48.003487573                  0      cpu-migrations            #    0.000 K/sec                  
    48.003487573                  0      page-faults               #    0.000 K/sec                  
    48.003487573      7,778,285,186      cycles                    #    2.595 GHz                    
    48.003487573     26,956,175,646      instructions              #    3.47  insn per cycle         
    48.003487573      6,738,461,171      branches                  # 2247.947 M/sec                  
    48.003487573          1,973,024      branch-misses             #    0.03% of all branches        
BenchWithFixture/RandomTarget/23     774490 ns     773955 ns       3620   73.9325M items/s
    51.003697814        2999.024360      task-clock (msec)                                           
    51.003697814                105      context-switches          #    0.035 K/sec                  
    51.003697814                  0      cpu-migrations            #    0.000 K/sec                  
    51.003697814                  0      page-faults               #    0.000 K/sec                  
    51.003697814      7,778,570,598      cycles                    #    2.595 GHz                    
    51.003697814     21,547,027,451      instructions              #    2.77  insn per cycle         
    51.003697814      5,386,175,806      branches                  # 1796.776 M/sec                  
    51.003697814         72,207,066      branch-misses             #    1.12% of all branches        
BenchWithFixture/RandomTarget/24    1138919 ns    1138088 ns       2461   50.2777M items/s
    57.004129981        2999.003582      task-clock (msec)                                           
    57.004129981                108      context-switches          #    0.036 K/sec                  
    57.004129981                  0      cpu-migrations            #    0.000 K/sec                  
    57.004129981                  0      page-faults               #    0.000 K/sec                  
    57.004129981      7,778,509,575      cycles                    #    2.595 GHz                    
    57.004129981     19,061,717,197      instructions              #    2.45  insn per cycle         
    57.004129981      4,765,017,648      branches                  # 1589.492 M/sec                  
    57.004129981        103,398,285      branch-misses             #    1.65% of all branches        
BenchWithFixture/RandomTarget/25    1171572 ns    1170748 ns       2391   48.8751M items/s
    60.004325775        2998.547350      task-clock (msec)                                           
    60.004325775                111      context-switches          #    0.037 K/sec                  
    60.004325775                  0      cpu-migrations            #    0.000 K/sec                  
    60.004325775                  0      page-faults               #    0.000 K/sec                  
    60.004325775      7,777,298,382      cycles                    #    2.594 GHz                    
    60.004325775     17,008,954,992      instructions              #    2.19  insn per cycle         
    60.004325775      4,251,656,734      branches                  # 1418.230 M/sec                  
    60.004325775        131,311,948      branch-misses             #    2.13% of all branches
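
(The `-I3000` flag makes perf print a fresh statistics block every 3000 ms while the benchmark runs, which is what produces the interleaving above; the events shown are `perf stat`'s defaults.)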

Before the transition, the branch misprediction rate is about 0.03%; it jumps to about 2.13% right when the benchmark slows down, an increase of two orders of magnitude. The mispredict rate is actually a bit lower than you'd expect: with 25 iteration branches (plus a couple more for the outer loops) you'd expect around 1 / 25 == 4% mispredicts, but we don't see that; I'm not sure why.

On my machine, the first loop (with only the `pool[0](0,0,0,0,0)` call), like yours, doesn't have the transition at ~24 LOOP iterations, but why is unclear to me. My experience is that the TAGE counters usually can't handle constant-iteration loops longer than 24-ish iterations, but there may be some interaction with the indirect branch predictor here. It is quite interesting.

Why do all no-exec times have a weird peak when 32 <= LOOP <= 41?

I also experienced this locally. In my test it was likewise due to branch mispredictions: when the time spiked, so did the mispredict rate. Again, how the prediction manages to work (so well) outside this range isn't clear, but apparently at these values the prediction algorithm fails.

Why do the no-exec and exec times of the Kaby Lake machines (M1 and M2) have a discontinuous interval when 72 <= LOOP <= 94?

I experienced the same thing: LOOP == 72 ran at 28M inner loops per second, but LOOP == 73 ran at only 20M (and subsequent values were slow too). Again, the difference can be placed at the feet of branch mispredictions: they increased from 0.01% to 1.35% going from 72 to 73. That's almost exactly one misprediction per run of the inner j loop, so it is likely a mispredict on the loop exit.
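
As a sanity check on that arithmetic: with LOOP == 73, each run of the j loop executes about 74 conditional branches (73 taken plus one not-taken exit), and 1 / 74 ≈ 1.35%, matching the observed rate.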

Why does M4 (a server processor) have a larger variance than M3 (a desktop processor)?

Your test is very long, so there is plenty of opportunity to experience various sources of variance, such as interrupts, context switches, core frequency changes, and so on.

Since these are totally different hardware and perhaps software configurations, it is not surprising to see unequal variance. You could reduce the number of outer iterations to keep the benchmark shorter and see whether the number of outliers drops (though their relative magnitude would increase). You could also move the timing inside the outer loop, so that you time a smaller section, and then look at a histogram to see how the results of the various systems are distributed over this smaller interval, as in the sketch below.
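
Here is a minimal sketch of that last idea, reusing the `getTime`/`delta` helpers and the variables from the question's main (the bucket width and count are arbitrary and would need tuning to your timing range):

// time each outer iteration separately and bucket the results
const int BUCKETS = 50;
int hist[BUCKETS] = { 0 };
const double bucket_ms = 0.05;             // arbitrary bucket width

for (r = 0; r < RULE; r += 1)
{
    getTime(&begin_time);
    for (t = 0; t < ITER; t += 1)
    {
        ans ^= pool[t](0, 0, 0, 0, 0);     // simplified loop body
        ans = 0;
        for (j = 0; j < LOOP; j += 1)
            ans ^= j;
    }
    getTime(&end_time);
    int b = (int)(delta(begin_time, end_time) / bucket_ms);
    if (b >= BUCKETS) b = BUCKETS - 1;     // clamp outliers into the last bucket
    hist[b] += 1;
}
for (i = 0; i < BUCKETS; i += 1)
    printf("%6.2f ms: %d\n", i * bucket_ms, hist[i]);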

For a deeper look at sources of variance and how to diagnose them, have a look at this answer.

BeeOnRope