
I was benchmarking a function and I noticed that some iterations are slower than others.

After some tests, I tried benchmarking two consecutive measurements, and I still got some strange results.

The code is on wandbox.

For me, the important part is:

using clock = std::chrono::steady_clock;
// ...
for (int i = 0; i < statSize; i++)
{
    auto t1 = clock::now();
    auto t2 = clock::now();
}

The loop is optimized away as we can see on godbolt.

call std::chrono::_V2::steady_clock::now()
mov r12, rax
call std::chrono::_V2::steady_clock::now()

The code was compiled with:

g++  bench.cpp  -Wall  -Wextra -std=c++11 -O3

and gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) on an Intel® Xeon® W-2195 Processor.

I was the only user on the machine, and I tried running with and without high priority (nice or chrt); the results were the same.

The result I got with 100 000 000 iterations was:

[plot: 100 000 000 iterations]

The Y-axis is in nanoseconds; it is the result of the line

std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()

These 4 lines make me think of: no cache miss / L1 / L2 / L3 cache misses (even if the “L3 cache miss” line seems to be too close to the L2 line).

I am not sure why there would be cache misses; maybe it's the storage of the result, but that's not in the measured code.

I tried running the program 10 000 times with a loop of 1500, because the L1 cache of this processor is:

lscpu | grep L1 
L1d cache:             32K
L1i cache:             32K

And 1500 * 16 bits = 3 000 bytes, which is well under 32K, so there shouldn't be any cache misses.

And the results:

[plot: 10 000 runs of the program with a loop of 1500]

I still have my 4 lines (and some noise).

So if it really is a cache miss, I have no idea why it is happening.

I don't know if it's useful to you, but I ran:

sudo perf stat -e cache-misses,L1-dcache-load-misses,L1-dcache-load  ./a.out 1000

with the values 1 000 / 10 000 / 100 000 / 1 000 000.

I got between 4.30% and 4.70% of all L1-dcache hits, which seems pretty decent to me.

So the questions are:

  • What is the cause of these slowdowns?
  • How can I produce a reliable benchmark of a function when I can't even get a constant time for a no-op?

PS: I don't know if I am missing useful information / flags, feel free to ask!


How to reproduce:

  1. The code:

    #include <iostream>
    #include <chrono>
    #include <vector>
    
    int main(int argc, char **argv)
    {
        int statSize = 1000;
        using clock = std::chrono::steady_clock;
        if (argc == 2)
        {
            statSize = std::atoi(argv[1]);
        }
    
        std::vector<uint16_t> temps;
        temps.reserve(statSize);
        for (int i = 0; i < statSize; i++)
        {
    
            auto t1 = clock::now();
    
            auto t2 = clock::now();
    
            temps.push_back(
                std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count());
        }
    
        for (auto t : temps)
            std::cout << (int)t << std::endl;
    
        return (0);
    }
    
  2. Build:

    g++  bench.cpp  -Wall  -Wextra -std=c++11 -O3
    
  3. Generate output (sudo needed):

    In this case I run the program 10 000 times. Each run takes 100 measurements, and I remove the first one, which is always about 5 times slower:

     for i in {1..10000} ; do sudo nice -n -17 ./a.out 100 | tail -n 99  >> fast_1_000_000_uint16_100 ; done
    
  4. Generate graph:

    cat fast_1_000_000_uint16_100 | gnuplot -p -e "plot '<cat'"
    
  5. The result I have on my machine:

[plot of the results]


Where I am after Zulan's answer and all the comments

The current_clocksource is set to tsc and no switch is seen in dmesg; command used:

dmesg -T | grep tsc

I used this script to disable HyperThreading (HT), then:

grep -c proc /proc/cpuinfo
=> 18

Subtract 1 from the last result to obtain the last available core:

=> 17

Edit /etc/default/grub and add isolcpus=(last result) to GRUB_CMDLINE_LINUX:

GRUB_CMDLINE_LINUX="isolcpus=17"

Finally:

sudo update-grub
reboot 
# re-execute the script

Now I can use:

taskset -c 17 ./a.out XXXX

So I ran the program 10 000 times, each run with a loop of 100 iterations:

for i in {1..10000} ; do sudo /usr/bin/time -v taskset -c 17 ./a.out 100  > ./core17/run_$i 2>&1 ; done

Check whether there are any involuntary context switches:

grep -L "Involuntary context switches: 0" result/* | wc -l
=> 0 

There are none, good. Let's plot:

for i in {1..10000} ; do cat ./core17/run_$i | head -n 99 >> ./no_switch_taskset ; done
cat no_switch_taskset | gnuplot -p -e "plot '<cat'"

Result:

[plot: 22 outliers (1000 ns or more) in 1 000 000 measurements]

There are still 22 measurements greater than 1000 (while most values are around 20) that I don't understand.

Next step, TBD: do the

sudo nice -n -17 perf record...

part of Zulan's answer.

Martin Morterol
  • The physical cache memory is shared among processes, so it's impossible to avoid cache misses completely (in a preemptive multitasking environment). – molbdnilo Jun 18 '19 at 08:03
  • It may have something to do with the scheduling of your OS. Maybe there were some higher-priority processes which preempted your process or prevented it from running again. – Yastanub Jun 18 '19 at 08:05
  • Oh, I forgot to say that I was the only user, and running with or without `nice` produces the same result; I will edit – Martin Morterol Jun 18 '19 at 08:05
  • Even if you *are* the only user, there still are other processes. Try `ps -e` in a terminal to see... – Aconcagua Jun 18 '19 at 08:14
  • @MartinMorterol It doesn't matter that you're the only user; there are a whole bunch of processes running concurrently with yours. (My machine has 138 processes with 240 threads in total running, with just a terminal open.) There is, to my knowledge, no way to use `nice` to claim the CPU for a single process. – molbdnilo Jun 18 '19 at 08:17
  • Yes, that's why I set a high priority. I know that's not perfect. Is there a way to know whether these slowdowns are caused by other processes? – Martin Morterol Jun 18 '19 at 08:18
  • @DanielLangr I am OK with the call time of `now` (which is about 20 nanoseconds on this machine); what bothers me is that there is a ×5000 slowdown from time to time – Martin Morterol Jun 18 '19 at 08:32
  • Nanoseconds, the result of `std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()`. I will update the post – Martin Morterol Jun 18 '19 at 08:37
  • Can you please add a [mcve] that includes how you collect the data during the *benchmark* - that's probably very influential. It's probably fine to just remove what is optimized away from the code at all. Ideally also add the visualization line so we can reproduce. – Zulan Jun 18 '19 at 08:42
  • minimal reproducible example added – Martin Morterol Jun 18 '19 at 09:06
  • "Yes, that why I put high priority. I know that not perfect" - well, what you have is the vast majority going through very quickly, and a small % that miss (actually a very small % given you can count the misses manually for 10^6 tests). This is perfectly representative of "good but not perfect". – UKMonkey Jun 18 '19 at 09:26
  • @UKMonkey, is there a way to be sure that's the cause? I mean, it's a reasonable guess, but if possible I would like to be sure. – Martin Morterol Jun 18 '19 at 09:36
  • @MartinMorterol https://stackoverflow.com/questions/10082517/simplest-tool-to-measure-c-program-cache-hit-miss-and-cpu-time-in-linux – UKMonkey Jun 18 '19 at 09:38

1 Answer


I can't reproduce it with these particular clustered lines, but here is some general information.

Possible causes

As discussed in the comments, nice on a normal idle system is just a best effort. You still have at least

  1. The scheduling tick timer

  2. Kernel tasks that are bound to a certain core

  3. Your task may be migrated from one core to another for an arbitrary reason

You can use isolcpus and taskset to get exclusive cores for certain processes to avoid some of that, but I don't think you can really get rid of all the kernel tasks. In addition, use nohz_full= (on the isolated cores) to disable the scheduling tick. You should also disable hyperthreading so that your thread has exclusive access to its physical core.

Except for taskset, which I absolutely recommend for any performance measurement, these are quite unusual measures.
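If you prefer to set the affinity from inside the program rather than wrapping it in taskset, a minimal Linux-only sketch (the core number 17 is just an example) could look like this:

#include <sched.h>     // sched_setaffinity, CPU_SET (glibc, Linux-specific)
#include <cstdio>

// Pin the calling thread to a single core so the scheduler cannot migrate it
// while the benchmark runs.
static bool pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;   // 0 = calling thread
}

int main()
{
    if (!pin_to_core(17))        // hypothetical core id, pick an isolated one
        std::perror("sched_setaffinity");
    // ... run the benchmark loop here ...
}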

Measure instead of guessing

If there is a suspicion of what could be happening, you can usually set up a measurement to confirm or disprove that hypothesis. perf and tracepoints are great for that. For example, we can start by looking at scheduling activity and some interrupts:

sudo nice -n -17 perf record -o perf.data -e sched:sched_switch -e irq:irq_handler_entry -e irq:softirq_entry ./a.out ...

perf script will now list every occurrence. To correlate that with slow iterations, you can use perf probe and a slightly modified benchmark:

void __attribute__((optimize("O0"))) record_slow(int64_t count)
{
    (void)count;
}

...

    auto count = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    if (count > 100) {
        record_slow(count);
    }
    temps.push_back(count);

And compile with -g

sudo perf probe -x ./a.out record_slow count

Then add -e probe_a:record_slow to the call to perf record. Now if you are lucky, you find some close events, e.g.:

 a.out 14888 [005] 51213.829062:    irq:softirq_entry: vec=1 [action=TIMER]
 a.out 14888 [005] 51213.829068:  probe_a:record_slow: (559354aec479) count=9029

Be aware: while this information will likely explain some of your observations, you enter a world of even more puzzling questions and oddities. Also, while perf is pretty low-overhead, there may be some perturbation of what you measure.

What are we benchmarking?

First of all, you need to be clear what you actually measure: The time to execute std::chrono::steady_clock::now(). It's actually good to do that to figure out at least this measurement overhead as well as the precision of the clock.

That's actually a tricky point. The cost of this function, with clock_gettime underneath, depends on your current clocksource [1]. If that's tsc you're fine - hpet is much slower. Linux may switch quietly [2] from tsc to hpet during operation.
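To get a rough idea of that overhead on your machine, you can look at the minimum over many back-to-back samples; a small sketch (not a rigorous measurement, just a sanity check):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <limits>

int main()
{
    using clock = std::chrono::steady_clock;
    std::int64_t best = std::numeric_limits<std::int64_t>::max();

    for (int i = 0; i < 1000000; ++i)
    {
        auto t1 = clock::now();
        auto t2 = clock::now();
        auto d = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
        if (d < best)
            best = d;              // keep the fastest back-to-back pair
    }
    std::cout << "best-case now() overhead: " << best << " ns\n";
}

With tsc as clocksource this should be in the tens of nanoseconds; with hpet it will be considerably higher.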

What to do to get stable results?

Sometimes you might need to do benchmarks with extreme isolation, but usually that's not necessary, even for very low-level micro-architecture benchmarks. Instead, you can use statistical effects: repeat the measurement. Use appropriate summary statistics (mean, quantiles); sometimes you may want to exclude outliers.
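For example, instead of interpreting individual samples you can summarize a whole run with the median and a high quantile; a small sketch, assuming the samples are already collected in a vector (like your temps):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Return the q-quantile (0 <= q <= 1) of the samples, sorting a copy.
static std::int64_t quantile(std::vector<std::int64_t> v, double q)
{
    std::sort(v.begin(), v.end());
    std::size_t idx = static_cast<std::size_t>(q * (v.size() - 1));
    return v[idx];
}

int main()
{
    // Hypothetical sample set with one outlier, in nanoseconds.
    std::vector<std::int64_t> temps = {20, 21, 19, 22, 20, 5000, 21};
    std::cout << "median: " << quantile(temps, 0.5)  << " ns\n"
              << "p99:    " << quantile(temps, 0.99) << " ns\n";
}

The median is barely affected by the rare slow iterations, whereas the mean (and of course the maximum) is.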

If the measurement kernel is not significantly longer than timer precision, you will have to repeat the kernel and measure outside to get a throughput rather than a latency, which may or may not be different.
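Concretely, that means timing a batch of repetitions with one pair of now() calls and dividing; a sketch where my_kernel stands in for whatever you actually want to benchmark:

#include <chrono>
#include <iostream>

// Placeholder for the code under test.
static void my_kernel()
{
    // real work goes here; an empty body would be optimized away
}

int main()
{
    using clock = std::chrono::steady_clock;
    const int reps = 1000000;

    auto t1 = clock::now();
    for (int i = 0; i < reps; ++i)
        my_kernel();               // repeated inside a single timed region
    auto t2 = clock::now();

    auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    std::cout << "average per call: " << total / static_cast<double>(reps) << " ns\n";
}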

Yes - benchmarking right is very complicated; you need to consider a lot of aspects, especially when you get closer to the hardware and your kernel times get very short. Fortunately there's some help; for example, Google's benchmark library provides a lot of help in terms of doing the right amount of repetitions and also in terms of having experiment factors.
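A minimal sketch with Google Benchmark (the work inside the loop is again a placeholder; the library itself picks a suitable number of iterations and reports statistics):

#include <benchmark/benchmark.h>

static void BM_MyKernel(benchmark::State& state)
{
    for (auto _ : state)               // the library decides how many iterations to run
    {
        int x = 42;                    // stand-in for the real work
        benchmark::DoNotOptimize(x);   // keep the compiler from removing it
    }
}
BENCHMARK(BM_MyKernel);
BENCHMARK_MAIN();

Link with -lbenchmark -lpthread when building.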

[1] /sys/devices/system/clocksource/clocksource0/current_clocksource

[2] Actually it's in dmesg as something like

clocksource: timekeeping watchdog on CPU: Marking clocksource 'tsc' as unstable because the skew is too large:

Zulan
  • I have added a part with `taskset`; I still need to do the `perf probe` part – Martin Morterol Jun 18 '19 at 17:35
  • Hi, sadly I can't set `nohz_full=`. If I remove all runs with involuntary context switches, does that work? I mean, if I understand correctly, when there is a scheduling tick, there should be a context switch too, no? – Martin Morterol Jun 19 '19 at 15:02
  • A scheduling tick does not necessarily imply a `sched_switch`. Unfortunately I don't know a tracepoint event at hand that would reliably and exclusively capture scheduling ticks. I suppose you can somewhat deduce it from regular intervals. – Zulan Jun 19 '19 at 16:57