
I wrote a timing function that records the run time of a function and calculates the mean and the standard deviation over multiple runs. I was surprised to find very high standard deviations, even for seemingly simple tasks such as adding two doubles. I analysed the data in Python (see the plots). The C++ output was 19.6171 ns +/- 21.9653ns (82799807 runs) when compiled with:

gcc version 8.3.0 (Debian 8.3.0-19)
/usr/bin/c++ -O3 -DNDEBUG -std=gnu++17

The whole test was done on my personal computer, which was not idle but running a DE, a browser, my IDE and other processes. There was free RAM available during the test, though. My dual-core CPU with hyper-threading was idling below 10% usage.
Is a spike from an average value of 20 ns up to 50 µs to be expected in this situation?

Plot of run times
This is the content of std::vector<double> run_times. I don't see any pattern.

Histogram of run times
Note the logarithmic y axis (number of samples in each bin).

timing.h

#include <cstdint>
#include <ostream>
#include <cmath>
#include <algorithm>
#include <vector>
#include <chrono>
#include <numeric>
#include <fstream>


struct TimingResults{
    // all time results are in nanoseconds
    double mean;
    double standard_deviation;
    uint64_t number_of_runs;
};


std::ostream& operator<<(std::ostream& os, const TimingResults& results);


template <typename InputIterator>
std::pair<typename InputIterator::value_type, typename InputIterator::value_type> 
calculate_mean_and_standard_deviation(InputIterator first, InputIterator last){
    // two-pass algorithm: compute the mean first, then the sample
    // standard deviation (Bessel's correction: divide by n - 1)
    double mean = std::accumulate(first, last, 0.) / std::distance(first, last);
    double sum = 0;
    std::for_each(first, last, [&](double x){sum += (x - mean) * (x - mean);});
    return {mean, std::sqrt(sum / (std::distance(first, last) - 1))};
}


template<uint64_t RunTimeMilliSeconds = 4000, typename F, typename... Args>
TimingResults measure_runtime(F func, Args&&... args){
    std::vector<double> runtimes;
    std::chrono::high_resolution_clock::time_point b;  // same clock as the measurements below
    auto start_time = std::chrono::high_resolution_clock::now();
    do {
        auto a = std::chrono::high_resolution_clock::now();
        func(std::forward<Args>(args)...);
        b = std::chrono::high_resolution_clock::now();
        runtimes.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(b - a).count());
    } while (std::chrono::duration_cast<std::chrono::milliseconds>(b-start_time).count() <= RunTimeMilliSeconds);
    auto [mean, std_deviation] = calculate_mean_and_standard_deviation(runtimes.begin(), runtimes.end());
    return {mean, std_deviation, runtimes.size()};
}

timing.cpp

#include <iostream>
#include "timing.h"


std::ostream& operator<<(std::ostream& os, const TimingResults& results){
    return os << results.mean << " ns" << " +/- " << results.standard_deviation << "ns ("
    << results.number_of_runs << " runs)";
}

main.cpp

#include "src/timing/timing.h"
#include <iostream>


int main(){
    auto res = measure_runtime([](double x, double y){return x * y;}, 6.9, 9.6);
    std::cout << res;
}
Olaf
    In short, the variation is probably because of your approach of measuring timing and capturing results. Your loop is doing time checks - including as part of the loop condition - and pushing values into a vector (which does dynamic memory allocation) on every iteration. Accesses of a clock to check timing will have jitter unless your code is being executed on a hard real time platform. Dynamic memory allocation and deallocation (which will happen eventually if enough elements are pushed to a vector) is also non-deterministic on most systems, so the time taken will vary. – Peter Aug 28 '19 at 12:12
    I don't think `vector`'s memory allocations have anything to do with it. They happen outside the measurement interval (given by the two calls to `now()`). – Jonas Greitemann Aug 28 '19 at 12:20
  • @Peter: on modern x86, (like the OP's Debian system with an Intel CPU), the timesource for `now()` should be a RDTSC instruction in user-space (with scale factors exported by the kernel via the VDSO page). There's still quite a lot of instructions that have to run if you single-step into it, but nowhere near as slow as a system call. I wouldn't really expect jitter in the clock itself, just in the code that accesses it. The TSC ticks at near the reference frequency (not max turbo) so it does have ~0.25ns precision on a 4GHz CPU. [related](//stackoverflow.com/q/13772567) – Peter Cordes Aug 28 '19 at 12:26
    @Peter Yes, I kept only the function between the measuring intervals `a` and `b`. The other timing code just limits the total runtime to 4 seconds. – Olaf Aug 28 '19 at 12:27
  • And yes @Olaf: spikes are totally expected because of interrupts. Even "hard realtime" doesn't mean dependable timing down to the clock-cycle. It just puts a cap on worst-case latency. 50 us still sounds reasonable for most hard-realtime systems. – Peter Cordes Aug 28 '19 at 12:28
    @Jonas: a system call to allocate new memory may push this loop and/or the `now()` function out of uop cache and even L1-instruction cache, and maybe the iTLB. And also (with Spectre mitigation) make branch-prediction cold. Although you'd expect most of the cost to be before the `rdtsc` in the start-the-clock `now()`, i.e. outside the timed interval. – Peter Cordes Aug 28 '19 at 12:38

2 Answers


Modern CPUs easily perform on the order of several 10^9 FLOPS, i.e. the expected time for one operation is below 1 ns. This, however, refers to peak performance. For most real-world workloads, the performance is going to be much lower, owing to memory and cache effects.

The problem with your benchmark is that you are timing individual operations. The overhead of getting the time points a and b likely exceeds the time you are actually trying to measure. Additionally, even std::chrono::high_resolution_clock is not going to give you picosecond accuracy (though that is, in principle, implementation- and hardware-dependent). The obvious fix is to perform the operation N times, time that, and then divide the total time by N. At some point, you'll see that your results become consistent. (Feel free to post your results.)
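
A minimal sketch of that batching idea, reusing the <chrono> machinery from timing.h (the batch size N and the helper name measure_batched_ns are illustrative, not part of the original code):

// Time N calls per sample instead of one, then divide by N.
// The clock overhead is paid once per batch and amortized over the N calls.
template <uint64_t N = 10000, typename F, typename... Args>
double measure_batched_ns(F func, Args... args){
    auto a = std::chrono::high_resolution_clock::now();
    for (uint64_t i = 0; i < N; ++i)
        func(args...);   // note: the result is still unused -- see the other answer
    auto b = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::nano>(b - a).count() / N;
}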

TL;DR: You are trying to time a lightning bolt with a pocket watch.

Jonas Greitemann
    More importantly, out-of-order execution can overlap one FP add with the timing overhead. **The `return x*y` result is also never used, so probably optimized away after `func` inlines into the timing loop**, or at least hoisted out of the loop. We literally can't tell from looking at the timing results! Also the inputs are compile-time constants, so there's no need to eval it at runtime. – Peter Cordes Aug 28 '19 at 12:20
  • @PeterCordes Two very good points. Optimizations should probably be turned off to avoid this. (Or look at the assembly to check work is actually being done). Out-of-order execution should be less of an issue once multiple operations are executed in a loop before taking the time. – Jonas Greitemann Aug 28 '19 at 12:25
    No, disabling optimizations would turn this into a microbenchmark of call/ret through a nest of C++ functions, and maybe store-forwarding latency. Benchmarking with optimizations disabled is useless. Typically you need to use inline asm to force the compiler to materialize a value in a register repeatedly in a loop, and/or forget what it knows about a variable's value to make it redo a calculation instead of hoisting it. e.g. see ["Escape" and "Clobber" equivalent in MSVC](//stackoverflow.com/q/33975479) (not the MSVC part, just the part in the question showing useful GNU C inline asm). – Peter Cordes Aug 28 '19 at 12:31
    And yes usually you should put stuff in a repeat loop. You have to choose whether you want to measure throughput or latency (by making the operations dependent on the previous or not). **Microbenchmarking is *hard***, and for this simple an operation doesn't make any sense at the C++ level, only when you also consider the asm for the specific CPU it will run on. – Peter Cordes Aug 28 '19 at 12:33

TL;DR: Your entire approach is too simplistic to tell you anything useful. Timing overhead would dominate even if your multiply wasn't optimized away.


Microbenchmarking is non-trivial even in hand-written asm. It's impossible in C++ if you don't understand how your C++ compiles to asm for your target platform, for an operation as simple / cheap as x * y.

You aren't using the result, so maybe you were trying to measure throughput (instead of latency). But with only one multiply inside the timed interval, there's no chance for superscalar / pipelined execution to happen.

Even more fundamentally, you don't use the result so there's no need for the compiler to even compute it at all. And even if you did, after inlining from that C++ header the operands are compile-time constants, so the compiler will do it once at compile time instead of with a mulsd instruction at run time. And even if you made the args in main come from atof(argv[1]) or something, the compiler could hoist the computation out of the loop.

Any one of those 3 microbenchmark pitfalls would lead to timing an empty interval with no work between the two now() calls, other than saving the first now() result in a different register. You have all 3 problems.

You're literally timing an empty interval and still getting this much jitter because of the occasional interrupt, and because of the relatively high overhead of the library function wrapped around clock_gettime, which ultimately runs an rdtsc instruction and scales the result using values exported by the kernel. Fortunately it can do this in user-space, without actually using a syscall instruction to enter the kernel. (The Linux kernel exports code + data in the VDSO pages.)

Directly using rdtsc inside a tight loop does give fairly repeatable timings, but still has pretty high overhead relative to mulsd. (How to get the CPU cycle count in x86_64 from C++?).
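
A minimal sketch of that approach, assuming x86-64 with GCC or Clang (the iteration count and the multiplied constant are illustrative):

#include <x86intrin.h>   // __rdtsc()
#include <cstdint>
#include <cstdio>

int main(){
    const long iters = 1000000;
    double acc = 1.0;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; ++i)
        acc *= 1.000000001;          // one long dependent multiply chain
    uint64_t t1 = __rdtsc();
    // TSC ticks of the reference clock, not core cycles or nanoseconds;
    // printing acc keeps the loop from being optimized away
    std::printf("acc = %g, %.2f ticks per multiply\n", acc, double(t1 - t0) / iters);
}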

Your mental model of execution cost is probably wrong at this level of detail. You can't just time individual operations and then add up their costs. Superscalar pipelined out-of-order execution means you have to consider throughput vs. latency, and lengths of dependency chains. (And also front-end bottlenecks vs. the throughput of any one kind of instruction, or execution port).
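
To make that concrete, here is a rough sketch (the constants and the choice of four accumulators are illustrative): the first loop is a single dependency chain and is limited by mulsd latency; the second keeps independent chains in flight and is limited by multiply throughput instead.

// Latency-bound: iteration i+1 can't start until iteration i's result is ready.
double mul_latency(double x, long n){
    double acc = x;
    for (long i = 0; i < n; ++i)
        acc *= 1.0000000001;
    return acc;
}

// Throughput-bound: four independent chains that out-of-order execution overlaps.
double mul_throughput(double x, long n){
    double a = x, b = x + 1, c = x + 2, d = x + 3;
    for (long i = 0; i < n; i += 4){
        a *= 1.0000000001;
        b *= 1.0000000001;
        c *= 1.0000000001;
        d *= 1.0000000001;
    }
    return a + b + c + d;
}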


And no, disabling optimizations is not useful. That would turn this into a microbenchmark of call/ret through a nest of C++ functions, and maybe store-forwarding latency.

Benchmarking with optimizations disabled is useless. Typically you need to use inline asm to force the compiler to materialize a value in a register repeatedly in a loop, and/or forget what it knows about a variable's value to make it redo a calculation instead of hoisting it. e.g. see "Escape" and "Clobber" equivalent in MSVC (not the MSVC part, just the part in the question showing useful GNU C inline asm).
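
For reference, the GNU C inline asm helpers shown in that question look roughly like this (a sketch: the empty asm statements emit no instructions, they only tell the optimizer the value or all of memory might be observed or modified):

static inline void escape(void* p){
    // pretend unknown code might read or write *p
    asm volatile("" : : "g"(p) : "memory");
}

static inline void clobber(){
    // pretend unknown code might read or write any memory
    asm volatile("" : : : "memory");
}

// usage sketch inside a timed region:
//   double result = x * y;
//   escape(&result);   // the multiply now has an observable result and can't be dropped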

Peter Cordes