
I want to measure the duration of a piece of code with a std::chrono clock, but it seems too heavy for measuring something that lasts nanoseconds. This program:

#include <cstdio>
#include <chrono>

int main() {
    using clock = std::chrono::high_resolution_clock;

    // try several times
    for (int i = 0; i < 5; i++) {
        // two consecutive now() calls here, one right after another with nothing in between
        printf("%dns\n", (int)std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - clock::now()).count());
    }
    return 0;
}

It always gives me around 100-300ns. Is this because of the two syscalls? Is it possible to get a smaller duration between two now() calls? Thanks!

Environment: Linux Ubuntu 18.04, kernel 4.18, load average is low, stdlib is linked dynamically.

Juster
  • Run the "fast" code multiple times so it costs more to run it than it takes to time it? – NathanOliver Mar 15 '19 at 19:20
  • I updated the code sample and added a loop. It doesn't change a lot: the first iteration is slower, but then it runs at around 100-200ns anyway. Also notice that there's no code at all between the calls; it's two now() calls, one right after another. – Juster Mar 15 '19 at 19:23
  • @Juster NathanOliver's suggestion is to make a single measurement for the entire loop and divide that measurement by the number of iterations such that you get an average time. Edit: It doesn't apply here (you're measuring the time it takes to measure itself) but it could for whatever you eventually want to actually measure. – François Andrieux Mar 15 '19 at 19:24
  • The average time for that program will be ~200ns even for empty code. – Juster Mar 15 '19 at 19:25
  • It's just now() - now(), nothing in between. You may call it a million times and take an average, and it will be around 200ns anyway. – Juster Mar 15 '19 at 19:26
  • @Juster I mean run the code you are trying to time multiple times. Let's say the code you want to test should take 10ns to run: call `now`, store the result, run the code a million times, call `now` again, and take the difference. Then divide that result by a million to get a more accurate idea of what it is doing. – NathanOliver Mar 15 '19 at 19:27
  • @NathanOliver Beware that on the scale of 10ns, loop overhead may become dominant. You may need to manually unroll a large number of iterations so the overhead of looping is relatively less significant. – François Andrieux Mar 15 '19 at 19:31
  • I recently made similar measurements of `std::chrono` on RHEL with gcc-8.2 and found `now()` from both `system_clock` and `steady_clock` took 275ns. – John Mar 15 '19 at 19:32
  • @NathanOliver, there's no code I'm trying to measure; I'm trying to understand why now() - now() takes 200ns. – Juster Mar 15 '19 at 19:34
  • @FrançoisAndrieux Sure. There is even the overhead of function calling that will get in the way and may require manual inlining. Micro-benchmarking is almost half art and half science. – NathanOliver Mar 15 '19 at 19:34
  • @Juster I was answering *If not, how is it better to measure time of a fast (tens and hundreds of nanoseconds) piece of code?*. You can still use `std::chrono::high_resolution_clock`; you just need to spread the cost of calling it over enough tests that it basically averages away. It's like Big O notation and ignoring the constant factor. – NathanOliver Mar 15 '19 at 19:36
  • @NathanOliver right, thank you! – Juster Mar 15 '19 at 19:39
  • Are you trying to benchmark a snippet in development or instrument high-performance code for deployment? – John Mar 15 '19 at 19:45
  • Test an optimized build, not a debug build. – Jesper Juhl Mar 15 '19 at 19:52
  • To combine the answers below that read low-level CPU ticks (e.g. `rdtsc` and friends) with the `<chrono>` infrastructure, see https://stackoverflow.com/a/11485388/576911. This gives you a high-level chrono interface built on a low-level timer; a minimal sketch follows these comments. – Howard Hinnant Mar 15 '19 at 22:42
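
A minimal sketch of the kind of wrapper Howard Hinnant's comment describes: a chrono-compatible clock built on top of the TSC. The tsc_clock name and the hard-coded 4.2 GHz tick period are illustrative assumptions (a real implementation would calibrate the TSC frequency at startup; see the linked answer), and GCC/Clang on an x86 constant-TSC CPU is assumed:

#include <chrono>
#include <cstdint>
#include <iostream>
#include <ratio>

// Sketch only: a chrono-style clock whose ticks are TSC cycles.
// The period assumes a hypothetical 4.2 GHz constant-TSC CPU.
struct tsc_clock {
    using rep        = std::int64_t;
    using period     = std::ratio<1, 4200000000>; // assumed 4.2 GHz
    using duration   = std::chrono::duration<rep, period>;
    using time_point = std::chrono::time_point<tsc_clock>;
    static constexpr bool is_steady = true;

    static time_point now() noexcept {
        // GCC/Clang builtin, x86 only
        return time_point(duration(__builtin_ia32_rdtsc()));
    }
};

int main() {
    auto t0 = tsc_clock::now();
    auto t1 = tsc_clock::now();
    // the usual chrono machinery works unchanged on tsc_clock durations
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() << "ns\n";
}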

3 Answers


Use the rdtsc instruction to measure time with the highest resolution and the least overhead possible:

#include <iostream>
#include <cstdint>

int main() {
    uint64_t a = __builtin_ia32_rdtsc();
    uint64_t b = __builtin_ia32_rdtsc();
    std::cout << b - a << " cpu cycles\n";
}

Output:

19 cpu cycles

To convert cycles to nanoseconds, divide the cycle count by the base CPU frequency in GHz. For example, for a 4.2GHz i7-7700k, divide by 4.2.
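
For instance, a quick sketch of that conversion, hard-coding the 4.2 GHz figure above (substitute your own CPU's base frequency):

#include <cstdio>
#include <cstdint>

int main() {
    const double base_ghz = 4.2; // assumed base frequency in GHz
    uint64_t a = __builtin_ia32_rdtsc();
    uint64_t b = __builtin_ia32_rdtsc();
    printf("%.1fns\n", (b - a) / base_ghz); // cycles / GHz = nanoseconds
}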

TSC is a global counter in the CPU shared across all cores.

Modern CPUs have a constant TSC that ticks at the same rate regardless of the current CPU frequency and boost. Look for constant_tsc in the flags field of /proc/cpuinfo.

Also note that __builtin_ia32_rdtsc is more efficient than inline assembly; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48877

John
Maxim Egorushkin
  • Interesting, thanks! How will that work when the CPU frequency is in "scaling" mode? – Juster Mar 15 '19 at 19:40
  • @Juster Modern CPUs have constant TSC. Look for `constant_tsc` in `/proc/cpuinfo`. – Maxim Egorushkin Mar 15 '19 at 19:41
  • What happens if the OS reschedules your process on another CPU? There's no guarantee different CPUs have synchronized TSCs. – janneb Mar 15 '19 at 19:51
  • @janneb I cannot tell about multiple CPU sockets, see https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/706346 for more info. But when your thread is de-scheduled, your time measurement becomes invalid in any case. – Maxim Egorushkin Mar 15 '19 at 19:53

If you want to measure the duration of very fast code snippets, it is generally a good idea to run them multiple times and take the average time over all runs. The ~200ns overhead that you mention then becomes negligible because it is distributed over all runs.

Example:

#include <cstdio>
#include <chrono>

void functionYouWantToTime() { /* the code you want to benchmark */ }

int main() {
    using clock = std::chrono::high_resolution_clock;

    const int n = 10000; // adjust depending on the expected runtime of your code
    auto start = clock::now();
    for (int i = 0; i < n; ++i)
        functionYouWantToTime();
    auto result =
        std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - start).count() / n;
    printf("%lldns per call\n", (long long)result);
}
AlbertM
  • No, the ~200ns is a fixed overhead from clock functions. Juster wants to know whether there is a way to reduce the overhead (in order to measure small functions). It seems not possible with the current `clock` classes. – Yongwei Wu Apr 22 '20 at 04:48
  • @YongweiWu I am fully aware that this is a fixed overhead. If you take a look at the example code in my answer you will see that I am addressing exactly this problem: I only call the clock function once, use it to time a large number of function executions and subsequently calculate the average time per function call. Therefore the overhead becomes insignificant when you choose a sufficiently large `n`. – AlbertM Apr 23 '20 at 09:38
  • Albert, I am sorry for not reading your answer more carefully. Yes, it may solve the problem, but only when you can repeat the function call many times and measure the total time from outside the function. It is not quite feasible when you are profiling in a real running program. I am not sure about Juster's case, but I came here because I had problems profiling with high_resolution_clock. I had to switch to using rdtsc. – Yongwei Wu Apr 24 '20 at 06:26

Just do not use time clocks for nanosecond benchmarks. Instead, use CPU ticks: on any hardware modern enough to worry about nanoseconds, CPU ticks are monotonic, steady and synchronized between cores.

Unfortunately, C++ does not expose a CPU tick clock, so you'd have to use the RDTSC instruction directly (it can be nicely wrapped in an inline function, as sketched below, or you can use the compiler's intrinsics). The difference in CPU ticks can also be converted into time if you so desire (by using the CPU frequency), but normally for such low-latency benchmarks it is not necessary.
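
A minimal sketch of such a wrapper, assuming GCC/Clang on x86 (the cpu_ticks name is illustrative):

#include <cstdint>
#include <cstdio>

// Thin inline wrapper over the compiler intrinsic for RDTSC.
inline std::uint64_t cpu_ticks() {
    return __builtin_ia32_rdtsc();
}

int main() {
    std::uint64_t start = cpu_ticks();
    // ... code under test ...
    std::uint64_t end = cpu_ticks();
    std::printf("%llu ticks\n", (unsigned long long)(end - start));
}

Staying in ticks avoids the conversion entirely, which matches the point above that converting to time is usually unnecessary for low-latency benchmarks.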

SergeyA