In C++, with chrono, I need to measure small elapsed times with precision.
Googling this didn't help much, as the topic seems messy, so I ended up writing my own dummy code with 2 things in mind:
Check the timer resolution (see: How to get the precision of high_resolution_clock?).
Get maximum precision: use long double instead of double as the time count (see: long double vs double); a small sketch of this idea is below.
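To make the second point concrete, here is a minimal sketch (not the test program itself) of what I mean by a floating-point count: assigning a clock difference to a std::chrono::duration with a long double representation gives fractional counts without duration_cast truncation. The sleep is just placeholder work.

#include <chrono>
#include <iostream>
#include <ratio>
#include <thread>

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(5)); // placeholder work
    auto stop = std::chrono::high_resolution_clock::now();

    // Converting to a floating-point representation is implicit: no duration_cast needed.
    std::chrono::duration<long double> seconds = stop - start;
    std::chrono::duration<long double, std::milli> millis = stop - start;

    std::cout << seconds.count() << " s, " << millis.count() << " ms" << std::endl;
}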
Here is the code (test_elapse_measure.cpp):
#include <iostream>
#include <thread>
#include <chrono>
#include <cstdlib> // std::atoi

int main(int argc, char ** argv) {
  int n = (argc == 2) ? std::atoi(argv[1]) : 1;

  // Timer resolution: one tick of high_resolution_clock, in seconds.
  long double time_resolution = (long double) std::chrono::high_resolution_clock::period::num;
  time_resolution /= (long double) std::chrono::high_resolution_clock::period::den;
  std::cout << "time_resolution: " << time_resolution << " s" << std::endl;

  std::chrono::high_resolution_clock::time_point start, stop;
  {
    start = std::chrono::high_resolution_clock::now();
    std::this_thread::sleep_for(std::chrono::seconds(n));
    stop = std::chrono::high_resolution_clock::now();
    auto elaps = std::chrono::duration_cast<std::chrono::seconds>(stop - start);
    long double count = elaps.count(); // precision: long double.
    std::cout << "sleep: " << count << " s" << std::endl;
  }
  {
    start = std::chrono::high_resolution_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(n));
    stop = std::chrono::high_resolution_clock::now();
    auto elaps = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    long double count = elaps.count(); // precision: long double.
    std::cout << "sleep: " << count << " ms" << std::endl;
  }
  {
    start = std::chrono::high_resolution_clock::now();
    std::this_thread::sleep_for(std::chrono::nanoseconds(n));
    stop = std::chrono::high_resolution_clock::now();
    auto elaps = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    long double count = elaps.count(); // precision: long double.
    std::cout << "sleep: " << count << " ns" << std::endl;
  }
}
Running the code gives:
>> g++ -o test_elapse_measure test_elapse_measure.cpp
>> ./test_elapse_measure
time_resolution: 1e-09 s
sleep: 1 s
sleep: 1 ms
sleep: 94129 ns
>> ./test_elapse_measure 10
time_resolution: 1e-09 s
sleep: 10 s
sleep: 10 ms
sleep: 82093 ns
>> ./test_elapse_measure 100
time_resolution: 1e-09 s
sleep: 100 s
sleep: 100 ms
sleep: 70069 ns
Ideally, I need to measure elapsed times from 1 nanosecond up to several seconds. The timer resolution seems to say that I could measure nanoseconds (the clock seems precise enough), but all the nanosecond elapsed-time measurements are wrong?!
Why is that? What did I miss?
EDIT
Or is there no way to measure nanosecond elapsed times?
I found this afterwards: measuring precise time in nanoseconds C++.
Is a millisecond the best one can reliably measure with C++?
EDIT
Adding a bit of context here.
I am working on an algorithm which has 3 parts in it (inside an iterative loop) and also several parameters. Overall, the algorithm takes a "long" time, mostly because there may be a large number of iterations (the loop), but the 3 parts inside the loop may be "fast". It turns out that, with some parameters, each part in the loop may take several seconds down to milliseconds, but with other parameters each part may take several milliseconds down to nanoseconds (the milliseconds count prints 0!). I was trying to understand where time was spent, so I started using chrono more deeply. Unfortunately, it's a one-shot algorithm and I wanted to understand where time is spent "inside" the algorithm, so it's quite difficult/impossible to "loop over each part (itself in a loop...) of the algorithm to get an averaged timing". I thought about that first, but it's not realistic: the problem only shows up when large graphs are passed as inputs, which makes such debugging tricky/intractable, so profiling seemed like the way to understand what's going on and where.
From the answers below, I understand this is not easy, for good reasons. I actually expected that, but I wanted to have external points of view.
At that point, say in a case where the overall time is 3000 ms: when I time each part in the loop and sum the timings across iterations, I get 300 ms! That is 10 times less than 3000. I would have accepted a 1.x factor, but 10 means an order of magnitude lost in space?! So my thought was: "OK, maybe I am adding up many, many 0's because I use millisecond counters, and some parts take less than a millisecond so they are counted as 0" (a sketch of the accumulation pattern I mean is below). For now I am still not sure what's going on: my timings are not consistent (the overall magnitude is very different from the accumulated-parts magnitude). Bugs are still possible, although I wrote the code as carefully as possible. As far as I can tell there is no bug (the generated results match the expected results), but I still don't get why the timings are inconsistent (overall vs parts).
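To illustrate what I mean by "adding up 0's", here is a minimal sketch of the accumulation pattern (part1/part2/part3 are hypothetical stand-ins for the real parts, and the loop count is arbitrary): accumulating each interval in nanosecond ticks instead of casting it to milliseconds keeps the sub-millisecond contributions from being truncated to 0.

#include <chrono>
#include <cstdint>
#include <iostream>

// Hypothetical placeholders for the 3 parts of the real algorithm.
void part1() { /* ... */ }
void part2() { /* ... */ }
void part3() { /* ... */ }

int main() {
    using hr_clock = std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::nanoseconds;

    std::int64_t t1 = 0, t2 = 0, t3 = 0; // accumulated nanoseconds per part

    for (int i = 0; i < 1000; ++i) { // the iterative loop
        auto a = hr_clock::now(); part1();
        auto b = hr_clock::now(); part2();
        auto c = hr_clock::now(); part3();
        auto d = hr_clock::now();

        // Accumulate in nanoseconds: sub-millisecond parts are not rounded down to 0.
        t1 += duration_cast<nanoseconds>(b - a).count();
        t2 += duration_cast<nanoseconds>(c - b).count();
        t3 += duration_cast<nanoseconds>(d - c).count();
    }

    // Convert to milliseconds only at the end, for reporting.
    std::cout << "part1: " << t1 / 1e6 << " ms" << std::endl;
    std::cout << "part2: " << t2 / 1e6 << " ms" << std::endl;
    std::cout << "part3: " << t3 / 1e6 << " ms" << std::endl;
}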
Ideally, I was looking for a way, in all cases (even when measuring nanoseconds), to get a sensible upper-bound estimate (even if not exact, for reasons well explained in the answers here) of the elapsed time spent; up to now, with nanoseconds, I get outer-space timings. I guess there is no way to get that...