
I'm trying to measure the execution time of some commands in C++ by using the physical clock, but I have run into a problem: the process of reading off the measurement from the physical clock on the computer can itself take a long time. Here is the code:

#include <string>
#include <cstdlib>
#include <cstdint>
#include <iostream>
#include <math.h>
#include <time.h>

int main()
{
      int64_t mtime, mtime2, m_TSsum, m_TSssum, m_TSnum, m_TSmax;
      struct timespec t0;
      struct timespec t1;
      int i, j;
      for (j = 0; j < 10; j++) {
            m_TSnum = 0; m_TSsum = 0; m_TSssum = 0; m_TSmax = 0;
            for (i = 0; i < 10000000; i++) {
                  clock_gettime(CLOCK_REALTIME, &t0);
                  clock_gettime(CLOCK_REALTIME, &t1);
                  mtime  = (t0.tv_sec * 1000000000LL + t0.tv_nsec);
                  mtime2 = (t1.tv_sec * 1000000000LL + t1.tv_nsec);

                  m_TSsum  += (mtime2 - mtime);
                  m_TSssum += (mtime2 - mtime) * (mtime2 - mtime);
                  if ((mtime2 - mtime) > m_TSmax) { m_TSmax = (mtime2 - mtime); }
                  m_TSnum++;
            }
            std::cout << "Average " << (double)(m_TSsum) / m_TSnum
                  << " +/- " << floor(sqrt((m_TSssum / m_TSnum - (m_TSsum / m_TSnum) * (m_TSsum / m_TSnum))))
                  << " (" << m_TSmax << ")" << std::endl;
      }
}

Next I run it on a dedicated core (or so the sysadmin tells me), to avoid any issues with the process being moved to the background by the scheduler:

$ taskset -c 20 ./a.out

and this is the result I get:

Average 18.0864 +/- 10 (17821)
Average 18.0807 +/- 8 (9116)
Average 18.0802 +/- 8 (8107)
Average 18.078 +/- 6 (7135)
Average 18.0834 +/- 9 (21240)
Average 18.0827 +/- 8 (7900)
Average 18.0822 +/- 8 (9079)
Average 18.086 +/- 8 (8840)
Average 18.0771 +/- 6 (5992)
Average 18.0894 +/- 10 (15625)

So clearly it takes about 18 nanoseconds (on this particular server) to call clock_gettime(), but what I can't understand is why the "max" time seems to be between 300 and 1000 times longer?

If we assume that the core is truly dedicated to this process and not used by something else (which may or may not be true; when not running on a dedicated core, the average time is the same, but the sd/max are somewhat bigger), what else could cause these "slowdowns" (for lack of a better name)?

– Bojan
  • If you have access to C++11 you might want to consider using [`<chrono>`](http://en.cppreference.com/w/cpp/chrono) instead of `time.h` – NathanOliver Mar 09 '18 at 19:23
  • Explore `std::chrono`. – Ron Mar 09 '18 at 19:23
  • Read up on (and use) [std::chrono](http://en.cppreference.com/w/cpp/chrono). – Jesper Juhl Mar 09 '18 at 19:24
  • Dedicated core doesn't mean there aren't OS interrupts being handled by the same core. For nanosecond precision you need to look into RTOS. – rustyx Mar 09 '18 at 20:06
  • `std::chrono` isn't going to do magic - under the covers it will just delegate to `clock_gettime` or another similar call. – BeeOnRope Mar 09 '18 at 20:36
  • What type of hardware (CPU, mostly) are you running this on? – BeeOnRope Mar 09 '18 at 21:48
  • FWIW, I removed your reference to reading the "physical clock" which is probably confusing. You can be sure that no "physical clock" (if you mean some type of uncore/offcore realtime clock) is being read in _18 nanoseconds_. The fast-path for functions like `clock_gettime` is something like a read of the core-local "timestamp counter" using `rdtsc`, entirely in userspace, plus some adjustments to account for per-core offsets and to turn the cycle counter into realtime. Reading real clocks takes thousands of cycles or more. Your question still stands with this change though. – BeeOnRope Mar 09 '18 at 22:08
  • Thanks @BeeOnRope for clarifying the question; as for the hardware, this is what I see in "dmesg" output: smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz (fam: 06, model: 3f, stepping: 02) – Bojan Mar 10 '18 at 02:20
  • You can't be serious @NathanOliver. Under the hood `std::chrono` uses clock_gettime(). – Wojciech Kudla Nov 30 '20 at 18:34

3 Answers


Why Outliers?

There are many software and hardware related reasons why you might see outlier events (and non-outlier variation) when you iterate 10 million times over two clock_gettime calls. These reasons include:

  • Context switches: the scheduler may decide to migrate your process between CPUs, and even if you pin your process to a CPU, the OS may periodically decide to run something else on your logical CPU.
  • SMT: assuming this is on a CPU with SMT (e.g., hyperthreading on x86) the scheduler will probably periodically schedule something on the sibling core (same physical core as your process). This can dramatically affect the overall performance of your code since two threads are competing for the same core resources. Furthermore, there is probably a transition period between SMT and non-SMT execution where nothing executes since the core has to re-partition some resources when SMT execution begins.
  • Interrupts: A typical system will receive hundreds of interrupts per second at a minimum, from the network card, graphics devices, hardware clocks, system timers, audio devices, IO devices, cross-CPU IPIs, and so on. Try a watch -n1 cat /proc/interrupts and see how much action is occurring on what you might think is an otherwise idle system.
  • Hardware pauses: the CPU itself may periodically stop executing instructions for a variety of reasons such as power or thermal throttling, or just because the CPU is undergoing a frequency transition.
  • System Management Mode: totally apart from interrupts seen and handled by the OS, x86 CPUs have a type of "hidden interrupt" which allows SMM functionality to execute on your CPU, with the only apparent effect being periodic unexpected jumps in cycle counters used to measure real time.
  • Normal performance variations: your code won't execute in exactly the same way every time. Initial iterations will suffer data and instruction cache misses, and have untrained predictors for things like branch direction. Even in an apparent "steady state" you may still suffer performance variations from things beyond your control.
  • Different code paths: you might expect your loop to execute exactly the same instructions every time through1: after all, nothing is really changing, right? Well if you dig into the internals of clock_gettime you may very well find some branches that take a different path when some type of overflow occurs, or when reading from the adjustment factors in the VDSO races with an update, etc.

That's not even a comprehensive list, but it should at least give you a taste of some of the factors that can cause outliers. You can eliminate or reduce the effect of some of these, but complete control is generally impossible on a modern non-realtime2 OS on x86.

My Guess

If I had to take a guess, based on a typical outlier of ~8000 ns, which is probably too small for a context switch interruption, you are probably seeing the effect of processor frequency scaling due to variable TurboBoost ratios. That's a mouthful, but basically modern x86 chips run at different "max turbo" speeds depending on how many cores are active. My i7-6700HQ, for example, will run at 3.5 GHz if one core is active, but only 3.3, 3.2 or 3.1 GHz if 2, 3 or 4 cores are active, respectively.

This means that even if your process is never interrupted, any work at all which runs even briefly on another CPU may cause a frequency transition (e.g., because you transition from 1 to 2 active cores), and during such a transition the CPU is idled for thousands of cycles while voltages stabilize. You can find some detailed numbers and tests in this answer but the upshot is that on the tested CPU the stabilization takes roughly 20,000 cycles, very much in line with your observed outliers of ~8000 nanoseconds. Sometimes you might get two transitions in a period which doubles the impact, and so on.
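As a rough sanity check, assuming your E5-2697 v3 sits near its 2.6 GHz nominal clock around such a transition: 20,000 cycles / 2.6 GHz ≈ 7,700 ns, which is right around the ~8000 ns outliers you are seeing.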

Narrow It Down

Get a Distribution

If you still want to know the cause of your outliers, you can take the following steps and observe the effect on the outlier behavior.

First, you should collect more data. Rather than just recording the max over 10,000,000 iterations, you should collect a histogram with some reasonable bucket size (say 100 ns, or even better some type of geometric bucket size that gives higher resolution for shorter times). This will be a huge help because you'll be able to see exactly where the times are clustering: it is entirely possible that you have other effects besides the 6000 - 17000 ns outliers that you note with "max", and they can have different causes.

A histogram also lets you understand the outlier frequency, which you can correlate with frequencies of things you can measure to see if they match up.

Now adding the histogram code also potentially adds more variance to the timing loop, since (for example) you'll be accessing different cache lines depending on the timing value, but this is manageable, especially because the recording of the time happens outside the "timed region".
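To make that concrete, here is a minimal sketch of such a geometric histogram wrapped around the two clock_gettime calls from the question; the power-of-two bucket layout and the GCC/Clang-specific __builtin_clzll builtin are just illustrative choices, not the only way to do it:

#include <cstdint>
#include <cstdio>
#include <time.h>

// Sketch only: bucket b counts deltas in the range [2^b, 2^(b+1)) nanoseconds.
static const int kBuckets = 40;
static uint64_t buckets[kBuckets];

int main()
{
    struct timespec t0, t1;
    for (int i = 0; i < 10000000; i++) {
        clock_gettime(CLOCK_REALTIME, &t0);
        clock_gettime(CLOCK_REALTIME, &t1);
        int64_t delta = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                      + (t1.tv_nsec - t0.tv_nsec);
        if (delta < 1) delta = 1;                       // guard for the log2 below
        int b = 63 - __builtin_clzll((uint64_t)delta);  // floor(log2(delta))
        if (b >= kBuckets) b = kBuckets - 1;
        buckets[b]++;                                   // recording is outside the timed region
    }
    for (int b = 0; b < kBuckets; b++) {
        if (buckets[b] != 0) {
            std::printf("[%12llu ns, %12llu ns): %llu\n",
                        1ULL << b, 1ULL << (b + 1),
                        (unsigned long long)buckets[b]);
        }
    }
}

Printing only the non-empty buckets keeps the output short and makes it easy to see whether the outliers form one cluster (one likely cause) or several (several causes).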

Issue Specific Mitigations

With that in hand, you can try to systematically check the issues I mentioned above to see if they are the cause. Here are some ideas:

  1. Hyperthreading: Just turn it off in the BIOS while running single-threaded benchmarks, which eliminates that whole class of issues in one move. In general, I've found that this also leads to a giant reduction in fine-grained benchmark variance, so it's a good first step.

  2. Frequency scaling: On Linux, you can usually disable sub-nominal frequency scaling by setting the cpufreq governor to performance. You can disable super-nominal (aka turbo) scaling by setting /sys/devices/system/cpu/intel_pstate/no_turbo to 1 if you're using the intel_pstate driver. You can also manipulate the turbo mode directly via MSR if you have another driver, or you can do it in the BIOS if all else fails. In the linked question the outliers basically disappear when turbo is disabled, so that's something to try first.

     Assuming you actually want to keep using turbo in production, you can limit the max turbo ratio manually to some value that applies to N cores (e.g., 2 cores), and then offline the other CPUs so at most that number of cores will ever be active. Then you'll be able to run at your new max turbo all the time no matter how many cores are active (of course, you might still be subject to power, current or thermal limits in some cases).

  3. Interrupts: you can search for "interrupt affinity" to try to move interrupts to/from your pinned core and see the effect on the outlier distribution. You can also count the number of interrupts (e.g., via /proc/interrupts) and see if the count is enough to explain the outlier count. If you find that timer interrupts specifically are the cause, you can explore the various "tickless" (aka "NOHZ") modes your kernel offers to reduce or eliminate them. You can also count them directly via the HW_INTERRUPTS.RECEIVED performance counter on x86.

  4. Context switches: you can use realtime priorities or isolcpus to prevent other processes from running on your CPU (see the sketch after this list). Keep in mind that context switch issues, while usually positioned as the main/only issue, are actually fairly rare: at most they generally happen at the HZ rate (often 250/second on modern kernels) - but it will be rare on a mostly idle system that the scheduler would actually decide to schedule another process on your busy CPU. If you make your benchmark loops short, you can generally almost entirely avoid context switches.

  5. Code related performance variations: you can check if this is happening with various profiling tools like perf. You can carefully design the core of your packet handling code to avoid outlier events like cache misses, e.g., by pre-touching cache lines, and you could avoid the use of system calls with unknown complexity as much as possible.
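As an illustration of point 4, here is a minimal sketch of pinning the calling thread to one CPU and requesting a realtime priority from inside the process. CPU 20 mirrors the taskset command in the question, the priority value 50 is an arbitrary example, and SCHED_FIFO requires root or CAP_SYS_NICE:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for CPU_ZERO/CPU_SET with glibc
#endif
#include <sched.h>
#include <cstdio>
#include <cstring>

int main()
{
    // Pin the calling thread to CPU 20 (what `taskset -c 20` does externally).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(20, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        std::perror("sched_setaffinity");

    // Switch to a realtime scheduling class so normal tasks cannot preempt us.
    struct sched_param sp;
    std::memset(&sp, 0, sizeof(sp));
    sp.sched_priority = 50;    // example value; valid range is 1-99 for SCHED_FIFO
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        std::perror("sched_setscheduler");

    // ... run the timing loop from the question here ...
}

Note that a SCHED_FIFO task that never blocks can starve other work on that CPU, so keep the realtime section short or combine it with isolcpus rather than relying on priority alone.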

While some of the above are purely for investigative purposes, many of them will both help you determine what's causing the pauses and also mitigate them.

I'm not aware of mitigations for every issue, however - for stuff like SMM you'd perhaps need specialized hardware or a BIOS change to avoid it.


1 Well except perhaps in the case that the if( (mtime2-mtime)> m_TSmax ) condition is triggered - but this should be rare (and perhaps your compiler has made it branch-free, in which case there is only one execution path).

2 It's not actually clear you can get to "zero variance" even with a hard realtime OS: some x86-specific factors like SMM mode and DVFS related stalls seem unavoidable.

– BeeOnRope
  • Thanks @BeeOnRope for the very detailed explanation. I'll add some code to put times into buckets and draw a histogram out of it. Hopefully that will shed some extra light on the problem. I don't really care if I get "zero variance" or not; as long as the worst case scenario is reasonable (say under 100 nanoseconds), I'd be more or less satisfied. – Bojan Mar 10 '18 at 02:14
  • This whole thing started as I tried to figure out why I'm seeing packets dropped from a multicast UDP feed; there are occasional data bursts that come at the rate of about 400000 per second, implying that I have to process them under 2.5 microseconds to avoid data queuing in the buffer. By measuring time and optimizing the code somewhat, I have reduced the average time to under 1 microsecond, but I still see packets dropped from time to time, and I'm trying to figure out what's causing it... – Bojan Mar 10 '18 at 02:16
  • Very detailed and to the point explanation. +1. @Bojan I personally don't find it reasonable to expect 100 ns as a worst-case delay (especially on a non-realtime OS). It is always better to avoid that kind of assumption when designing algorithms. What is wrong with data queuing in the buffer? (Why do you need to avoid it?) – Serge Dundich Mar 10 '18 at 14:13
  • @Serge there's nothing wrong with data queuing in the buffer (although obviously you want to avoid it if possible), the problem is that the buffer gets overfilled and the system starts dropping packets, causing data loss... – Bojan Mar 10 '18 at 16:53
  • @Bojan - worst case response time of 100 ns is going to be very difficult to achieve and probably require specialized hardware and software (e.g., user-mode networking stack). Consider that a single miss to DRAM is usually in the range of 100 ns, and that with the Meltdown and Spectre patches a single kernel call is perhaps 300: so if you need a user-kernel transition per packet you'll never make that deadline. The queuing is there for a reason - it is _not_ at all obvious "that you'd want to avoid it" - queuing packets not only helps you avoid drops on small pauses as you are seeing ... – BeeOnRope Mar 10 '18 at 19:19
  • ... but also often makes the whole processing pipeline more efficient since you can handle things in batches, reducing user - kernel transitions, amortizing the costs of various operations, etc. So what you really need is at least an average processing time of 2.5 us, but also to characterize the pauses and see if your buffers/queues are large enough to avoid the hiccups. As per my list above, many of the sources of hiccups can be eliminated or reduced, as well. – BeeOnRope Mar 10 '18 at 19:21

The taskset command defines the affinity of YOUR process, which means that YOUR process is restricted to run on the CPU cores specified. It doesn't restrict other processes in any way, which means that any of them can preempt your process at any moment (as all of them are allowed to run on the CPU core that you chose for your process). So your maximum time-reading intervals (those 5-25 usec) may represent the running time of other processes or interrupts on your CPU, plus the context switch time. Besides, you use CLOCK_REALTIME, which may be subject to NTP corrections etc. To measure time intervals you should use CLOCK_MONOTONIC (or the Linux-specific CLOCK_MONOTONIC_RAW).
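For example, a minimal sketch of the same back-to-back measurement using CLOCK_MONOTONIC - only the clock id changes relative to the code in the question:

#include <cstdint>
#include <cstdio>
#include <time.h>

int main()
{
    struct timespec t0, t1;
    // CLOCK_MONOTONIC never jumps when the wall clock is stepped, unlike CLOCK_REALTIME;
    // CLOCK_MONOTONIC_RAW additionally ignores NTP rate (slew) adjustments.
    clock_gettime(CLOCK_MONOTONIC, &t0);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    int64_t delta_ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                     + (t1.tv_nsec - t0.tv_nsec);
    std::printf("back-to-back clock_gettime: %lld ns\n", (long long)delta_ns);
}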

– Serge Dundich
  • Thanks Serge. I have tried CLOCK_MONOTONIC and all other variants, and the results are the same (as they are with rdtsc and std::chrono). The kernel scheduler is set up in a way that the particular core that I use is never assigned to any process (unless you manually assign it using taskset or something similar), so if what the sysadmins tell me is really true, this core should never try to switch to a different process... – Bojan Mar 10 '18 at 01:54
  • @Bojan Even if admin configured the scheduler so that it never includes your CPU core by default for the new processes from the very start of the system (which doesn't seem very likely but possibly true) there is still a possibility that time reading mechanism itself requires periodic synchronization of per-CPU timer (rdtsc) with some HW clock to update correction parameters - that would imply receiving periodic timer interrupts on each core. – Serge Dundich Mar 10 '18 at 11:45

This is so much easier in modern C++:

#include <chrono>
auto start = std::chrono::steady_clock::now();
.....
auto stop = std::chrono::steady_clock::now();
auto duration = stop - start;
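
For example, to get that interval as a nanosecond count (comparable to the numbers printed in the question), one option is:

#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::steady_clock::now();
    // ... code being timed ...
    auto stop = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::cout << ns << " ns\n";
}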

18 nanoseconds is quite quick for a non-realtime operating system. Do you really need to measure something more accurately than that? According to my calculations 18ns is only 72 clock cycles on a 4GHz CPU.

– Alan Birtles
  • I don't think that the author complains about the 18 nsec average. I think that the 21 usec maximum is what is supposedly unexpected here (not really). And actually `std::chrono` likely uses `clock_gettime` internally anyway (on UNIX-based systems) so it won't be any different. But `std::chrono::steady_clock` will likely use `CLOCK_MONOTONIC`, which is better than the author's choice of `CLOCK_REALTIME` (which is probably used in `std::chrono::system_clock`). – Serge Dundich Mar 09 '18 at 20:33