
I'm using CentOS Linux release 7.3.1611 on Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz

During tests of my userspace application, I have noticed that clock_gettime(CLOCK_MONOTONIC, &ts) may take up to 5-6 microseconds instead of the ~23 nanoseconds it takes on average. It happens only about once per 10,000 consecutive calls, but it does happen.

If there were no VDSO, this could be explained by the syscall overhead. However, the VDSO is used for every clock_gettime call (I verified this with strace).

It makes no difference whether the thread is pinned to a particular CPU core, or whether that core is isolated from the OS. In other words, the test app may run on an exclusive CPU core and the lag still appears!

I'm measuring the latency by comparing the results of two consecutive clock_gettime calls, like this:

#include <time.h>

#define NANO_SECONDS_IN_SEC 1000000000ULL

unsigned long long __gettimeLatencyNs() {
    struct timespec t1_ts;
    struct timespec t2_ts;
    clock_gettime(CLOCK_MONOTONIC, &t1_ts);
    clock_gettime(CLOCK_MONOTONIC, &t2_ts);
    return ((t2_ts.tv_sec - t1_ts.tv_sec) * NANO_SECONDS_IN_SEC
            + t2_ts.tv_nsec - t1_ts.tv_nsec);
}

Could anyone share some ideas about what could be wrong here?

Konstantin Utkin

2 Answers


Let's look at the kernel's VDSO source for clock_gettime (this is the CLOCK_REALTIME path; the do_monotonic path is structured the same way):

/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
notrace static int __always_inline do_realtime(struct timespec *ts)
{
    unsigned long seq;
    u64 ns;
    int mode;

    do {
        seq = gtod_read_begin(gtod);
        mode = gtod->vclock_mode;
        ts->tv_sec = gtod->wall_time_sec;
        ns = gtod->wall_time_snsec;
        ns += vgetsns(&mode);
        ns >>= gtod->shift;
    } while (unlikely(gtod_read_retry(gtod, seq)));

    ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
    ts->tv_nsec = ns;

    return mode;
}

What we see here is that the code runs inside a loop, whose retry condition is annotated as unlikely. The reason is that this code reads shared memory (the gtod data) that the kernel occasionally updates, and while an update is in progress the reader must loop until it completes.

The most likely answer to your question, then, is that every so often you catch clock_gettime while the corresponding kernel code is updating its structures. When that happens, the code runs significantly slower.

Shachar Shemesh
    It seems we can hit this wait "while the kernel is updating something" outside of a synthetic test like this one, too. And it is rather strange that the kernel would need a few microseconds just for a small memory update... – Konstantin Utkin Aug 25 '17 at 02:53

I don't think it's the logic of the clock_gettime call itself that periodically takes longer, but rather that your timing loop is periodically being interrupted, and the extra time shows up as an extra-long interval.

That is, any type of timing loop is subject to being interrupted by external events, such as interrupts. For example, except under a very specific tickless kernel configuration (not the default), your application will be interrupted periodically by the scheduler tick, which does a bit of processing to decide whether another process should run. Even if no other process ultimately ends up running, this can easily account for a few microseconds.

In addition, the hardware may temporarily pause for a variety of reasons, such as the frequency transitions that occur when other cores enter or leave the idle state. I have measured these transitions at around 8 microseconds, close to the value you report. During such a pause the CPU isn't executing instructions, but the TSC keeps running, so the pause shows up as an extra-long interval.

Beyond that, there are a ton of reasons why you might observe outlier timings, along with ways to narrow down the possibilities if it interests you.

Finally, the other answer suggests that clock_gettime itself may block while the kernel updates the data structure. While that's certainly possible, I think it's less likely than the other causes. To check, you could copy the VDSO code, modify it to record whether any blocking actually happened, and call that version to see whether your pauses correlate with blocking. I would guess not.

BeeOnRope