
I'm trying to time how long clock_gettime(CLOCK_REALTIME,...) takes to call. "Back in the day" I used to call it once at the top of a loop since it was a fairly expensive call. But now, I was hoping that with vDSO and some clock improvements, it might not be so slow.

I wrote some test code that used __rdtscp to time repeated calls to clock_gettime (the rdtscp calls went around a loop that called clock_gettime and added the results together, just so the compiler wouldn't optimize too much away).

If I call clock_gettime() in quick succession, the time taken drops from about 45k clock cycles down to 500 cycles. Some of this I figured could be attributed to the first call having to load the vDSO code (which still doesn't fully make sense to me), but why it takes a few calls to get down to 500 I cannot explain at all, and this behavior seems to be consistent regardless of how I test it:

42467
1114
1077
496
455

However, if I sleep (for a second or ten, doesn't matter) between calls to clock_gettime, it only reaches a steady state of about 4.7k cycles:

Here with 10-second sleeps:

28293
1093
4729
4756
4736

Here with 1-second sleeps:

61578
855
4753
4741
5645
4753
4732

Cache behavior wouldn't seem to explain this (this is a desktop system not doing much of anything). How much should I budget for a call to clock_gettime? Why does it get progressively faster to call? Why does sleeping for a small amount of time matter so much?

tl;dr I'm trying to understand the time it takes to call clock_gettime(CLOCK_REALTIME,...) and don't understand why it runs faster when called in quick succession as opposed to with a second between calls.

Update: Here's the cpuinfo for processor 0:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 158
model name  : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping    : 9
microcode   : 0x84
cpu MHz     : 2800.000
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs        :
bogomips    : 5616.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

Here is the recreated test code:

#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

// compiled with: gcc -Wall -O3 -o clockt clockt.cpp
// run as: ./clockt sleeptime trials loops

unsigned long long now() {
    struct timespec s;
    clock_gettime(CLOCK_REALTIME, &s);
    return (s.tv_sec * 1000000000ull) + s.tv_nsec;
}

int main(int argc, char **argv) {
    int sleeptime = atoi(argv[1]);
    int trials = atoi(argv[2]);
    int loops = atoi(argv[3]);

    unsigned long long x, y, n = 0;
    unsigned int d;


    x = __rdtscp(&d);
    n = now();
    asm volatile("": "+r" (n)); // keep n live so the call isn't optimized away
    y = __rdtscp(&d);

    printf("init run %lld\n", (y-x));

    for(int t = 0; t < trials; ++t) {
        if(sleeptime > 0) sleep(sleeptime);
        x = __rdtscp(&d);
        for(int l = 0; l < loops; ++l) {
            n = now();
            asm volatile("": "+r" (n));
        }
        y = __rdtscp(&d);
        printf("trial %d took %lld\n", t, (y-x));
    }

    exit(0);
}
JasonN
  • You're probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption before you depend on it. And without the test code, there isn't much we can do to investigate. Finally, at the very least one needs a `/proc/cpuinfo` dump for the 1st CPU to ensure an apples-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right? – Kuba hasn't forgotten Monica Nov 11 '18 at 19:08
  • vDSO doesn't guarantee anything either: you'd need to know that the glibc you're using actually uses the vDSO interfaces. – Kuba hasn't forgotten Monica Nov 11 '18 at 19:18
  • @Kuba glibc-2.26-30.fc27 and I've updated the question with cpuinfo. What do you mean by the time units not being the same? These results persist over many runs, and the machine is single socket using rdtsc, so there shouldn't be any measurement issues. – JasonN Nov 11 '18 at 21:17
  • @kuba strace verifies that the clock_gettime calls are not syscalls – JasonN Nov 11 '18 at 23:20
  • Are the results the same if you pin the program to just 1 core (use e.g. the taskset command to launch the test program on just 1 core)? And even a rather idle desktop computer has a lot to do maintaining the network, graphics and desktop stuff, so do not discount the caches being blown away at random times – nos Nov 13 '18 at 20:13
  • @nos just tried it a few times "taskset -c 0 ..." and it doesn't change anything. – JasonN Nov 13 '18 at 21:04
  • You can't time hardware using the hardware you're timing. You'll need to run your loops, count them and time it with something external. – Terry Carmen Nov 13 '18 at 21:29
  • @terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it uses the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong). – JasonN Nov 13 '18 at 23:06

2 Answers


The very first time clock_gettime is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousand cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime will take much more time than later calls, which is what you are observing.
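
If you want to see that soft page fault directly, one option (a minimal sketch, not part of the original test program) is to read the process's minor-fault count with getrusage around the first and second calls. Note that getrusage and printf themselves touch memory, so treat the counts as indicative rather than exact:

#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void) {
    struct rusage before, after;
    struct timespec ts;

    // Minor (soft) page faults taken around the first call.
    getrusage(RUSAGE_SELF, &before);
    clock_gettime(CLOCK_REALTIME, &ts);      // first call
    getrusage(RUSAGE_SELF, &after);
    printf("minor faults around first call : %ld\n", after.ru_minflt - before.ru_minflt);

    // And around the second call, when the code page should already be resident.
    getrusage(RUSAGE_SELF, &before);
    clock_gettime(CLOCK_REALTIME, &ts);      // second call
    getrusage(RUSAGE_SELF, &after);
    printf("minor faults around second call: %ld\n", after.ru_minflt - before.ru_minflt);
    return 0;
}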

The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now and printf, but these functions call other functions and they all compete for the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core, and that thread will execute different instructions and access different data. This explains why clock_gettime takes more time to execute when you sleep.
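
One way to keep printf's code and data out of the timed region (a sketch of a possible modification, not the OP's original program) is to record the per-call cycle counts into an array inside the loop and only print them after the loop finishes:

#include <stdio.h>
#include <time.h>
#include <x86intrin.h>

#define TRIALS 32

int main(void) {
    unsigned long long samples[TRIALS];
    unsigned int aux;
    struct timespec ts;

    for (int i = 0; i < TRIALS; ++i) {
        unsigned long long start = __rdtscp(&aux);
        clock_gettime(CLOCK_REALTIME, &ts);
        asm volatile("" ::: "memory");       // compiler barrier around the timed call
        samples[i] = __rdtscp(&aux) - start;
        // no printf inside the loop: its code and data stay out of the timed region
    }

    for (int i = 0; i < TRIALS; ++i)
        printf("call %d: %llu cycles\n", i, samples[i]);
    return 0;
}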

Now regarding the second and later measurements. From the question:

42467
1114
1077
496
455

I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now and printf contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible, so it's inherent in the functions themselves. Note that the execution time of the rdtscp instruction itself may vary by about 4 cycles. See also this.

In practice, clock_gettime is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.
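
If the per-call cost matters, the workaround the question alludes to still applies: read the clock once per batch of work instead of on every item. A rough illustration (the batch size of 1024 and the loop count are arbitrary):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);

    for (int i = 0; i < 100000; ++i) {
        // Refresh the cached timestamp only once per 1024 iterations,
        // amortizing the cost of clock_gettime across the batch.
        if ((i & 1023) == 0)
            clock_gettime(CLOCK_REALTIME, &now);
        // ... handle item i, using `now` as its (approximate) timestamp ...
    }

    printf("last timestamp: %ld.%09ld\n", (long)now.tv_sec, now.tv_nsec);
    return 0;
}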

Hadi Brais
  • thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger." – JasonN Nov 18 '18 at 23:23
  • if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, I should see the time go down to that ~500 cycle level. I'll test it in the next couple of days, thanks. – JasonN Nov 18 '18 at 23:26
  • @JasonN This explains the impact of calling `sleep` and passing an argument larger than zero seconds. It takes more time to call `clock_gettime` (about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter. – Hadi Brais Nov 18 '18 at 23:26
  • Well, perf does show an increase from 22k to 50k L1I cache misses when sleep goes from 0 to 1, along with a few other tests that seem to show that the `sleep 1` is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been using `perf -e L1-icache-load-misses`. That would mean that about 4.5k cycles is the time it would take to call a vDSO function during regular operation. ugh. – JasonN Nov 20 '18 at 16:10
  • @JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that contain `now` and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses. – Hadi Brais Nov 20 '18 at 18:48
  • I'm tasksetting it, so it is running on the same core, but that another process is getting the cpu and causing L1I evictions seems very plausible. – JasonN Nov 20 '18 at 20:10
  • Any time you give up the CPU for seconds, you can probably assume there is a very good chance that you are returning to a "cold cache". You could test the difference between a true OS sleep (lets other processes be scheduled) vs a hot sleep that just loops incrementing a variable or something, where the OS will be _less_ likely to schedule your process out, to isolate the difference. `perf stat` also reports the context switches, which is very handy, and `chrt -rr 99` will run your process at a very high priority such that it probably won't be preemptively swapped out. – BeeOnRope Nov 27 '18 at 17:45

I could not reproduce your results. Even with large sleep times (10 seconds) and a small number of loops (100), I always get timings of less than 100 clocks (less than 38 ns on my 2.6 GHz system).

For example:

./clockt 10 10 100
init run 14896
trial 0 took 8870 (88 cycles per call)
trial 1 took 8316 (83 cycles per call)
trial 2 took 8384 (83 cycles per call)
trial 3 took 8796 (87 cycles per call)
trial 4 took 9424 (94 cycles per call)
trial 5 took 9054 (90 cycles per call)
trial 6 took 8394 (83 cycles per call)
trial 7 took 8346 (83 cycles per call)
trial 8 took 8868 (88 cycles per call)
trial 9 took 8930 (89 cycles per call)

Setting aside measurement or user error (always the most likely cause), the most likely explanation is that your system is not using rdtsc as a time source, so a system call is made. You can configure the clock source explicitly yourself; otherwise a heuristic is used which will select the rdtsc-based clock_gettime only if it seems suitable on the current system.
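
One quick way to check is to read the kernel's current clocksource from sysfs, shown here as a small C sketch (the path is the standard Linux location); if it reports something other than tsc, the vDSO fast path is typically unavailable and clock_gettime falls back to a system call:

#include <stdio.h>

// Print the kernel's current clocksource.
int main(void) {
    char buf[64] = {0};
    FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
    if (!f) { perror("fopen"); return 1; }
    if (fgets(buf, sizeof buf, f))
        printf("current clocksource: %s", buf);
    fclose(f);
    return 0;
}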

The second most likely reason is that clock_gettime(CLOCK_REALTIME) doesn't go through the vDSO on your system, so it is a system call even if rdtsc is ultimately being used. I guess that could be due to an old libc version or something like that.
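
You can test for this by comparing the libc wrapper against an explicit system call, since syscall(SYS_clock_gettime, ...) always takes the slow path. A sketch (the iteration count is arbitrary); if both sides report similar per-call costs, the wrapper is probably not using the vDSO:

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <x86intrin.h>

#define N 100000

int main(void) {
    struct timespec ts;
    unsigned int aux;
    unsigned long long t0, t1, t2;

    t0 = __rdtscp(&aux);
    for (int i = 0; i < N; ++i)
        clock_gettime(CLOCK_REALTIME, &ts);               // libc wrapper (vDSO if available)
    t1 = __rdtscp(&aux);

    for (int i = 0; i < N; ++i)
        syscall(SYS_clock_gettime, CLOCK_REALTIME, &ts);  // forced system call
    t2 = __rdtscp(&aux);

    printf("libc wrapper: %llu cycles/call\n", (t1 - t0) / N);
    printf("raw syscall : %llu cycles/call\n", (t2 - t1) / N);
    return 0;
}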

The third most likely reason is that rdtsc on your system is slow, perhaps because it is virtualized or disabled, and implemented via a VM exit or OS trap.
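
To rule that out, you can time back-to-back __rdtscp reads themselves: on bare metal the minimum gap is typically a few tens of cycles, whereas a trapping or virtualized TSC read would be far more expensive. A minimal sketch:

#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    unsigned int aux;
    unsigned long long min_delta = ~0ull;

    // Find the smallest gap between two back-to-back rdtscp reads.
    for (int i = 0; i < 1000; ++i) {
        unsigned long long a = __rdtscp(&aux);
        unsigned long long b = __rdtscp(&aux);
        if (b - a < min_delta)
            min_delta = b - a;
    }
    printf("min back-to-back rdtscp delta: %llu cycles\n", min_delta);
    return 0;
}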

Single Loop Results

Trying with a single clock_gettime call per loop, I still get "fast" results after the first few trials. For example, ./clockt 0 20 1 gives:

init run 15932
trial 0 took 352 (352 cycles per call)
trial 1 took 196 (196 cycles per call)
trial 2 took 104 (104 cycles per call)
trial 3 took 104 (104 cycles per call)
trial 4 took 104 (104 cycles per call)
trial 5 took 104 (104 cycles per call)
trial 6 took 102 (102 cycles per call)
trial 7 took 104 (104 cycles per call)
...

Note that I made one modification to the test program, to print the time per call, which seems more useful than the total time. The printf line was modified to:

printf("trial %d took %lld (%lld cycles per call)\n", t, (y-x), (y-x)/loops);
BeeOnRope
  • When you average across multiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is `4700 + 99*50`, which averages about 96.5 cycles per call, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double-checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions. – JasonN Nov 26 '18 at 00:03
  • @JasonN - yes my answer wasn't so much based on the transient behavior, but the long-run behavior, which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles", so I took 500 cycles (more than 100 ns) to be your "long-run, frequent calls" measurement, which is very different from mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc. is mostly an entirely different matter from the best-case time, and Hadi had some good ideas there. – BeeOnRope Nov 27 '18 at 17:40
  • You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy. – BeeOnRope Nov 27 '18 at 17:41
  • You can reproduce the pattern by passing a value of 1 to the third argument `loops`. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case where `loops` is 1. But note that `init run` still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is always reproducible. – Hadi Brais Nov 27 '18 at 20:15
  • @HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includes `rdtsc` overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call. – BeeOnRope Nov 27 '18 at 21:34
  • The absolute numbers are also different on my system (I don't remember now what they were) but they were smaller than what the OP shows. But notice how they may vary by more than 2X. This applies to your numbers and the OP's numbers. IIRC, the time per call correlates well with `ICACHE.IFETCH_STALL`. Interrupts and thread scheduling are the probable causes for these variations on the same system and across our different systems. Also I suspect the OP didn't fix the core frequency on their system so it would be difficult to make an absolute comparison. – Hadi Brais Nov 27 '18 at 21:51
  • If you divide by `loops`, with sleeping, you'd be amortizing the high cost of the first call across all calls in the inner loop, so you'll lose the pattern anyway especially when `loops` is large. I don't know why the OP has this inner loop in the first place, it doesn't seem to be necessary for the question. – Hadi Brais Nov 27 '18 at 21:54
  • @HadiBrais - sure but, I'm not sleeping. There are really two separate questions here: "what's up with the transient behavior of `clock_gettime` when you launch the process and/or after some type of sleep" and "what is the typical performance of `clock_gettime` when you call it a lot". I'm really looking at the second part (you looked more at the first part), because the OP mentioned that he only gets to 500 cycles after some iterations, which seems slow (even with 1 sec sleep) and because I came here from HN where the OP asked me how I got such "fast" timings (25ns) for gettime. – BeeOnRope Nov 27 '18 at 22:04
  • Yes, without sleeping I think dividing by `loops` is statistically better. Oh, now I understand. Yes, 500 cycles is too high compared to my Haswell system as well. You can try fixing your core frequency at some small value like 1.0GHz and see if it takes around 500 cycles. – Hadi Brais Nov 27 '18 at 22:11
  • Also I'm using Ubuntu 16.04 and gcc 5.4.0 but the OP is using Fedora 27. I don't know whether the implementation of `clock_gettime` is different on these two systems. What about you? – Hadi Brais Nov 27 '18 at 22:13
  • @HadiBrais - I don't know if it's much different, but I wouldn't expect it as the OP has already confirmed he's using the VDSO - but he's seeing almost 5x slower performance (or even more if the 500 cycle comment applied to a zero sleep case), so I don't imagine small differences could explain that. – BeeOnRope Nov 27 '18 at 22:22