I'm trying to time how long clock_gettime(CLOCK_REALTIME,...)
takes to call. "Back in the day" I used to call it once at the top of a loop since it was a fairly expensive call. But now, I was hoping that with vDSO and some clock improvements, it might not be so slow.
I wrote some test code that used __rdtscp
to time repeated calls to clock_gettime
(the rdtscp
calls went around a loop that called clock_gettime
and added the results together, just so the compiler wouldn't optimize too much away).
If I call clock_gettime()
in quick succession, the length of time goes from about 45k clock cycles down to 500 cycles. Some of this I thought could be contributed to the first call having to load the vDSO code (still doesn't full make sense to me), but how it takes a few calls to get 500 I cannot explain at all, and this behavior seems to be constant regardless of how I test it:
42467
1114
1077
496
455
However, if I sleep (for a second or ten, doesn't matter) between calls to clock_gettime, it only reaches a steady state of about 4.7k cycles:
Here at 10 second sleeps:
28293
1093
4729
4756
4736
Here at 1 second sleeps:
61578
855
4753
4741
5645
4753
4732
Cache behavior wouldn't seem to describe this (on a desktop system not doing much of anything). How much should I budget for a call to clock_gettime? Why does it get progressively faster to call? Why does sleeping a small amount of time matter so much?
tl;dr I'm trying to understand the time it takes to call clock_gettime(CLOCK_REALTIME,...)
don't understand why it runs faster when called in quick succession as opposed to a second between calls.
Update: Here's the cpuinfo on proc 0
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0x84
cpu MHz : 2800.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5616.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
Here is the recreated test code:
#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>
// compiled gcc -Wall -O3 -o clockt clockt.cpp
// called glockt sleeptime trials loops
unsigned long long now() {
struct timespec s;
clock_gettime(CLOCK_REALTIME, &s);
return (s.tv_sec * 1000000000ull) + s.tv_nsec;
}
int main(int argc, char **argv) {
int sleeptime = atoi(argv[1]);
int trials = atoi(argv[2]);
int loops = atoi(argv[3]);
unsigned long long x, y, n = 0;
unsigned int d;
x = __rdtscp(&d);
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);
printf("init run %lld\n", (y-x));
for(int t = 0; t < trials; ++t) {
if(sleeptime > 0) sleep(sleeptime);
x = __rdtscp(&d);
for(int l = 0; l < loops; ++l) {
n = now();
asm volatile("": "+r" (n));
}
y = __rdtscp(&d);
printf("trial %d took %lld\n", t, (y-x));
}
exit(0);
}