24

I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.

Previously our implementation used inline assembly and the rdtsc instruction to read the CPU's timestamp counter directly, but this is problematic and requires frequent recalibration.

So I tried using the clock_gettime function instead to query CLOCK_PROCESS_CPUTIME_ID. The docs allege this gives me nanosecond timing, but I found that the overhead of a single call to clock_gettime() was over 250ns. That makes it impossible to time events 100ns long, and having such high overhead on the timer function seriously drags down app performance, distorting the profiles beyond value. (We have hundreds of thousands of profiling nodes per second.)

Is there a way to call clock_gettime() that has less than ¼μs overhead? Or is there some other way that I can reliably get the timestamp counter with <25ns overhead? Or am I stuck with using rdtsc?

Below is the code I used to time clock_gettime().

// calls gettimeofday() to return wall-clock time in seconds:
extern double Get_FloatTime();
enum { TESTRUNS = 1024*1024*4 };

// time the high-frequency timer against the wall clock
{
    double fa = Get_FloatTime();
    timespec spec; 
    clock_getres( CLOCK_PROCESS_CPUTIME_ID, &spec );
    printf("CLOCK_PROCESS_CPUTIME_ID resolution: %ld sec %ld nano\n", 
            spec.tv_sec, spec.tv_nsec );
    for ( int i = 0 ; i < TESTRUNS ; ++ i )
    {
        clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &spec );
    }
    double fb = Get_FloatTime();
    printf( "clock_gettime %d iterations : %.6f msec %.3f microsec / call\n",
        TESTRUNS, ( fb - fa ) * 1000.0, (( fb - fa ) * 1000000.0) / TESTRUNS );
}
// and so on for CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_THREAD_CPUTIME_ID.

Results:

CLOCK_PROCESS_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 3115.784947 msec 0.371 microsec / call
CLOCK_MONOTONIC resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2505.122119 msec 0.299 microsec / call
CLOCK_REALTIME resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2456.186031 msec 0.293 microsec / call
CLOCK_THREAD_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2956.633930 msec 0.352 microsec / call

This is on a standard Ubuntu kernel. The app is a port of a Windows app (where our rdtsc inline assembly works just fine).

Addendum:

Does x86-64 GCC have some intrinsic equivalent to __rdtsc(), so I can at least avoid inline assembly?

Crashworks
  • 40,496
  • 12
  • 101
  • 170
  • The answers to this question may be helpful to you: http://stackoverflow.com/questions/638269/timer-to-find-elapsed-time-in-a-function-call-in-c – Steven T. Snyder Oct 28 '11 at 22:36
  • @Crash: [My sympathies](http://stackoverflow.com/questions/7916985/what-is-boilerplate-code-hot-code-and-hot-spots/7923574#7923574) :) Wanna have a bake-off some time? Who can speed up some code the most? – Mike Dunlavey Oct 28 '11 at 23:26
  • @Mike I wish! Right now I'm in more of a "we need to speed up this code by 20% or we're totally screwed" kind of situation. And looking at the function list in a sampling profiler, there's not a one over 2% of the main loop. (I tried your stopwatch-and-debugger-break trick and got twenty different callstacks from twenty different pauses.) – Crashworks Nov 02 '11 at 02:54
  • @Crash: I'm sure you did. What I do is look at each sample and just explain to myself (make a description, on paper or in my head) what the program was doing at that time and why it was doing it. That means paying attention to the source code at each level of the stack. (It might also mean looking at other state information, like related variables.) If there's something that doesn't strictly *have* to be done, and if you see a similar thing on >1 sample, go fix it and get your speedup. Your code could be really tight, but if there's anything to be squeezed out, this should find it. – Mike Dunlavey Nov 02 '11 at 14:29
  • @Crash: Example, bear with me. I often find samples in data structure code, like indexing, incrementing iterators, or testing for end conditions. I could see this on different lines of code in different routines, so no line of code or routine rises to a significant percent. Even just one of those things, like indexing or incrementing, might not rise to a significant percent. But *taken together*, they could. Often plain old arrays, while maybe less orthodox, can save all that time. – Mike Dunlavey Nov 02 '11 at 17:07
  • @Mike I can see from the call graph pretty much what the program is doing -- about half the time is in AI, but each frame the AI does a thousand little things that are microseconds each but add up to a huge chunk. It's just a case where we've got lots of little inefficiencies spread out over hundreds of functions, and need to go through and tighten each one. – Crashworks Nov 02 '11 at 20:05
  • @Mike After being shipped in ten titles across four platforms, our standard container data structures are optimized about as much as they can be! They're all inlined and some of their operations are compiler intrinsics anyway, so they tend not to show up in call stacks other than as parts of other functions. – Crashworks Nov 02 '11 at 20:57
  • @Crash: Well, like I said, we could do a bake-off. I've never seen code that couldn't be made faster, at least a little, unless it was totally limited by something like I/O or calling a system call unavoidably. Maybe yours is the first, but I doubt it. The simple fact that you're using container classes, optimized/inlined or not, raises an eyebrow. All inlining does is make them harder to see; it doesn't make them go away. You gotta look at machine code to see why it's being spent. Is it just as fast as plain old arrays? Worth checking. – Mike Dunlavey Nov 03 '11 at 20:02
  • @MikeDunlavey Oh, I agree. And we will make it faster. I'm just saying that we've already fixed all the big things that took large chunks of time, so now we need to fix a bazillion tiny things that take lots of little chunks of time. But for this project, every dollar we spend on buying server hardware comes out of the bonus budget, so people are well incentivized to optimize! – Crashworks Nov 05 '11 at 22:36
  • @Crash: Sorry to keep going. It sounds like you've got powerful incentives. Now, you said you took 20 stack samples. If you want to send them to me (I'll keep them secret) I'll see if I can show you what I mean (for free). I'll be at a disadvantage not knowing the code, but the details are important - that's where you find the devil. Don't think of it as looking for "places" with high percent. Rather any description you can make of each snapshot is what you work with. This will tell you what the fix is, and you'll know for sure, no guessing. Not saying the fix will be easy. – Mike Dunlavey Nov 06 '11 at 13:53
  • @Crash: I've just posted a concrete example on *[sourceforge](https://sourceforge.net/projects/randompausedemo/files/)*. It shows speedup between 2-3 orders of magnitude over half a dozen iterations, and all the variations are included, along with the stack samples in text files, along with discussion of what they mean. – Mike Dunlavey Nov 27 '11 at 18:22

6 Answers

8

No. You'll have to use platform-specific code to do it. On x86 and x86-64, you can use 'rdtsc' to read the Time Stamp Counter.

Just port the rdtsc assembly you're using.

#include <stdint.h>

__inline__ uint64_t rdtsc(void) {
    uint32_t lo, hi;
    /* Serialize: a zeroed CPUID acts as a barrier so earlier instructions
       cannot be reordered past the TSC read. */
    __asm__ __volatile__ (
        "xorl %%eax, %%eax\n\t"
        "cpuid"
        ::: "%rax", "%rbx", "%rcx", "%rdx");
    /* We cannot use "=A", since this would use %rax on x86_64 and return
       only the lower 32 bits of the TSC. */
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}
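
Regarding the addendum: GCC (and clang) also expose these instructions as compiler intrinsics through <x86intrin.h>, so the inline assembly can be avoided entirely. A minimal sketch (a note of mine, not a drop-in replacement for the profiler):

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), __rdtscp() on x86/x86-64 GCC and clang */

static inline uint64_t tsc_now(void) {
    return __rdtsc();                /* plain RDTSC, no serialization */
}

static inline uint64_t tsc_now_ordered(unsigned int *aux) {
    return __rdtscp(aux);            /* RDTSCP; also returns the OS-defined TSC_AUX value */
}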
David Schwartz
  • 179,497
  • 17
  • 214
  • 278
  • He said his previous implementation used rdtsc but had [problems](http://msdn.microsoft.com/en-us/library/windows/desktop/ee417693(v=vs.85).aspx) and didn't like having to recalibrate. – Steven T. Snyder Oct 28 '11 at 22:33
  • 5
    Those problems are six years out of date and now part of the dustbin of history. You are unlikely to see a modern server without constant TSC. – David Schwartz Oct 28 '11 at 22:34
  • @David The problems I have are that the tsc frequency seems unreliable on our Linux servers for some reason. On all of our Windows machines, I calibrate the timer once and it remains stable, but when I do this on the Ubuntu boxes I get increasingly strange results. It might be because the processes are migrating between cores a lot more often on the Ubuntu machines, and the different cores have slightly different timestamps. – Crashworks Oct 28 '11 at 22:38
  • 2
    @David Do you have some sources to back that up? How have the timing issues related to multi-CPU synchronization and clock-throttling power saving features changed in the last few years? – Steven T. Snyder Oct 28 '11 at 22:39
  • 5
    If you check /proc/cpuinfo, you should see 'constant_tsc' on every modern CPU. If the TSCs are not synchronized across cores or are not constant when they should be, that's a bug. (You should report it.) – David Schwartz Oct 28 '11 at 23:12
  • 2
    I agree with @DavidSchwartz : e.g. for my sandy bridge box : i see `flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr dca sse4_1 sse4_2 x2apic popcnt lahf_lm ida` constant_tsc set. also you may want to refer to http://download.intel.com/design/processor/manuals/253668.pdf section 16.12.1 "Invariant TSC" – Jay D Jun 07 '12 at 23:25
2

It's hard to give a globally applicable answer because the hardware and software implementation will vary widely.

However, yes, most modern platforms will have a suitable clock_gettime call that is implemented purely in user-space using the VDSO mechanism, and will in my experience take 20 to 30 nanoseconds to complete (but see Wojciech's comment below about contention).

Internally, this is using rdtsc or rdtscp for the fine-grained portion of the time-keeping, plus adjustments to keep this in sync with wall-clock time (depending on the clock you choose) and a multiplication to convert from whatever units rdtsc has on your platform to nanoseconds.

Not all of the clocks offered by clock_gettime will implement this fast method, and it's not always obvious which ones do. Usually CLOCK_MONOTONIC is a good option, but you should test this on your own system.
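
To check, here is a minimal self-contained sketch (mine, not part of the original answer) that measures the per-call cost of a given clock, similar in spirit to the benchmark in the question:

#include <stdio.h>
#include <time.h>

int main(void) {
    enum { N = 1 << 22 };               /* ~4 million calls */
    struct timespec start, end, tmp;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < N; ++i)
        clock_gettime(CLOCK_MONOTONIC, &tmp);   /* clock under test */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per clock_gettime(CLOCK_MONOTONIC) call\n", ns / N);
    return 0;
}

If the vDSO path is being used you should see numbers in the tens of nanoseconds; a few hundred nanoseconds usually means the call is falling back to a real syscall.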

tink
  • 14,342
  • 4
  • 46
  • 50
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
  • > and will reliably take something like 20 to 30 nanoseconds to complete - This is not true. Even the vDSO-assisted flavours of clock_gettime() calls for CLOCK_MONOTONIC and CLOCK_REALTIME are *known* to take microseconds to complete in contended scenarios (most production environments in real life). – Wojciech Kudla Nov 30 '20 at 18:08
  • @WojciechKudla - fair enough, I haven't seen it but I can believe it. Is it because there is an atomic operation on the fast path to avoid time going backwards or something? If not how does contention occur? I have weakened the language I used from "reliably" to "in my experience" and added a link to your comment. – BeeOnRope Nov 30 '20 at 19:21
  • Contended in this context means there are many requests for the current time coming from multiple cores. And yes, you're right - there are some unlikely branches in the vDSO implementation of clock_gettime. – Wojciech Kudla Nov 30 '20 at 22:14
  • @WojciechKudla - that doesn't cause a problem: the lines will be shared in S state across all the cores and be accessed at full L1 speed everywhere. Without that, any sharing of r/o or mostly-r/o memory would suck on modern processors! – BeeOnRope Dec 01 '20 at 05:50
  • 1
    I'm afraid this is wrong again. I've been studying this problem for 15 years now working for ultra-low latency shops. The cache line remains in S state for a minimal amount of time before it gets invalidated by the kernel. And it's not just the raw timer values, but also skews, timer domains and some flags. When you have lots of CPUs aggressively reading a cache line while one other constantly invalidates it, you WILL have a performance bottleneck. This is the very reason why there are things like MCS lock, etc simply because contention destroys performance for shared access. – Wojciech Kudla Dec 01 '20 at 06:39
2

This is what happens when you call the clock_gettime() function.

Based on the clock you choose, it calls the corresponding function (from the kernel's vclock_gettime.c file):

int clock_gettime(clockid_t, struct __kernel_old_timespec *)
    __attribute__((weak, alias("__vdso_clock_gettime")));

notrace int
__vdso_clock_gettime_stick(clockid_t clock, struct __kernel_old_timespec *ts)
{
    struct vvar_data *vvd = get_vvar_data();

    switch (clock) {
    case CLOCK_REALTIME:
        if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
            break;
        return do_realtime_stick(vvd, ts);
    case CLOCK_MONOTONIC:
        if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
            break;
        return do_monotonic_stick(vvd, ts);
    case CLOCK_REALTIME_COARSE:
        return do_realtime_coarse(vvd, ts);
    case CLOCK_MONOTONIC_COARSE:
        return do_monotonic_coarse(vvd, ts);
    }

    /*
     * Unknown clock ID? Fall back to the syscall.
     */
    return vdso_fallback_gettime(clock, ts);
}

CLOCK_MONOTONIC is the better choice (though I use CLOCK_MONOTONIC_RAW, since it is not affected by NTP time adjustments).


This is how do_monotonic_stick() is implemented inside the kernel:

notrace static __always_inline int do_monotonic_stick(struct vvar_data *vvar,
                              struct __kernel_old_timespec *ts)
{
    unsigned long seq;
    u64 ns;

    do {
        seq = vvar_read_begin(vvar);
        ts->tv_sec = vvar->monotonic_time_sec;
        ns = vvar->monotonic_time_snsec;
        ns += vgetsns_stick(vvar);
        ns >>= vvar->clock.shift;
    } while (unlikely(vvar_read_retry(vvar, seq)));

    ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
    ts->tv_nsec = ns;

    return 0;
}

And the vgetsns() function, which provides the nanosecond resolution, is implemented as:

notrace static __always_inline u64 vgetsns(struct vvar_data *vvar)
{
    u64 v;
    u64 cycles;

    cycles = vread_tick();
    v = (cycles - vvar->clock.cycle_last) & vvar->clock.mask;
    return v * vvar->clock.mult;
}

Where the function vread_tick() reads the cycle counter from a CPU-specific register:

notrace static __always_inline u64 vread_tick(void)
{
    register unsigned long long ret asm("o4");

    __asm__ __volatile__("rd %%tick, %L0\n\t"
                 "srlx %L0, 32, %H0"
                 : "=r" (ret));
    return ret;
}

A single call to clock_gettime() takes around 20 to 100 nanoseconds. Reading the TSC with rdtsc and converting the cycles to time yourself is always faster.

I have done some experiments with CLOCK_MONOTONIC_RAW here: Unexpected periodic behaviour of an ultra low latency hard real time multi threaded x86 code
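
For completeness, here is a user-space analogue of the same idea (my sketch, not kernel code; it assumes an invariant/constant TSC as discussed in the comments above): calibrate the TSC against CLOCK_MONOTONIC once, then convert raw TSC deltas to nanoseconds, much like the mult/shift conversion the vDSO performs.

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>          /* __rdtsc() on x86/x86-64 GCC and clang */

static double ns_per_tick;      /* set once by calibrate() */

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Measure how many nanoseconds pass per TSC tick over a ~100 ms window. */
static void calibrate(void) {
    uint64_t t0 = now_ns(), c0 = __rdtsc();
    struct timespec delay = { 0, 100 * 1000 * 1000 };
    nanosleep(&delay, NULL);
    uint64_t t1 = now_ns(), c1 = __rdtsc();
    ns_per_tick = (double)(t1 - t0) / (double)(c1 - c0);
}

int main(void) {
    calibrate();
    uint64_t c0 = __rdtsc();
    /* ... code under measurement ... */
    uint64_t c1 = __rdtsc();
    printf("elapsed: %.1f ns\n", (double)(c1 - c0) * ns_per_tick);
    return 0;
}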

1

Give clockid_t CLOCK_MONOTONIC_RAW a try?

CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific) Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments or the incremental adjustments performed by adjtime(3).

From Man7.org
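
For reference, a minimal sketch of querying it (same clock_gettime() API as in the question, just a different clock id); note the comment below about CLOCK_MONOTONIC_RAW not always taking the fast vDSO path:

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) == 0)
        printf("%lld.%09ld s since an arbitrary start point\n",
               (long long)ts.tv_sec, ts.tv_nsec);
    return 0;
}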

user2548100
  • 4,571
  • 1
  • 18
  • 18
  • 2
    On x86/x86_64 Linux (at least up to 4.15) the call to read the clock is actually slower with `CLOCK_MONOTONIC_RAW` (because it doesn't use the vDSO and makes a real syscall) in comparison to when `CLOCK_MONOTONIC` is used. See the results in https://stackoverflow.com/a/13096917/9109338 for some data. If this matters make sure you check your platform's behaviour! – Anon Feb 15 '18 at 06:40
1

I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.

Have you considered oprofile or perf? You can use the performance counter hardware on your CPU to get profiling data without adding instrumentation to the code itself. You can see data per-function, or even per-line-of-code. The "only" drawback is that it won't measure wall-clock time consumed; it will measure CPU time consumed, so it's not appropriate for all investigations.

Brian Cain
  • 14,403
  • 3
  • 50
  • 88
  • If I was just profiling on my bench, those would be fine. But I'm trying to fix an embedded instrumented profiler that we've used for a while, and on which many other of our tools depend. – Crashworks Oct 29 '11 at 20:13
0

You are calling clock_gettime() with a control parameter, which means the API branches through an if-else tree to see what kind of time you want. I know you can't avoid that with this call, but see if you can dig into the system code and call what the kernel eventually calls directly. Also, I note that you are including the loop overhead (i++ and the conditional branch) in your measurement.

MartyTPS
  • 530
  • 2
  • 5