We are trying to build a very latency-sensitive application that needs to send hundreds of messages a second, each message carrying a time field, so we want to consider optimizing `gettimeofday`. Our first thought was an `rdtsc`-based optimization. Any thoughts? Any other pointers?
The required accuracy of the time value returned is in milliseconds, but it isn't a big deal if the value is occasionally out of sync with the receiver by 1-2 milliseconds. We are trying to do better than the 62 nanoseconds `gettimeofday` takes.

- To what granularity does the time need to be accurate to? – David Waters Jun 27 '11 at 21:10
- Keep in mind the timestamp counter may not be synchronized across CPUs, depending on CPU model. Also, modern Linux will implement `gettimeofday` in userspace with `rdtsc` where possible. – bdonlan Jun 27 '11 at 21:14
- Are you sure `gettimeofday()` is a problem? Which OS are you using? On Linux, IIRC, it was moved to userspace (to the vsyscall page, or the vDSO, don't remember which) to allow it to scale to lots of CPUs (was done by SGI's Christoph Lameter, IIRC). – ninjalj Jun 27 '11 at 21:18
- vsyscall had a gettimeofday, but vsyscall has been obsoleted, and its gettimeofday is now just a stub that calls into the kernel. – bdonlan Jun 27 '11 at 21:20
- @bdonlan is there any way to make sure of this? – Humble Debugger Jun 29 '11 at 12:34
- @Humble, why not just measure the amount of time `gettimeofday()` takes to see if it's fast enough for you? And how fast does it need to be? – bdonlan Jun 29 '11 at 15:22
- I wanted to add a datapoint that **YES**, `gettimeofday` *is demonstrably slow*. Currently adding real-time ETA routines to a problem I have to brute-force with billion-total-iteration nested for loops... on my i3 machine, just doing some counter incrementation with occasional status calls I get 78M ops/sec; with one `gettimeofday` that becomes 28M, with two (for start= and end=) it becomes 16M. – i336_ Mar 10 '18 at 02:14
5 Answers
POSIX Clocks
I wrote a benchmark for POSIX clock sources:
- time (s) => 3 cycles
- ftime (ms) => 54 cycles
- gettimeofday (us) => 42 cycles
- clock_gettime (ns) => 9 cycles (CLOCK_MONOTONIC_COARSE)
- clock_gettime (ns) => 9 cycles (CLOCK_REALTIME_COARSE)
- clock_gettime (ns) => 42 cycles (CLOCK_MONOTONIC)
- clock_gettime (ns) => 42 cycles (CLOCK_REALTIME)
- clock_gettime (ns) => 173 cycles (CLOCK_MONOTONIC_RAW)
- clock_gettime (ns) => 179 cycles (CLOCK_BOOTTIME)
- clock_gettime (ns) => 349 cycles (CLOCK_THREAD_CPUTIME_ID)
- clock_gettime (ns) => 370 cycles (CLOCK_PROCESS_CPUTIME_ID)
- rdtsc (cycles) => 24 cycles
These numbers are from an Intel Core i7-4771 CPU @ 3.50GHz on Linux 4.0. The measurements were taken by running each clock method thousands of times, timing each call with the TSC register, and taking the minimum cost value.
You'll want to test on the machines you intend to run on, though, as the implementations vary with hardware and kernel version. The code can be found here. It relies on the TSC register for cycle counting, which is in the same repo (tsc.h).
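Not the code from the linked repo - just a minimal sketch of the same measurement idea, assuming an x86 compiler with the `__rdtsc` intrinsic: time each clock with the TSC, repeat many times, keep the minimum. (Older glibc may need `-lrt` to link `clock_gettime`.)
#include <cstdint>
#include <cstdio>
#include <time.h>
#include <x86intrin.h>   // __rdtsc (GCC/Clang on x86)

// Minimum observed TSC cost of one clock_gettime() call for a given clock.
static uint64_t min_cycles(clockid_t clk, int iters)
{
    uint64_t best = UINT64_MAX;
    timespec ts;
    for (int i = 0; i < iters; i++) {
        uint64_t start = __rdtsc();
        clock_gettime(clk, &ts);
        uint64_t end = __rdtsc();
        if (end - start < best)
            best = end - start;
    }
    return best;
}

int main()
{
    printf("CLOCK_REALTIME          %llu cycles\n",
           (unsigned long long)min_cycles(CLOCK_REALTIME, 100000));
    printf("CLOCK_MONOTONIC         %llu cycles\n",
           (unsigned long long)min_cycles(CLOCK_MONOTONIC, 100000));
    printf("CLOCK_MONOTONIC_COARSE  %llu cycles\n",
           (unsigned long long)min_cycles(CLOCK_MONOTONIC_COARSE, 100000));
    return 0;
}
Note that the reported figure includes the overhead of the two `__rdtsc` reads themselves, which is why taking the minimum over many iterations matters.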
TSC
Accessing the TSC (processor time-stamp counter) is the most accurate and cheapest way to time things. Generally, this is what the kernel is using itself. It's also quite straightforward on modern Intel chips, as the TSC is synchronized across cores and unaffected by frequency scaling, so it provides a simple, global time source. You can see an example of using it here, with a walkthrough of the assembly code here.
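For illustration (not the linked example), reading the counter with the compiler intrinsics on GCC/Clang looks roughly like this:
#include <cstdint>
#include <x86intrin.h>   // __rdtsc / __rdtscp (GCC and Clang, x86)

// Read the time-stamp counter. __rdtscp waits for earlier instructions to
// finish before sampling the counter, which matters when timing a region;
// plain __rdtsc() is even cheaper if you don't need that ordering.
static inline uint64_t read_tsc()
{
    unsigned int aux;    // receives IA32_TSC_AUX (encodes the core id)
    return __rdtscp(&aux);
}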
The main issue with this (other than portability) is that there doesn't seem to be a good way to go from cycles to nanoseconds. The Intel docs, as far as I can find, state that the TSC runs at a fixed frequency, but that this frequency may differ from the processor's stated frequency. Intel doesn't appear to provide a reliable way to figure out the TSC frequency. The Linux kernel appears to solve this by testing how many TSC cycles occur between two hardware timers (see here).
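A common userspace approximation (not what the kernel does, and only a sketch) is to calibrate the TSC once at startup against a known clock:
#include <cstdint>
#include <time.h>        // clock_gettime, nanosleep
#include <x86intrin.h>

// Rough estimate of the TSC frequency: sample the TSC across a known
// CLOCK_MONOTONIC interval. Longer intervals give better estimates.
static double estimate_tsc_hz()
{
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();

    timespec pause = {0, 100000000L};   // ~100 ms
    nanosleep(&pause, NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    uint64_t c1 = __rdtsc();

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (c1 - c0) / seconds;
}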
Memcached
Memcached bothers to do the cache method. It may simply be to make sure the performance is more predictable across platforms, or to scale better with multiple cores. It may also not be a worthwhile optimization.

- On your github link you have the same results, but in nanoseconds, differing from what you write here by a factor of 1000. – nh2 May 03 '13 at 04:38
- Your cached_clock method is just hiding the time consumed by the measuring in another thread. Plus it's far more inaccurate compared to the other methods. See, if your main loop goes well and current_time==2, you're adding 2 msec each time you loop. – Gull_Code Sep 19 '13 at 14:23
- @Gull_Code Sure, it is hiding the cost in another thread; saying 0ns is not accurate at all. The 'cost' of it though really depends on your application. The point is that you pay the cost once per cycle of the granularity you care about, not every time that you need the time. And yes, accuracy isn't great, but once again it depends a lot on the use case. For something like memcached it's perfectly fine. For a paxos implementation it's probably terrible. – David Terei Oct 10 '13 at 01:17
- How can you even benchmark with nanosecond accuracy? Is there a way to ensure that your program is the only one executing and that no context switches are allowed? – Lundin Feb 03 '14 at 12:45
- @Lundin you run a **LOT** of rounds - enough where the context switches factor out. – haneefmubarak Jun 08 '15 at 03:43
- @haneefmubarak My comment was actually a rhetorical question, hinting that none of this makes any sense unless the program is running on a system with real-time performance. – Lundin Jun 08 '15 at 06:15
- I think taking the minimum value is nonsense. The caches and branch predictors will be warmed up by the repeated calls. However, when you make these calls in real use, chances are they will be called very rarely, and in those cases the caches will most likely be cold (with respect to reading the time). – Chang Hyun Park Sep 12 '16 at 04:52
- I'm not sure it's plausible for `time` to take only 3 clock cycles on your Haswell. Or maybe it's reading a shared counter from memory instead of using `rdtsc`? As you found, `rdtsc` has a throughput of 1 per 24 cycles on Haswell (https://agner.org/optimize/). Did you check the compiler-generated asm to make sure it didn't optimize away some work? – Peter Cordes Aug 18 '18 at 15:56
- I just checked, and `time(NULL)` does in fact jump (via the PLT) to code in the VDSO page which just loads a value from memory. `mov rax,QWORD PTR [rip+0xffffffffffffc1fd] # 0x7ffff7ff70a8`. The code comes from [`__vdso_time()` in `arch/x86/entry/vdso/vclock_gettime.c`](https://github.com/torvalds/linux/blob/dd53a4214d4ff450b66ca7d2e51d9369e3266ebf/arch/x86/entry/vdso/vclock_gettime.c#L318). If that could inline to a `movabs` and avoid the PLT + call + branch overhead, we'd have 2-per-clock throughput for reading the time with 1-second resolution :P – Peter Cordes Aug 19 '18 at 09:45
- @Peter - clock_gettime(CLOCK_MONOTONIC_COARSE) is also "faster than rdtsc" and also reads from a memory location in the VDSO. It does a bit more math though, so it ends up quite a bit more expensive than time(), but is sometimes much more useful since it has a higher resolution. It is a shame it isn't even faster, although you can always "roll your own" with a periodic signal (or a thread that sleeps) that updates a shared memory location - then you really can have your 1-uop reads of a high(ish) resolution clock. – BeeOnRope Aug 19 '18 at 14:35
- On Linux 5.3 and above kernels, calling `clock_gettime(CLOCK_MONOTONIC_RAW)` should take the same number of cycles as `clock_gettime(CLOCK_MONOTONIC)` thanks to [\[PATCH v7 00/25\] Unify vDSOs across more architectures](https://lore.kernel.org/linux-arm-kernel/20190621095252.32307-1-vincenzo.frascino@arm.com/). – Anon Oct 31 '19 at 06:01
Have you actually benchmarked, and found `gettimeofday` to be unacceptably slow?
At the rate of 100 messages a second, you have 10ms of CPU time per message. If you have multiple cores, assuming it can be fully parallelized, you can easily increase that by 4-6x - that's 40-60ms per message! The cost of gettimeofday is unlikely to be anywhere near 10ms - I'd suspect it to be more like 1-10 microseconds (on my system, microbenchmarking it gives about 1 microsecond per call - try it for yourself). Your optimization efforts would be better spent elsewhere.
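The "try it for yourself" program isn't reproduced here, but a crude sketch of that kind of measurement would be something like this: time a large batch of calls and divide.
#include <cstdio>
#include <sys/time.h>

int main()
{
    const long iters = 10000000;        // 10 million calls
    timeval start, end, tmp;

    gettimeofday(&start, NULL);
    for (long i = 0; i < iters; i++)
        gettimeofday(&tmp, NULL);       // the call being measured
    gettimeofday(&end, NULL);

    double us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_usec - start.tv_usec);
    printf("~%.0f ns per gettimeofday() call\n", us * 1000.0 / iters);
    return 0;
}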
While using the TSC is a reasonable idea, modern Linux already has a userspace TSC-based `gettimeofday` - where possible, the vDSO will pull in an implementation of `gettimeofday` that applies an offset (read from a shared kernel-user memory segment) to `rdtsc`'s value, thus computing the time of day without entering the kernel. However, some CPU models don't have a TSC synchronized between different cores or different packages, and so this can end up being disabled. If you want high performance timing, you might first want to consider finding a CPU model that does have a synchronized TSC.
That said, if you're willing to sacrifice a significant amount of resolution (your timing will only be accurate to the last tick, meaning it could be off by tens of milliseconds), you could use CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE with `clock_gettime`. This is also implemented in the vDSO, and guaranteed not to call into the kernel (for recent kernels and glibc).
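For example, a millisecond wall-clock timestamp from the coarse clock might look like this (only a sketch; `CLOCK_REALTIME_COARSE` is Linux-specific, and older glibc may need `-lrt`):
#include <cstdint>
#include <cstdio>
#include <time.h>

// Millisecond wall-clock timestamp via the coarse clock; on recent Linux
// this is served from the vDSO without entering the kernel, at the cost
// of tick-level resolution.
static int64_t coarse_wall_ms()
{
    timespec ts;
    clock_gettime(CLOCK_REALTIME_COARSE, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

int main()
{
    printf("%lld ms since the epoch\n", (long long)coarse_wall_ms());
    return 0;
}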

- Each process is single threaded. The server will typically have 10-20 such processes running. – Humble Debugger Jun 29 '11 at 12:38
- "CPU model that does have a synchronized TSC" - I have a Xeon 5680, will research its handling of this. – Humble Debugger Jun 29 '11 at 12:41
- @Humble, check for "Marking TSC unstable" in your dmesg. If it's there, you're not using TSC. But always, always benchmark before you try to optimize. Not only do you not know if it's fast enough to start, if you don't benchmark, you'll never know if you make an improvement... – bdonlan Jun 29 '11 at 15:23
- @bdonlan `dmesg | grep TSC` says `Fast TSC calibration using PIT` – Humble Debugger Jun 29 '11 at 15:38
- Chances are that, if that's all you see, you're using the TSC. But still, benchmark `gettimeofday()`. Is it fast enough? – bdonlan Jun 29 '11 at 15:56
- Getting about 178 cycles for `gettimeofday()`, so about 0.06 microseconds per call. – Humble Debugger Jun 29 '11 at 16:31
- @Humble, well, that's a lot faster than mine (my laptop, it turns out, has an unstable TSC and calls into the kernel to use the HPET) - hopefully 178 cycles is fast enough for you? :) – bdonlan Jun 29 '11 at 17:02
- We had benchmarks showing `gettimeofday()` with `7us` per call and `clock_gettime()` with `9us` per call. However, as you say, `gettimeofday()` could give wrong timing by up to `10ms`; I think it's reasonable to accept `gettimeofday()` even if it's `2us` slower. – Viet Jul 27 '12 at 08:47
- @bdonlan your "check it for yourself" program does not compile on clang (so I can't run it on OS X) – Steven Lu Dec 30 '13 at 04:58
- @bdonlan can you expand on why `clock_gettime()` is significantly less accurate than `gettimeofday()`? What is meant by "accurate to the last tick"? The reason this is relevant is because on my system Qt uses `clock_gettime()` to implement timer events, presumably because it's monotonic and not wall time. Why can't `clock_gettime()` be as accurate as `gettimeofday()`? – MuchToLearn Nov 29 '16 at 20:39
Like bdonlan says, if you're only sending a few hundred messages per second, `gettimeofday` is going to be fast enough.
However, if you were sending millions of messages per second, it might be different (but you should still measure that it is a bottleneck). In that case, you might want to consider something like this:
- have a global variable, giving the current timestamp in your desired accuracy
- have a dedicated background thread that does nothing except update the timestamp (if timestamp should be updated every T units of time, then have the thread sleep some fraction of T and then update the timestamp; use real-time features if you need to)
- all other threads (or the main process, if you don't use threads otherwise) just read the global variable
The C language does not guarantee that you can read the timestamp value atomically if it is larger than `sig_atomic_t`. You could use locking to deal with that, but locking is heavy. Instead, you could use a `volatile sig_atomic_t` typed variable to index an array of timestamps: the background thread updates the next element in the array, and then updates the index. The other threads read the index, and then read the array: they might get a tiny bit out-of-date timestamp (but they get the right one next time), but they do not run into the problem where they read the timestamp at the same time it is being updated, and get some bytes of the old value and some of the new value.
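A minimal sketch of that scheme (hypothetical names, using a background std::thread; a modern implementation would more likely use C11/C++11 atomics; compile with -pthread):
#include <csignal>       // sig_atomic_t
#include <cstdio>
#include <sys/time.h>
#include <thread>
#include <unistd.h>      // usleep

// Ring of timestamps plus a small index the readers can load atomically.
// With several slots, a reader would have to stall for multiple update
// periods before it could ever observe a slot being rewritten.
static const int SLOTS = 4;
static struct timeval g_stamps[SLOTS];
static volatile sig_atomic_t g_idx = 0;

// Background thread: refresh the *next* slot, then publish it by bumping
// the index, so readers never see a half-written timeval.
static void updater(int period_us)
{
    for (;;) {
        int next = (g_idx + 1) % SLOTS;
        gettimeofday(&g_stamps[next], NULL);
        g_idx = next;
        usleep(period_us);
    }
}

// Reader side: load the index, then copy that slot.
static struct timeval current_time()
{
    return g_stamps[g_idx];
}

int main()
{
    gettimeofday(&g_stamps[0], NULL);
    std::thread t(updater, 1000);    // refresh roughly every millisecond
    t.detach();

    usleep(5000);
    struct timeval tv = current_time();
    printf("cached time: %ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
    return 0;
}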
But all this is much overkill for just hundreds of messages per second.
- "have a dedicated background thread that does nothing except update the timestamp (if timestamp should be updated every T units of time" <-- this is exactly what CLOCK_*_COARSE does, except the dedicated thread is actually an interrupt handler and is system-wide, and the kernel folks have already dealt with the read tearing and other issues for you :) – bdonlan Jun 27 '11 at 21:47
- I'm not sure that would be faster than Linux's `gettimeofday()`: every write would potentially cause a cache miss on every reader on SMP. – ninjalj Jun 27 '11 at 21:50
- Come to think of it, are vvars cpu-local on Linux? If so, that'd be another major advantage of CLOCK_*_COARSE... Edit: Looks like not (http://lxr.linux.no/linux+v2.6.39/arch/x86/kernel/vsyscall_64.c#L76), but invalidating a cache line or two is better than interrupting all CPUs with a local timer interrupt or IPI, I suppose. – bdonlan Jun 27 '11 at 21:50
- Lars, it is not a question of how many times a second; the application wants to construct a message and send it out as soon as possible to the receiver, and is competing with other senders. This is a trading application, so in every message to the receiver, no matter how low or high the frequency, we would like to shave off microseconds. – Humble Debugger Jun 29 '11 at 12:37
Below is a benchmark. I see about 30 ns per call. printTime() is from rashad's answer to "How to get current time and date in C++?"
#include <string>
#include <iostream>
#include <ctime>        // time(), localtime(), strftime()
#include <sys/time.h>   // gettimeofday()
using namespace std;

// Print a time_t as "YYYY-MM-DD.HH:MM:SS" (from rashad's answer).
void printTime(time_t now)
{
    struct tm tstruct;
    char buf[80];
    tstruct = *localtime(&now);
    strftime(buf, sizeof(buf), "%Y-%m-%d.%X", &tstruct);
    cout << buf << endl;
}

int main()
{
    timeval tv;
    time_t tm;

    // Time 100,000,000 calls to gettimeofday() by printing wall-clock
    // timestamps before and after the loop.
    gettimeofday(&tv, NULL);
    printTime((time_t)tv.tv_sec);
    for (int i = 0; i < 100000000; i++)
        gettimeofday(&tv, NULL);
    gettimeofday(&tv, NULL);
    printTime((time_t)tv.tv_sec);

    // Same again for time(NULL).
    printTime(time(NULL));
    for (int i = 0; i < 100000000; i++)
        tm = time(NULL);
    printTime(time(NULL));

    (void)tm;
    return 0;
}
3 seconds for 100,000,000 calls, or about 30 ns per call:
2014-03-20.09:23:35
2014-03-20.09:23:38
2014-03-20.09:23:38
2014-03-20.09:23:41
Do you need millisecond precision? If not, you could simply use `time()` and deal with the Unix timestamp.

- Comparison of `time()` and `gettimeofday()`: 60 nanoseconds versus 62 nanoseconds. Not much; we need to do much better. – Humble Debugger Jun 29 '11 at 18:43
- Maybe have a thread with: `global_unix_ts = time(); sleep 500ms;`. The global var is not even protected by a mutex. This should be lightning fast. bdonlan's answer seems to be very elegant and complete too. – Vinicius Kamakura Jun 29 '11 at 18:47