4

I am implementing some data structure in which I need to invalidate some entries after some time, so for each entry I need to maintain its insertion timestamp. When I get an entry I need to get a timestamp again and calculate the elapsed time from the insertion (if it's too old, I can't use it).

This data structure is highly contented by many threads, so I must get this timestamp (on insert and find) in the most efficient way possible. Efficiency is extremely important here.

If it matters, I am working on a linux machine, developing in C++. What is the most efficient way to retrieve a timestamp?

BTW, in some old project I was working on, I remember I saw some assembly command which gets a timestamp directly from the CPU (can't remember the command).

michael
  • 530
  • 4
  • 16
  • 1
    linux's `gettimeofday` is highly optimized, just because it is frequently called by common software (for example in "rollback" features in databases, where multiple clients try to write to the same part of the database, you need to know who was first). – Mats Petersson Feb 04 '18 at 15:04
  • Of course, if you want to just invalidate entries after "not forever", you could just increment a counter for each cache-entry, and if it's more than 10, 100, 1000 or 4711 "other cache entry counts" old, you invalidate it. This of course isn't ideal if you are sure that 1 minute old entries will be OK, but those that are more than 5 minutes must be re-loaded. – Mats Petersson Feb 04 '18 at 15:06
  • I'd be very surprised if [std::chrono::system_clock::now()](http://en.cppreference.com/w/cpp/chrono/system_clock/now) was not *more* than fast enough. – Jesper Juhl Feb 04 '18 at 15:08
  • @user202729 `time()` only has 1sec resolution. Fine for many things, but extremely inaccurate for many use cases. In any case, I don't see why one would not just use `std::chrono` in this day and age. – Jesper Juhl Feb 04 '18 at 15:12
  • I am familiar with the methods mentioned above, but not sure which method is the fastest. I looked at `gettimeofday` implementation and it looks like it does more than I need. I believe reading some CPU register will be most efficient. – michael Feb 04 '18 at 15:12
  • Obviously depending on what processor you're using but for many implementations, the `gettimeofday` in Linux is essentially "read one 64-bit value, read time-stamp counter, and some shuffling to make it into seconds a nanoseconds". It's hard to beat that. Reading timestamp counter on its own is tough, because it doesn't provide an actual time, and it isn't guaranteed to be stable between CPUs in a multicore/multiprocessor system. – Mats Petersson Feb 04 '18 at 15:16
  • Note also that the `rdtsc` instruction in x86 is quite slow [because it is serializing - it waits for prior issued instructions to complete], so the "extra" in `gettimeofday` is probably dwarfed by this serialization. – Mats Petersson Feb 04 '18 at 15:18
  • And of course, reading timestamps directly is definitely not portable! :) – Mats Petersson Feb 04 '18 at 15:18
  • 3
    @michael Then try benchmarking. – user202729 Feb 04 '18 at 15:19
  • I don't need the actual time, and I don't need any conversions. I need the insertion timestamp and current timestamp. The timestamp can be cpu ticks or whatever. I can pre-calculate my entry timeout in ticks (based on the cpu rate), and use only ticks. Accuracy is less important for me in this case. – michael Feb 04 '18 at 15:22
  • @user202729 I thought of trying my luck here first, this is my next step. :-) – michael Feb 04 '18 at 15:37
  • [Read this](https://meta.stackoverflow.com/a/261593/5267751)? (Meta StackOverflow) – user202729 Feb 04 '18 at 15:38
  • Assembler could be an idea – deW1 Feb 04 '18 at 17:17

1 Answers1

10

I have created the following benchmark to test several methods for retrieving a timestamp. The benchmark was compiled with GCC with -O2, and tested on my mac. I have measured the time it takes to get 1M timestamps for each method, and from the results it looks like rdtsc is faster than the others.

EDIT: The benchmark was modified to support multiple threads.

Benchmark code:

#include <iostream>
#include <chrono>
#include <sys/time.h>
#include <unistd.h>
#include <vector>
#include <thread>
#include <atomic>

#define NUM_SAMPLES 1000000
#define NUM_THREADS 4

static inline unsigned long long getticks(void)
{
    unsigned int lo, hi;

    // RDTSC copies contents of 64-bit TSC into EDX:EAX
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return (unsigned long long)hi << 32 | lo;
}

std::atomic<bool> g_start(false);
std::atomic<unsigned int> totalTime(0);

template<typename Method>
void measureFunc(Method method)
{
    // warmup
    for (unsigned int i = 0; i < NUM_SAMPLES; i++)
    {   
        method();
    }

    auto start = std::chrono::system_clock::now();

    for (unsigned int i = 0; i < NUM_SAMPLES; i++)
    {   
        method();
    }

    auto end = std::chrono::system_clock::now();
    totalTime += std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

template<typename Method>
void measureThread(Method method)
{
    while(!g_start.load());

    measureFunc(method);
}

template<typename Method>
void measure(const std::string& methodName, Method method)
{
    std::vector<std::thread> threads;

    totalTime.store(0);
    g_start.store(false);

    for (unsigned int i = 0; i < NUM_THREADS; i++)
    {
        threads.push_back(std::thread(measureThread<Method>, method));
    }

    g_start.store(true);

    for (std::thread& th : threads)
    {
        th.join();
    }

    double timePerThread = (double)totalTime / (double)NUM_THREADS;

    std::cout << methodName << ": " << timePerThread << "ms per thread" << std::endl;
}

int main(int argc, char** argv)
{
    measure("gettimeofday", [](){ timeval tv; return gettimeofday(&tv, 0); });
    measure("time", [](){ return time(NULL); });
    measure("std chrono system_clock", [](){ return std::chrono::system_clock::now(); });
    measure("std chrono steady_clock", [](){ return std::chrono::steady_clock::now(); });
    measure("clock_gettime monotonic", [](){ timespec tp; return clock_gettime(CLOCK_MONOTONIC, &tp); });
    measure("clock_gettime cpu time", [](){ timespec tp; return clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp); });
    measure("rdtsc", [](){ return getticks(); });

    return 0;
}

Results (in milliseconds) for a single thread:

gettimeofday: 54ms per thread
time: 260ms per thread
std chrono system_clock: 62ms per thread
std chrono steady_clock: 60ms per thread
clock_gettime monotonic: 102ms per thread
clock_gettime cpu time: 493ms per thread
rdtsc: 8ms per thread

With 4 threads:

gettimeofday: 55.25ms per thread
time: 292.5ms per thread
std chrono system_clock: 69.25ms per thread
std chrono steady_clock: 68.5ms per thread
clock_gettime monotonic: 118.25ms per thread
clock_gettime cpu time: 2975.75ms per thread
rdtsc: 10.25ms per thread

From the results, it looks like std::chrono has some small overhead when called from multiple threads, the gettimeofday method stays stable as the number of threads increases.

michael
  • 530
  • 4
  • 16
  • You might also try with `steady_clock`, it's better for cache expiry than `system_clock`, and could be faster since it doesn't have to convert to absolute time. – Ben Voigt Feb 06 '18 at 15:49
  • 1
    Added `steady_clock`, got similar result as `system_clock`. – michael Feb 06 '18 at 16:42
  • Using the raw output of `rdtsc` might not be precise due to (mainly) power state switching going on in the CPU, and various steppings/models/families of x86 CPUs might return different results depending on this. Also keep in mind that `rdtsc` is not a serializing instruction and some instructions might still be executing or queued for execution : thus, you might be reading the timestamp *before* actually executing the thing that you want to measure. – Daniel Kamil Kozar Feb 06 '18 at 16:49
  • 1
    @Daniel Kamil Kozar You are right. When using `rdtsc` you should understand its limitations. In my case, the CPU is always in full power and accuracy is not that of important while performance is critical. – michael Feb 06 '18 at 16:55
  • 1
    If desired, [here is an answer](https://stackoverflow.com/a/11485388/576911) that shows how to wrap `rdtsc` up in a `chrono`-style clock to make it easier to use. This gives you time_points and durations related to your `rdtsc` measurements with no additional overhead. – Howard Hinnant Feb 07 '18 at 15:09
  • Any system overhead during the calibration interval will throw off all your future measurements. `rdtsc` ticks at the rated CPU frequency. [Calculate system time using rdtsc](https://stackoverflow.com/a/42190816) has some info about finding the right frequency programatically, but I'm not sure you can do it without privileged instructions. The model string from CPUID often includes something like `4.0GHz`, but that might not be reliable. – Peter Cordes Aug 18 '18 at 16:24
  • For a portable way to use `rdtsc`, see [Get CPU cycle count?](https://stackoverflow.com/a/51907627). – Peter Cordes Aug 18 '18 at 16:25