
I have used the code below to get the clock cycle count of the processor:

/* Read the CPU's Time Stamp Counter using the RDTSC instruction.
   RDTSC returns the low 32 bits in EAX and the high 32 bits in EDX. */
unsigned long long rdtsc(void)
{
  unsigned hi, lo;
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

I get some value, say 43, but what is the unit here? Is it in microseconds or nanoseconds?

I used the command below to get the current frequency of my board:

cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
1700000

I also used the command below to find my processor speed:

dmidecode -t processor | grep "Speed"
Max Speed: 3700 MHz
Current Speed: 3700 MHz

Now how do I use the above frequency to convert the cycle count to microseconds or milliseconds?

  • It's a cycle counter, not a real-time counter. If your OS changes the CPU frequency to conserve power (for instance), this won't be a stable clock. See [the Wikipedia page](https://en.wikipedia.org/wiki/Time_Stamp_Counter). – unwind Feb 24 '17 at 10:20
  • It's ok, but is it in microseconds or nanoseconds? – arceus Feb 24 '17 at 12:13
  • @arceus: Neither. It is a clock cycle counter, so it is in units of 1/cpu_frequency. If the clock is running at a fixed 3700 MHz rate, then each clock cycle is 1/3,700,000,000 seconds long, or (1.0/3.7) nanoseconds (about 0.27027 nanoseconds). As unwind already mentioned, the CPU frequency is not necessarily stable. Finally, if you `#include <x86intrin.h>`, you can just use `__rdtsc()`. (If you limit to GCC, just use `__builtin_ia32_rdtsc()` – it works on both x86 and x86-64.) – Nominal Animal Feb 24 '17 at 17:53
  • @NominalAnimal Can you please provide an example where `rdtsc` intrinsic is used? – arceus Feb 27 '17 at 05:39
  • @unwind Is there any real-time counter that I can use during early boot time for x86? – arceus Feb 27 '17 at 06:40

1 Answer


A simple answer to the stated question, "how do I convert the TSC frequency to microseconds or milliseconds?", is: you do not. The actual TSC (Time Stamp Counter) clock frequency depends on the hardware, and on some machines it varies during runtime. To measure real time, use clock_gettime(CLOCK_REALTIME) or clock_gettime(CLOCK_MONOTONIC) in Linux.
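As a minimal sketch of the latter (the measured code is a placeholder; a fuller, more careful version appears near the end of this answer):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec  started, stopped;

    clock_gettime(CLOCK_MONOTONIC, &started);
    /* ... the code to be measured goes here ... */
    clock_gettime(CLOCK_MONOTONIC, &stopped);

    printf("Elapsed: %.9f seconds\n",
           (double)(stopped.tv_sec - started.tv_sec)
         + (double)(stopped.tv_nsec - started.tv_nsec) / 1000000000.0);
    return 0;
}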

As Peter Cordes mentioned in a comment (Aug 2018), on most current x86-64 architectures the Time Stamp Counter (accessed by the RDTSC instruction and the __rdtsc() function declared in <x86intrin.h>) counts reference clock cycles, not CPU clock cycles. His answer to a similar C++ question is valid for C as well in Linux on x86-64, because the compiler provides the underlying built-in when compiling either C or C++, and the rest of that answer deals with the hardware details. I recommend reading it, too.

The rest of this answer assumes the underlying issue is microbenchmarking code, to find out how two implementations of some function compare to each other.


On x86 (Intel 32-bit) and x86-64 (AMD64, Intel and AMD 64-bit) architectures, you can use __rdtsc() from <x86intrin.h> to find out the number of TSC clock cycles elapsed. This can be used to measure and compare the number of cycles used by different implementations of some function, typically a large number of times.
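As a minimal sketch (the measured code is a placeholder), the intrinsic replaces the inline assembly in the question:

#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    unsigned long long  started, finished;

    started = __rdtsc();
    /* ... the code to be measured goes here ... */
    finished = __rdtsc();

    printf("Elapsed: %llu TSC cycles\n", finished - started);
    return 0;
}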

Do note that there are hardware differences as to how the TSC clock is related to the CPU clock. The above-mentioned more recent answer goes into some detail on that. For practical purposes in Linux, it is sufficient to use cpufreq-set to disable frequency scaling (to ensure the relationship between the CPU and TSC frequencies does not change during microbenchmarking), and optionally taskset to restrict the microbenchmark to specific CPU core(s). That ensures the results gathered in the microbenchmark can be compared to each other.
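For example (assuming the cpufrequtils package provides cpufreq-set, and ./bench is a placeholder for the benchmark binary), one might pin CPU core 0 to the performance governor and run the benchmark on that core only:

sudo cpufreq-set -c 0 -g performance
taskset -c 0 ./bench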

(As Peter Cordes commented, we also want to add _mm_lfence() from <emmintrin.h> (included by <immintrin.h>). This ensures that the CPU does not internally reorder the RDTSC operation with respect to the function to be benchmarked. You can define -DNO_LFENCE at compile time to omit these, if you want.)

Let's say you have functions void foo(void); and void bar(void); that you wish to compare:

#include <stdlib.h>
#include <x86intrin.h>
#include <stdio.h>

#ifdef    NO_LFENCE
#define   lfence()
#else
#include <emmintrin.h>
#define   lfence()  _mm_lfence()
#endif

static int cmp_ull(const void *aptr, const void *bptr)
{
    const unsigned long long  a = *(const unsigned long long *)aptr;
    const unsigned long long  b = *(const unsigned long long *)bptr;
    return (a < b) ? -1 :
           (a > b) ? +1 : 0;
}

unsigned long long *measure_cycles(size_t count, void (*func)(void))
{
    unsigned long long  *elapsed, started, finished;
    size_t               i;

    elapsed = malloc((count + 2) * sizeof elapsed[0]);
    if (!elapsed)
        return NULL;

    /* Call func() count times, measuring the TSC cycles for each call. */
    for (i = 0; i < count; i++) {
        /* First, let's ensure our CPU executes everything thus far. */
        lfence();
        /* Start timing. */
        started = __rdtsc();
        /* Ensure timing starts before we call the function. */
        lfence();
        /* Call the function. */
        func();
        /* Ensure everything has been executed thus far. */
        lfence();
        /* Stop timing. */
        finished = __rdtsc();
        /* Ensure we have the counter value before proceeding. */
        lfence();

        elapsed[i] = finished - started;
    }

    /* The very first call is likely the cold-cache case,
       so in case that measurement might contain useful
       information, we put it at the end of the array.
       We also terminate the array with a zero. */
    elapsed[count] = elapsed[0];
    elapsed[count + 1] = 0;

    /* Sort the cycle counts. */
    qsort(elapsed, count, sizeof elapsed[0], cmp_ull);

    /* This function returns all cycle counts, in sorted order,
       although the median, elapsed[count/2], is the one
       I personally use. */
    return elapsed;
}

void benchmark(const size_t count)
{
    unsigned long long  *foo_cycles, *bar_cycles;

    if (count < 1)
        return;

    printf("Measuring run time in Time Stamp Counter cycles:\n");
    fflush(stdout);

    foo_cycles = measure_cycles(count, foo);
    bar_cycles = measure_cycles(count, bar);
    if (!foo_cycles || !bar_cycles) {
        /* measure_cycles() returns NULL if the allocation failed. */
        free(bar_cycles);
        free(foo_cycles);
        fprintf(stderr, "Out of memory!\n");
        return;
    }

    printf("foo(): %llu cycles (median of %zu calls)\n", foo_cycles[count/2], count);
    printf("bar(): %llu cycles (median of %zu calls)\n", bar_cycles[count/2], count);

    free(bar_cycles);
    free(foo_cycles);
}
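To actually run the above, you need definitions of foo() and bar() and a main(). A minimal hypothetical driver follows (the empty bodies and the call count are placeholders; foo() and bar() must be defined, or at least declared, before benchmark()):

void foo(void) { /* first implementation to compare */ }
void bar(void) { /* second implementation to compare */ }

int main(void)
{
    benchmark(100000);
    return 0;
}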

Note that the above results are very specific to the compiler and compiler options used, and of course to the hardware it is run on. The median number of cycles can be interpreted as "the typical number of TSC cycles taken", because the measurement is not completely reliable (it may be affected by events outside the process; for example, by context switches, or by migration to another core on some CPUs). For the same reason, I don't trust the minimum, maximum, or average values.

However, the cycle counts of the two implementations (foo() and bar()) above can be compared to find out how their performance compares to each other, in a microbenchmark. Just remember that microbenchmark results may not extend to real work tasks, because of how complex the resource-use interactions of real tasks are. One function might be superior in all microbenchmarks, but poorer than the others in the real world, because it is only efficient when it has lots of CPU cache to use, for example.

 

In Linux in general, you can use the CLOCK_REALTIME clock to measure real time (wall clock time) used, in the very same manner as above. CLOCK_MONOTONIC is even better, because it is not affected by direct changes to the realtime clock the administrator might make (say, if they noticed the system clock is ahead or behind); only drift adjustments due to NTP etc. are applied. Daylight saving time changes do not affect the measurements, using either clock. Again, the median of a number of measurements is the result I seek, because events outside the measured code itself can affect the result.

For example:

#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#ifdef   NO_LFENCE
#define  lfence()
#else
#include <emmintrin.h>
#define  lfence() _mm_lfence()
#endif

static int cmp_double(const void *aptr, const void *bptr)
{
    const double a = *(const double *)aptr;
    const double b = *(const double *)bptr;
    return (a < b) ? -1 :
           (a > b) ? +1 : 0;
}

double median_seconds(const size_t count, void (*func)(void))
{
    struct timespec started, stopped;
    double         *seconds, median;
    size_t          i;

    seconds = malloc(count * sizeof seconds[0]);
    if (!seconds)
        return -1.0;

    for (i = 0; i < count; i++) {
        lfence();
        clock_gettime(CLOCK_MONOTONIC, &started);
        lfence();
        func();
        lfence();
        clock_gettime(CLOCK_MONOTONIC, &stopped);
        lfence();
        seconds[i] = (double)(stopped.tv_sec - started.tv_sec)
                   + (double)(stopped.tv_nsec - started.tv_nsec) / 1000000000.0;
    }

    qsort(seconds, count, sizeof seconds[0], cmp_double);
    median = seconds[count / 2];
    free(seconds);
    return median;
}

static double realtime_precision(void)
{
    struct timespec t;

    if (clock_getres(CLOCK_REALTIME, &t) == 0)
        return (double)t.tv_sec
             + (double)t.tv_nsec / 1000000000.0;

    return 0.0;
}

void benchmark(const size_t count)
{
    double median_foo, median_bar;
    if (count < 1)
        return;

    printf("Median wall clock times over %zu calls:\n", count);
    fflush(stdout);

    median_foo = median_seconds(count, foo);
    median_bar = median_seconds(count, bar);
    if (median_foo < 0.0 || median_bar < 0.0) {
        /* median_seconds() returns -1.0 if the allocation failed. */
        fprintf(stderr, "Out of memory!\n");
        return;
    }

    printf("foo(): %.3f ns\n", median_foo * 1000000000.0);
    printf("bar(): %.3f ns\n", median_bar * 1000000000.0);

    printf("(Measurement unit is approximately %.3f ns)\n", 1000000000.0 * realtime_precision());
    fflush(stdout);
}

 

In general, I personally prefer to compile the benchmarked function in a separate unit (into a separate object file), and also to benchmark a do-nothing function to estimate the function call overhead. Note that this tends to overestimate the overhead, because some of the function call overhead is latency rather than actual time taken, and some operations can overlap with that latency in the actual functions.
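For example (the file names are placeholders; the separate compilation keeps the compiler from inlining the benchmarked calls or optimizing them away):

gcc -Wall -O2 -c foo.c
gcc -Wall -O2 -c bar.c
gcc -Wall -O2 benchmark.c foo.o bar.o -o bench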

It is important to remember that the above measurements should only be used as indications, because in a real world application, things like cache locality (especially on current machines, with multi-level caching, and lots of memory) hugely affect the time used by different implementations.

For example, you might compare the speeds of a quicksort and a radix sort. Depending on the size of the keys, the radix sort requires rather large extra arrays (and uses a lot of cache). If the real application the sort routine is used in does not simultaneously use a lot of other memory (and thus the sorted data is basically what is cached), then a radix sort will be faster if there is enough data (and the implementation is sane). However, if the application is multithreaded, and the other threads shuffle (copy or transfer) a lot of memory around, then the radix sort using a lot of cache will evict other data also cached; even though the radix sort function itself does not show any serious slowdown, it may slow down the other threads and therefore the overall program, because the other threads have to wait for their data to be re-cached.

This means that the only "benchmarks" you should trust are wall clock measurements on the actual hardware, running actual work tasks with actual work data. Everything else is subject to many conditions, and is more or less suspect: an indication, yes, but not very reliable.

  • `rdtsc` doesn't count core clock cycles, it counts *reference* cycles (on CPUs from the last decade, at least). See [CPU TSC fetch operation especially in multicore-multi-processor environment](https://stackoverflow.com/q/10921210), and also the links in my attempt at a canonical Q&A on using `__rdtsc` instead of inline asm on all the major compilers. And stuff about using it, e.g. using `_mm_lfence()` to stop out-of-order execution from reordering `rdtsc`, important for very short intervals. – Peter Cordes Aug 18 '18 at 11:54
  • @PeterCordes: True; that means my comment to the question at hand is incorrect (please add a comment there to point that out?), and really warrants a clarifying paragraph at the beginning of this answer. (I'd like to verify AMD CPUs too (I believe they have the same behaviour, but would like some AMD references); note `cpufreq-set` use to avoid frequency scaling; and add `_mm_lfence()`s, too.) Where is that Q&A? Can't see it there; would love to read that. – Nominal Animal Aug 19 '18 at 03:02
  • Oops, left out the link. [Get CPU cycle count?](https://stackoverflow.com/a/51907627). Re: LFENCE on AMD CPUs, see [Is LFENCE serializing on AMD processors?](https://stackoverflow.com/q/51844886). Oh, you meant sync between cores on AMD. That I don't know. – Peter Cordes Aug 19 '18 at 04:06
  • (@PeterCordes: No, LFENCE is clear. I meant whether the RDTSC peculiarities affect AMD as well as Intel; all RDTSC-related references I've seen thus far reference Intel only.) – Nominal Animal Aug 19 '18 at 04:08
  • I'm almost certain that modern AMD (like Bulldozer at least) has constant-frequency / nonstop TSC, making it a usable timesource for `clock_gettime()` or whatever. I don't know about invariant across cores in a socket, but I'd guess so. – Peter Cordes Aug 19 '18 at 04:10
  • @PeterCordes: I'm similarly almost certain about that, I just would like to find some AMD references to verify. As you know, and as you can see above, I'm often wrong, but whenever I find I am, I do try hard to correct it. And, because it's not easy on one's ego, I try to make it more positive by learning more about the issue; to know I got at least a small bit better. – Nominal Animal Aug 19 '18 at 04:15
  • Heh, yeah I feel the same way. Until you mentioned it, I hadn't even realized I'd just been assuming this all applied to AMD as well. A Linux `/proc/cpuinfo` dump from a Ryzen (https://gist.github.com/hiroi10/70c972730e4057946f36544b0547e467) says it has `constant_tsc`, `nonstop_tsc`, and `tsc_scale`. Compared to that, my Skylake is missing `tsc_scale`, but has `tsc_adjust`, `tsc_known_freq`, and `tsc_deadline_timer`. (Maybe it's just a newer kernel reporting more things in flags, though.) Neither mentions invariant tsc. – Peter Cordes Aug 19 '18 at 04:24
  • Found [a Piledriver FX-8350](https://ubuntuforums.org/showthread.php?t=2204498): it has constant_tsc and nonstop_tsc. – Peter Cordes Aug 19 '18 at 04:29
  • @PeterCordes: Edited my answer. If you find any parts you disagree with, do let me know. I decided that rather than delve into the details you have already provided in your answer, I just link to yours, and keep to what I believe is the underlying problem here that caused the OP to pose this question, and look at it as if the question were *"how to microbenchmark code in Linux properly"*. That is how the answer originally approached the question, and since the OP accepted it as such, I assume it was useful. – Nominal Animal Aug 19 '18 at 05:47