I am familiar with two approaches, but both of them have their limitations.

The first one is to use the instruction RDTSC. However, the problem is that it doesn't count the number of cycles of my program in isolation and is therefore sensitive to noise due to concurrent processes.

The second option is to use the clock library function. I thought this approach would be reliable, since I expected it to count cycles for my program only (which is what I want to achieve). However, it turns out that in my case it measures elapsed time and then multiplies it by CLOCKS_PER_SEC. This is not only unreliable but also wrong, since CLOCKS_PER_SEC is set to 1,000,000, which does not correspond to the actual frequency of my processor.

Given the limitations of these approaches, is there a better and more reliable alternative that produces consistent results?

Ivaylo Toskov
  • If on Linux, check the perf command and the perf library. – xvan Mar 10 '16 at 18:03
  • Much like rdtsc, the number of cycles doesn't mean much anymore either. Not on a processor core that can speculatively execute instructions. There's a decent processor counter available, the number of retired instructions. A good profiler can show you that number. – Hans Passant Mar 10 '16 at 18:03
  • Not a recommendation, since I haven't used it, but I have heard good things about [Intel Performance Counter Monitor](https://software.intel.com/en-us/articles/intel-performance-counter-monitor). – jxh Mar 10 '16 at 18:05

3 Answers

A lot here depends on how large an amount of time you're trying to measure.

RDTSC can be (almost) 100% reliable when used correctly. It is, however, of use primarily for measuring truly microscopic pieces of code. If you want to measure two sequences of, say, a few dozen or so instructions apiece, there's probably nothing else that can do the job nearly as well.

Using it correctly is somewhat challenging though. Generally speaking, to get good measurements you want to do at least the following:

  1. Set the code to only run on one specific core.
  2. Set the code to execute at maximum priority so nothing preempts it.
  3. Use CPUID liberally to ensure serialization where needed.
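
The three steps above can be sketched roughly as follows. This is only a minimal sketch, assuming Linux, an x86 CPU, and GCC or Clang. The helper names (`pin_to_core`, `raise_priority`, `serialized_tsc`, `min_ticks`, `tiny_work`) are hypothetical, and keeping the minimum over several runs is an added assumption rather than part of the list.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* needed for cpu_set_t and sched_setaffinity */
#endif
#include <assert.h>
#include <sched.h>
#include <stdint.h>

/* Step 1: pin the calling thread to one core. Returns 0 on success. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Step 2: request the highest real-time priority. This usually needs
 * root; it returns -1 otherwise and the measurement is simply noisier. */
static int raise_priority(void)
{
    struct sched_param sp;
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}

/* Step 3: CPUID is a serializing instruction, so earlier instructions
 * complete before RDTSC reads the time stamp counter. */
static inline uint64_t serialized_tsc(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    uint32_t lo, hi;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical workload: a handful of instructions to time. */
static void tiny_work(void)
{
    volatile int x = 0;
    for (int i = 0; i < 16; i++)
        x += i;
}

/* Times fn() `runs` times and returns the smallest tick count seen
 * (assumption: the smallest value is the least disturbed run). */
static uint64_t min_ticks(void (*fn)(void), int runs)
{
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = serialized_tsc();
        fn();
        uint64_t end = serialized_tsc();
        if (end - start < best)
            best = end - start;
    }
    return best;
}
```

From `main`, one might call `pin_to_core(0)` and `raise_priority()` once, then `min_ticks(tiny_work, 1000)`. The result includes the cost of the CPUID/RDTSC pair itself, so it is most useful when compared against the same measurement of an empty function.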

If, on the other hand, you're trying to measure something that takes anywhere from, say, 100 ms on up, RDTSC is pointless. It's like trying to measure the distance between cities with a micrometer. For this, it's generally best to ensure that the code in question takes (at least) the better part of a second or so. clock isn't particularly precise, but for a length of time on this general order, the fact that it might only be accurate to, say, 10 ms or so, is more or less irrelevant.
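
For the longer-running case, a minimal sketch of timing with `clock` might look like this. The workload `busy_work` and the wrapper `cpu_seconds` are hypothetical names used only for illustration; the one fixed point is that the difference of two `clock()` values is divided by `CLOCKS_PER_SEC` to get seconds.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical workload: sums 0..iterations-1 so there is something to time. */
static unsigned long long busy_work(unsigned long long iterations)
{
    volatile unsigned long long sum = 0;
    for (unsigned long long i = 0; i < iterations; i++)
        sum += i;
    return sum;
}

/* Returns the processor time used by busy_work(iterations), in seconds,
 * or -1.0 if clock() reports that processor time is unavailable. */
static double cpu_seconds(unsigned long long iterations)
{
    assert(iterations > 0);

    clock_t start = clock();
    if (start == (clock_t)-1)
        return -1.0;

    busy_work(iterations);

    clock_t end = clock();
    if (end == (clock_t)-1)
        return -1.0;

    /* clock() counts in units of 1/CLOCKS_PER_SEC seconds, not CPU cycles. */
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Choosing `iterations` so that the call takes the better part of a second keeps the coarse granularity of `clock` small relative to the measured interval.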

Jerry Coffin
  • Thank you very much for your answer. Is it correct to assume a rule of thumb: less than 100 ms - `RDTSC`, `clock` otherwise? – Ivaylo Toskov Mar 11 '16 at 13:46
  • @IvayloToskov: There's kind of a "dead zone" in the middle. `RDTSC` is only good for things up to a few microseconds or so. `clock` is usually only good for at least tens of milliseconds. Between those, you typically need some third option (performance counter monitor, the motherboard 1.024 MHz clock, etc.) – Jerry Coffin Mar 11 '16 at 13:52
  • Don't forget to mention that `rdtsc` / `rdtscp` measure reference cycles, i.e. wall-clock time, not actual CPU cycles. So make sure you have a warm-up loop to give the OS time to ramp the CPU speed up to max turbo before running the code you want to measure. Also note that "turbo" clock speed is faster than the rdtsc counter frequency. e.g. an i5-2500k is rated for 3.3GHz sustained, and that's the rdtsc frequency. It can turbo up to 3.7GHz (out of the box without overclocking), so at max turbo your rdtsc cycle counts will be slightly lower than the actual number of clock cycles. – Peter Cordes Mar 13 '16 at 11:50
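
For the middle range discussed in the comments above, one possible option that the thread does not name is the POSIX `clock_gettime` call with `CLOCK_PROCESS_CPUTIME_ID`, which reports CPU time consumed by the calling process with nanosecond units. The sketch below is an assumption about how that could be combined with the suggested warm-up loop; `process_cpu_ns`, `cpu_ns_after_warmup`, and `spin` are hypothetical names, and the actual resolution depends on the system.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict -std=c11 */
#endif
#include <stdint.h>
#include <time.h>

/* CPU time consumed by this process so far, in nanoseconds; -1 on failure. */
static int64_t process_cpu_ns(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical workload to measure. */
static void spin(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        x += i;
}

/* Runs fn() warmup_runs times untimed (to give the CPU a chance to ramp
 * up its clock speed), then times one more call. Returns -1 on failure. */
static int64_t cpu_ns_after_warmup(void (*fn)(void), int warmup_runs)
{
    for (int i = 0; i < warmup_runs; i++)
        fn();

    int64_t start = process_cpu_ns();
    fn();
    int64_t end = process_cpu_ns();

    if (start < 0 || end < 0)
        return -1;
    return end - start;
}
```

A call such as `cpu_ns_after_warmup(spin, 3)` returns time rather than cycles, so it still depends on the clock speed the CPU happens to be running at.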

Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES

This system call has explicit controls for:

  • process PID selection
  • whether to consider kernel/hypervisor instructions or not

and it will therefore count the cycles properly even when multiple processes are running concurrently.

See this answer for more details: How to get the CPU cycle count in x86_64 from C++?

perf_event_open.c

#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <inttypes.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                    group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    uint64_t n;
    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

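    /* Loop n times, should be good enough for -O0. The volatile qualifier
     * keeps an optimizing compiler from deleting the loop, since n is
     * never read afterwards. */
    __asm__ volatile (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        : "cc"
    );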

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
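    /* A short or failed read would leave count uninitialized. */
    if (read(fd, &count, sizeof(count)) != sizeof(count)) {
        perror("read");
        exit(EXIT_FAILURE);
    }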

    printf("%lld\n", count);

    close(fd);
}
Ciro Santilli OurBigBook.com

RDTSC is the most accurate way of counting program execution cycles. If you are looking to measure execution performance over time scales where it matters if your thread has been preempted, then you would probably be better served with a profiler (VTune, for instance).

Using clock() with CLOCKS_PER_SEC is a poor (high-overhead) way of getting the time compared to RDTSC, which has almost no overhead.

If you have a specific issue with RDTSC, I may be able to assist.


re: Comments

Intel Performance Counter Monitor: This is mainly for measuring metrics outside of the processor, such as memory bandwidth, power usage, and PCIe utilization. It does also happen to measure CPU frequency, but it is typically not useful for processor-bound application performance.

RDTSC portability: RDTSC is an Intel CPU instruction supported by all modern Intel CPUs. On modern CPUs it is based on the uncore frequency of your CPU and is somewhat similar across CPU cores, although it is not appropriate if your application is frequently being preempted to different cores (and especially to different sockets). If that is the case, you really want to look at a profiler.

Out-of-order execution: Yes, things get executed out of order, so this can affect performance slightly, but it still takes time to execute instructions, and RDTSC is the best way of measuring that time. It excels in the normal use case of executing non-I/O-bound instructions on the same core, and this is really how it is meant to be used. If you have a more complicated use case you really should be using a different tool, but that doesn't negate that rdtsc() can be very useful in analyzing program execution.

Clarus
  • [RDTSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) has issues with portability, synchronization between cores, and out-of-order execution of operations. It seems that on a single-core, older system this would be true, but not necessarily on newer systems. – callyalater Mar 10 '16 at 18:23
  • Are you kidding? Perf counters are great for CPU-bound microbenchmarks. You can count fused-domain uops (issue/retirement), unfused-domain uops (dispatch), and even uops per port, to see if you're actually bottlenecking on the thing you expect from looking at [Agner Fog's instruction tables](http://agner.org/optimize/). There are of course perf counters for measuring cache misses, branch mispredicts, and other events, so they're also great for investigating memory-bound code. – Peter Cordes Mar 13 '16 at 11:53
  • @Peter: I'd argue that if that is the level of detail you need, you are better off with a profiler, like VTune. I'd also argue that time is usually the most important performance metric, and that you are needlessly complicating things. – Clarus Mar 15 '16 at 19:02
  • @Clarus: VTune is a nice front-end for collecting data with perf counters. Measuring how many cycles a CPU-bound loop takes per iteration is not hard with perf counters. If you're measuring something that's not dependent on memory latency, then using perf counters takes turbo out of the picture. e.g. see http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes. You get consistent numbers regardless of what clock speed the CPU is at. – Peter Cordes Mar 16 '16 at 00:03