
I want to get the CPU cycles at a specific point. I use this function at that point:

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    // broken for 64-bit builds; don't copy this code
    return x;
}

(editor's note: "=A" is wrong for x86-64; it picks either RDX or RAX. Only in 32-bit mode will it pick the EDX:EAX output you want. See How to get the CPU cycle count in x86_64 from C++?.)
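
(A minimal corrected sketch, for reference: pin the two 32-bit halves to EAX and EDX explicitly, so the same code is right in both 32- and 64-bit builds. The function name is illustrative.)

    #include <stdint.h>

    static inline uint64_t rdtsc_fixed(void)
    {
        uint32_t lo, hi;
        /* "=a"/"=d" force EAX/EDX in both modes, unlike "=A",
           which degenerates to a single register in 64-bit code. */
        __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }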

The problem is that it always returns an increasing number (in every run), as if it were referring to absolute time.

Am I using the function incorrectly?

user1106106
  • What do you expect? You could also use `clock` or `clock_gettime`? What is that for? See also http://stackoverflow.com/questions/8586354/linux-time-command-microseconds-or-better-accuracy/8587043#8587043 – Basile Starynkevitch Dec 22 '11 at 10:26
  • Yes, it is referring to the absolute number of CPU cycles. – Gunther Piez Dec 22 '11 at 11:35
  • http://en.wikipedia.org/wiki/Time_Stamp_Counter – Necrolis Dec 22 '11 at 11:41
  • Side note: beware that this function may end up returning only the low 32 bits (i.e. the EAX register), which wrap around every 2^32 cycles. That is on the order of a few seconds on modern CPUs; if your code happens to be in the middle of a wrap-around, you will get erroneous results. – Ayberk Özgür Nov 02 '14 at 11:24

3 Answers


As long as your thread stays on the same CPU core, the RDTSC instruction will keep returning an increasing number until it wraps around. For a 2GHz CPU, this happens after 292 years, so it is not a real issue. You probably won't see it happen. If you expect to live that long, make sure your computer reboots, say, every 50 years.

The problem with RDTSC is that you have no guarantee that it starts at the same point in time on all cores of an elderly multi-core CPU, and no guarantee that it stays in step on all CPUs of an elderly multi-CPU board.
Modern systems usually do not have such problems, but on older systems the problem can be worked around by setting a thread's affinity so it only runs on one CPU. This is not good for application performance, so one should not generally do it, but for measuring ticks, it's just fine.
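For illustration, a Linux-specific sketch of that workaround, pinning the calling thread to core 0 with `sched_setaffinity` (the helper name is made up; other OSes need a different call):

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling thread to CPU 0 so every RDTSC reads the
       same core's counter. Returns 0 on success, -1 on error. */
    static int pin_to_cpu0(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                 /* allow only CPU 0 */
        return sched_setaffinity(0, sizeof(set), &set);
    }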

(Another "problem" is that many people use RDTSC for measuring time, which is not what it does, but you wrote that you want CPU cycles, so that is fine. If you do use RDTSC to measure time, you may have surprises when power saving, or Turbo Boost, or whatever the multitude of frequency-changing techniques are called, kicks in. For actual time, the `clock_gettime` syscall is surprisingly good under Linux.)

I would just write rdtsc inside the asm statement, which works just fine for me and is more readable than some obscure hex code. Assuming it's the correct hex code (and since it doesn't crash and does return an ever-increasing number, it seems so), your code is good.
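For what it's worth, compilers have long shipped an intrinsic for this as well, so no inline asm is needed at all; a one-line sketch, assuming GCC or clang:

    #include <x86intrin.h>   /* MSVC has __rdtsc() in <intrin.h> instead */

    unsigned long long t = __rdtsc();   /* full 64-bit TSC, correct in 32- and 64-bit builds */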

If you want to measure the number of ticks a piece of code takes, you want a tick difference; just subtract two values of the ever-increasing counter, something like `uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;`
Note that if very accurate measurements isolated from surrounding code are necessary, you need to serialize, that is, stall the pipeline, prior to calling rdtsc (or use rdtscp, which is only supported on newer processors). The one serializing instruction that can be used at every privilege level is cpuid.
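A sketch of that measurement pattern, assuming GCC/clang inline asm on x86-64 (the function name is illustrative):

    #include <stdint.h>

    /* cpuid is a serializing instruction: nothing issued before it can
       still be in flight when rdtsc executes. cpuid clobbers
       eax/ebx/ecx/edx, hence the extra constraints. */
    static inline uint64_t rdtsc_serialized(void)
    {
        uint32_t lo, hi;
        __asm__ volatile ("cpuid\n\t"
                          "rdtsc"
                          : "=a" (lo), "=d" (hi)
                          : "a" (0)
                          : "rbx", "rcx");
        return ((uint64_t)hi << 32) | lo;
    }

    /* usage:
         uint64_t t0 = rdtsc_serialized();
         ...code under test...
         uint64_t ticks = rdtsc_serialized() - t0;   */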

In reply to the further question in the comment:

The TSC starts at zero when you turn on the computer (and the BIOS resets all counters on all CPUs to the same value, though some BIOSes a few years ago did not do so reliably).

Thus, from your program's point of view, the counter started "some unknown time in the past", and it always increases with every clock tick the CPU sees. Therefore, if you execute the instruction that returns the counter now, and again any time later, even in a different process, it will return a greater value (unless the CPU was suspended or turned off in between). Different runs of the same program get bigger numbers, because the counter keeps growing. Always.

Now, `clock_gettime(CLOCK_PROCESS_CPUTIME_ID)` is a different matter. This is the CPU time that the OS has given to the process, and it starts at zero when your process starts. A new process starts at zero, too. Thus, two processes running one after the other will get very similar or identical numbers, not ever-growing ones.

`clock_gettime(CLOCK_MONOTONIC_RAW)` is closer to how RDTSC works (and on some older systems it is implemented with it). It returns a value that always increases; nowadays it is typically backed by an HPET. However, this is really time and not ticks: if your computer goes into a low-power state (e.g. running at half the normal frequency), it will still advance at the same pace.
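
A small sketch contrasting the two clocks on Linux (older glibc may need `-lrt` when linking):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec cpu, mono;
        /* per-process CPU time: starts at zero for every process */
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu);
        /* raw monotonic time: keeps growing since boot, like the TSC */
        clock_gettime(CLOCK_MONOTONIC_RAW, &mono);
        printf("process CPU time: %ld.%09ld s\n", (long)cpu.tv_sec, cpu.tv_nsec);
        printf("monotonic raw:    %ld.%09ld s\n", (long)mono.tv_sec, mono.tv_nsec);
        return 0;
    }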

Damon
  • Thank you for the quick reply. I don't see why I would get increasing numbers. Let's say some program calls my function (that measures CPU ticks up to that point) always at the same place (say, the 5th line of the main function). So every time he runs his program, my function should give the same number (more or less), and not an increasing number... – user1106106 Dec 22 '11 at 10:45
  • It is an increasing number, since it is from a counter probably started at power-on or reboot time. – Basile Starynkevitch Dec 22 '11 at 10:52
  • And another thing: if I use `clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts); return ts.tv_nsec;` I don't get an increasing number, but almost the same number every run. – user1106106 Dec 22 '11 at 10:54
  • @user1106106: that's because `RDTSC` is CPU-wide, while `clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts)` is only at the process level, aka `RDTSC` starts from power-up, `gettime(..)` starts from the process start. – Necrolis Dec 22 '11 at 11:39
  • `clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts)` gets me time in microsecond resolution, and not in cycles. How do I convert it to cycles? – user1106106 Dec 22 '11 at 13:20
  • `struct timespec` has __nano__second, not microsecond resolution (a 1:1000 difference). Though of course, for a variety of reasons, a timer might not run at the full resolution available in the value; `clock_getres` tells you that. For example, HPET is required by the spec to provide at least 0.1 µs (or better). Some implementations provide nanoseconds, others don't, and they don't have to. The number you get is still __nano__, however. To get clock cycles from time, you need to multiply by the clock speed. But if it is _really_ clocks you want, use RDTSC in the first place. – Damon Dec 22 '11 at 13:27
  • **On modern CPUs, RDTSC *does* measure time**, in reference cycles. On CPUs where the CPUID includes tsc_invariant and nonstop_tsc, the `gettimeofday` system call *is* implemented in user-space (VDSO page) in terms of RDTSC (and so is `clock_gettime` for some clk_id values, I assume). CPU manufacturers decided that having a very-low-overhead timesource was more valuable than having RDTSC as a benchmarking tool, so they changed it, and you will have problems on CPUs from ~2005(?) and later if you *want* to measure cycles with it. But you can use performance counters for that. – Peter Cordes Oct 21 '16 at 20:45
  • @PeterCordes: That's all nice and well, however... my Skylake-gen CPU (which I would consider "modern") _definitely does not_ measure time with RDTSC, nor are cores synchronized. The same code returns the same number of "ticks" regardless of performance level (once with, and once without "warming up" the CPU), that is, at higher performance levels the ticks must be shorter. Also, I have experienced "time travel" artefacts (allegedly fixed in ~2005, too, but definitely present now). – Damon Oct 22 '16 at 10:25
  • Hmm, that's surprising. I always just use perf counters, not RDTSC. Is the thing you're testing bottlenecked on RAM? That would explain taking the same amount of real time regardless of CPU frequency, since only L2 and L1 caches scale with core clock speed, not L3 or RAM. Otherwise IDK. [This recent article](https://www.lmax.com/blog/staff-blogs/2015/10/25/time-stamp-counters/) mentions SKL without mentioning any differences for it, and goes into detail about using the TSC as a timesource and working out the conversion from ticks to nanosecs. – Peter Cordes Oct 22 '16 at 11:03
  • Skew between cores is probably from Linux adjusting the TSC (maybe to keep the local clock in sync with an NTP server)? – Peter Cordes Oct 22 '16 at 11:04
  • This helped me to clear up the confusion from the comments above: https://stackoverflow.com/a/11060619/5242207 – Robin F. Aug 07 '18 at 11:00

There's lots of confusing and/or wrong information about the TSC out there, so I thought I'd try to clear some of it up.

When Intel first introduced the TSC (in the original Pentium CPUs) it was clearly documented to count cycles (and not time). However, back then CPUs mostly ran at a fixed frequency, so some people ignored the documented behaviour and used it to measure time instead (most notably, Linux kernel developers). Their code broke on later CPUs that don't run at a fixed frequency (due to power management, etc.). Around that time other CPU manufacturers (AMD, Cyrix, Transmeta, etc.) were confused: some implemented the TSC to measure cycles, some implemented it so it measured time, and some made it configurable (via an MSR).

Then "multi-chip" systems became more common for servers; and even later multi-core was introduced. This led to minor differences between TSC values on different cores (due to different startup times); but more importantly it also led to major differences between TSC values on different CPUs caused by CPUs running at different speeds (due to power management and/or other factors).

People that were trying to use it wrong from the start (people who used it to measure time and not cycles) complained a lot, and eventually convinced CPU manufacturers to standardise on making the TSC measure time and not cycles.

Of course this was a mess - e.g. it takes a lot of code just to determine what the TSC actually measures if you support all 80x86 CPUs; and different power management technologies (including things like SpeedStep, but also things like sleep states) may affect the TSC in different ways on different CPUs; so AMD introduced a "TSC invariant" flag in CPUID to tell the OS that the TSC can be used to measure time correctly.
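
That flag can be queried from user space; a minimal sketch, assuming GCC/clang's `<cpuid.h>` helper (leaf 0x80000007, EDX bit 8 is the "invariant TSC" bit on both Intel and AMD):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        /* __get_cpuid returns 0 if the leaf is unsupported */
        if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
            puts("invariant TSC: counts at a constant rate regardless of P-/C-states");
        else
            puts("no invariant TSC (or leaf not supported)");
        return 0;
    }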

All recent Intel and AMD CPUs have been like this for a while now - the TSC counts time and doesn't measure cycles at all. This means that if you want to measure cycles, you have to use the (model-specific) performance monitoring counters. Unfortunately, the performance monitoring counters are an even worse mess (due to their model-specific nature and convoluted configuration).
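
On Linux the usual way to reach those counters is `perf_event_open` (it has no glibc wrapper, so it is invoked via `syscall`); a hedged sketch counting actual core cycles:

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* core cycles, not reference cycles */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd == -1) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ...code under test... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t cycles;
        read(fd, &cycles, sizeof(cycles));
        printf("%llu core cycles\n", (unsigned long long)cycles);
        return 0;
    }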

Brendan
  • You **can** use it to measure cycles. Just make sure you run the CPU at 100% by loading it with work. – Johan Mar 05 '16 at 22:15
  • The funny thing is that it counts time in "reference cycles", and runs at the CPU's *rated* clock speed (i.e. if it's sold as a 2.4GHz CPU, that's the RDTSC count frequency). To measure core clock cycles, use performance counters to measure `unhalted_core_cycles` or something. – Peter Cordes Oct 21 '16 at 20:35

Good answers already, and Damon already mentioned this in a way in his answer, but I'll add this from the actual x86 manual (Volume 2, 4-301) entry for RDTSC:

Loads the current value of the processor's time-stamp counter (a 64-bit MSR) into the EDX:EAX registers. The EDX register is loaded with the high-order 32 bits of the MSR and the EAX register is loaded with the low-order 32 bits. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.)

The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. See "Time Stamp Counter" in Chapter 17 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, for specific details of the time stamp counter behavior.

galois