3
 __inline__ uint64_t rdtsc() {
    uint32_t low, high;
    __asm__ __volatile__ (
        "xorl %%eax,%%eax \n    cpuid"
        ::: "%rax", "%rbx", "%rcx", "%rdx" );
    __asm__ __volatile__ (
                          "rdtsc" : "=a" (low), "=d" (high));
    return (uint64_t)high << 32 | low;
}

I am using the rdtsc function above as a timer in my program. The following code reports between 312 and 344 clock cycles:

 start = rdtsc();
 stop = rdtsc();

 elapsed_ticks = (unsigned)((stop-start));
 printf("\n%u ticks\n",elapsed_ticks);

Every time I run the above code I get a different value. Why is that?

I ran the same code in Visual C++, which provides an rdtsc function in "intrin.h", and got a constant value of 18 clocks. Yes, it was constant on every run! Can someone please explain? Thanks!

semantic_c0d3r
  • You don't need inline asm. [Get CPU cycle count?](https://stackoverflow.com/a/51907627) has intrinsics, and some details about the caveats. – Peter Cordes Aug 18 '18 at 11:35

2 Answers

7

It's quite difficult to get reliable timestamps using the TSC. The main problems are:

  • on older multi-core processors, the TSC rate could differ between cores, because each core scaled its clock speed according to its own load;
  • on more recent processors, the TSC rate remains constant while the clock speed changes, so code running on a lightly loaded (and therefore down-clocked) core may appear to take more ticks than the core clock cycles it actually used;
  • out-of-order execution may mean that the register isn't read when you think it is.

Your function executes the cpuid instruction, ignoring its result, before reading the TSC, to try to mitigate the last issue. cpuid is a serialising instruction, which forces in-order execution. However, it is also a rather slow instruction, so it will affect the result if you try to measure an extremely short time.

If I remove that instruction from the function to make it equivalent to the intrinsic you're using in VC++:

inline uint64_t rdtsc() {
    uint32_t low, high;
    asm volatile ("rdtsc" : "=a" (low), "=d" (high));
    return (uint64_t)high << 32 | low;
}

then I get more consistent values, but reintroduce the potential instruction-ordering issue.
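One possible middle ground, sketched here as an assumption rather than something measured for this answer, is to fence the read with lfence instead of cpuid. lfence is generally much cheaper than cpuid, although it is not a fully serialising instruction, so treat the ordering it gives as "good enough for many measurements" rather than a guarantee. The sketch assumes GCC or Clang on x86-64; `rdtsc_fenced` is a hypothetical helper name, not part of any library.

```c
#include <stdint.h>
#include <x86intrin.h>   /* GCC/Clang: __rdtsc() and _mm_lfence() */

/* Hypothetical helper: read the TSC with an lfence on each side,
   so the read is less likely to be reordered with nearby instructions. */
static inline uint64_t rdtsc_fenced(void) {
    _mm_lfence();             /* let earlier instructions finish first */
    uint64_t ticks = __rdtsc();
    _mm_lfence();             /* hold back later instructions until the read is done */
    return ticks;
}
```

It would be used the same way as the original function: call `rdtsc_fenced()` before and after the code being timed and subtract the two values.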

Also, make sure you're compiling with optimisation (e.g. -O3 if you're using GCC), otherwise the function may not be inlined.

Mike Seymour
  • thanks for the reply, what optimizations should I apply at compile time? – semantic_c0d3r Oct 16 '13 at 09:37
  • @sanjay_c0d3r: Assuming you're using GCC, then `-O3` will enable (more or less) all the useful optimisations. But I've just done a bit more research to find out why that instruction was there, and it sounds like you'll need to be very careful if you want accurate results from the TSC register. See my updated answer. – Mike Seymour Oct 16 '13 at 09:42
0

Because your process is not the only one running on the system. It may be preempted at any time, causing your process to go to sleep for a little while.
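A common way to reduce the influence of preemption and other interruptions is to repeat the measurement many times and keep the smallest value, since an interrupted run can only make the result larger. The following is a minimal sketch assuming GCC or Clang on x86-64; `min_rdtsc_overhead` is a hypothetical helper name, and it uses the `__rdtsc()` intrinsic rather than the inline assembly from the question.

```c
#include <stdint.h>
#include <x86intrin.h>   /* GCC/Clang: __rdtsc() */

/* Hypothetical helper: smallest gap observed between two
   back-to-back TSC reads over `runs` attempts.
   Returns UINT64_MAX if runs <= 0. */
static uint64_t min_rdtsc_overhead(int runs) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; ++i) {
        uint64_t start = __rdtsc();
        uint64_t stop  = __rdtsc();
        uint64_t elapsed = stop - start;
        if (elapsed < best)
            best = elapsed;   /* keep the least-disturbed run */
    }
    return best;
}
```

Calling `min_rdtsc_overhead(1000)` gives an estimate of the timer's own overhead, which can then be subtracted from other measurements taken the same way.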

Some programmer dude
  • I'd think that the process being preempted would take a lot more than 32 clocks (344-312). – interjay Oct 16 '13 at 09:17
  • How do you explain the same for the rdtsc function in the "intrin.h" library? Why is it always constant? Also, is there a way to run a program on Linux with preemption disabled? – semantic_c0d3r Oct 16 '13 at 09:25
  • @sanjay_c0d3r Are you saying that when you include intrin.h and use its rdtsc macro/function, everything is as expected? Then the assembly you've written probably has some problems. – nos Oct 16 '13 at 09:29
  • 1
    @sanjay_c0d3r: Even if you think it works, you shouldn't use rdtsc on modern systems. On Windows, use QueryPerformanceCounter. – SigTerm Oct 16 '13 at 09:31