
The following piece of code was given to us by our instructor so we could measure the performance of some algorithms:

#include <stdio.h>
#include <unistd.h>

static unsigned cyc_hi = 0, cyc_lo = 0;

static void access_counter(unsigned *hi, unsigned *lo) {
    asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
    : "=r" (*hi), "=r" (*lo)
    : /* No input */
    : "%edx", "%eax");
}

void start_counter() {
    access_counter(&cyc_hi, &cyc_lo);
}

double get_counter() {
    unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
    double result;

    access_counter(&ncyc_hi, &ncyc_lo);

    lo = ncyc_lo - cyc_lo;
    borrow = lo > ncyc_lo;
    hi = ncyc_hi - cyc_hi - borrow;

    result = (double) hi * (1 << 30) * 4 + lo;

    return result;
}

However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:

int main(void)
{
    double c1, c2;

    start_counter();

    c1 = get_counter();
    sleep(1);
    c2 = get_counter();

    printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
    printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);

    return 0;
}

The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as guest on VMware.

On a friend's machine (a MacBook) it works to some extent; that is, the result is greater than 0, but it varies because the CPU frequency isn't fixed (we tried to fix it but for some reason we weren't able to). He has another machine running Linux (Ubuntu) as the host, and it also reports 0. This rules out the virtual machine being the problem, which is what I thought at first.

Any ideas why this is happening and how I can fix it?

rfgamaral
  • Though the question was (a bit) different, most of my answer at: http://stackoverflow.com/questions/2658699/measure-time-to-execute-single-instruction/2658833#2658833 applies here. – Jerry Coffin May 11 '10 at 21:32
  • @Jerry Coffin I can't see how your answer on that question helps me. But then again, I didn't understand most of what you wrote lol. – rfgamaral May 11 '10 at 21:36
  • @Tim Post: His problem is not (at least exclusively) VMWare -- it's that RDTSC can be executed out of order, so without executing a serializing instruction (typically `CPUID`) it produces nearly meaningless results. – Jerry Coffin May 11 '10 at 21:38
  • @Jerry Coffin However, if I use `clock()` from `time.h` instead of `get_counter()` I still get a frequency of 0. So, my issue is a little bit different from what you are talking about (I think). – rfgamaral May 11 '10 at 21:42
  • Write a small loop and repeatedly print the output of get_counter(). Make sure it's actually counting. – ajs410 May 11 '10 at 21:48
  • @Jerry Coffin - Perhaps another answer might help. I too am having a very difficult time explaining why the given (non-serialized) code is working consistently on my 32-bit desktop. I just triple-checked the ASM output of the above code, and no 'magic' serialization code was inserted by gcc. – Tim Post May 11 '10 at 21:49
  • The original code already *was* portable to machines with different CPU frequencies - since it counts *cycles*, it should give the same answer for the same algorithm executed on the same architecture at different clock speeds (modulo scheduling latencies). – caf May 11 '10 at 22:32
  • @caf Yes, the cycle count is portable but to measure execution time in seconds (or whatever), I need the CPU frequency and I was trying to calculate it using the code given. That's my real problem, calculating the CPU frequency of the machine where the code is running. – rfgamaral May 11 '10 at 22:50
  • If you want to measure wallclock time, don't measure cycles and try to convert to time; just directly measure time (eg with `gettimeofday()`, or `clock_gettime()` with either `CLOCK_MONOTONIC` or `CLOCK_PROCESS_CPUTIME_ID`). – caf May 12 '10 at 00:58
  • I'm using what our instructor recommended to us. `gettimeofday()` doesn't have enough precision and when I tested it, the `tv_sec` field of the structure was correct but the `tv_usec` wasn't; it gave me really weird results. Never heard of `clock_gettime()`, will look into that. But how do I justify using it instead of what was recommended? – rfgamaral May 12 '10 at 11:45
  • Measuring `clock_gettime()` with `CLOCK_MONOTONIC` before and after the operation and calculating the difference gives me a negative value sometimes. Why is that? Can't I use `CLOCK_REALTIME` instead? – rfgamaral May 12 '10 at 12:15
  • Dammit, `CLOCK_REALTIME` also gives me negative values. Either way, they are both giving me weird results. I'm doing 1000 iterations of my operations and dividing the result by 1000 (so I can measure an average); however, I'm getting between [0.003, 0.004] seconds (for tv_sec) and between [0.0003, 0.0006] seconds (for tv_nsec). How come nanoseconds has one more zero? This is not right... – rfgamaral May 12 '10 at 12:31

5 Answers


Okay, since the other answer wasn't helpful, I'll try to explain in more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:

rdtsc
push 1
call sleep
rdtsc

Modern CPUs do not necessarily execute instructions in their original order though. Despite your original order, the CPU is (mostly) free to execute that just like:

rdtsc
rdtsc
push 1
call sleep

In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) pick up roughly from there, covering the steps necessary to use CPUID correctly/effectively for this task.
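A rough sketch of what that serialization can look like in GNU C inline asm (this is an illustration, not the exact code from the linked answer; it assumes a GCC-style compiler on x86):

```c
#include <stdint.h>

/* Sketch: execute CPUID (a serializing instruction) before RDTSC so the
   timestamp read cannot be hoisted above earlier work.  CPUID with
   eax=0 also writes ebx and ecx, hence the clobber list. */
static uint64_t serialized_rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "xorl %%eax, %%eax\n\t"
        "cpuid\n\t"
        "rdtsc"
        : "=a" (lo), "=d" (hi)
        : /* no inputs */
        : "%ebx", "%ecx");
    return ((uint64_t)hi << 32) | lo;
}
```

The `__volatile__` qualifier also matters here: without it, two calls with identical (empty) inputs can be merged by the compiler.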

Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.

Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).

What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).

Jerry Coffin
  • I just dug into the VMWare time keeping spec, his code _should_ be working. From the docs, rdtsc _should_ work as expected, though I'm quite sure accuracy is better when paravirtualized. – Tim Post May 11 '10 at 21:55
  • Out of order execution will NOT account for his problem. Sleep is not a single instruction, and no amount of out of order execution will re-order both RDTSCs to be right next to each other. Even if it did, on an Intel CPU rdtsc increments on every clock cycle of the chip. It is impossible for two calls, even if consecutive, to return the same value. – SoapBox May 11 '10 at 22:03
  • I agree with SoapBox - after all, inside the `sleep()` function is a call into the kernel; there's a very large number of instructions hidden behind that `call`. – caf May 11 '10 at 22:30
  • @caf: if he was using a single-core processor, that would be true. With two (or more) cores, it's entirely possible for two executions of RDTSC to return exactly the same value, one from each core (though this would be rare and isn't likely the source of what he's seeing). – Jerry Coffin May 11 '10 at 22:57
  • @Nazgulled: if you're truly getting a difference of 0 (not just a number that seems smaller than reasonable), then I'd have to say it sounds like a problem being caused by the virtual machine, regardless of what their docs say. On a (properly functioning) single core, no two executions of RDTSC should return the same value (without using `WRMSR` to modify it in between). – Jerry Coffin May 11 '10 at 23:11
  • @Jerry Coffin - It took a little while to dig it up, but yes, an emulated RDTSC could be the cause. – Tim Post May 11 '10 at 23:43
  • My guess is that the problem lies somewhere on the `sleep()` call and not on RDTSC. I'm saying this because the difference is 0 only with `sleep()`. If I implement a loop that takes 1s to execute, I'll have a time difference of more than 0. – rfgamaral May 12 '10 at 00:59

I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do the 64-bit carry math yourself, and you don't need to store the result of that operation as a double. You don't need separate outputs in your inline asm; you can tell GCC to use eax and edx directly.

Here is a greatly simplified version of this code:

#include <stdint.h>

uint64_t rdtsc() {
    uint64_t ret;

# if __WORDSIZE == 64
    asm ("rdtsc; shl $32, %%rdx; or %%rdx, %%rax;"
        : "=A"(ret)
        : /* no input */
        : "%edx"
    );
#else
    asm ("rdtsc" 
        : "=A"(ret)
    );
#endif
    return ret;
}

Also, you should consider printing out the values you're getting from this so you can see whether you're getting 0s or something else.

SoapBox
  • I get an error compiling your code: `expected string literal before ')' token`. – rfgamaral May 11 '10 at 22:54
  • It compiles now, thanks for the example. However, I still get a frequency of zero. I don't think the problem is in how rdtsc is implemented. And since this doesn't fix my problem, I'd rather use the code provided by the instructor. – rfgamaral May 12 '10 at 00:56
  • Needs to be `volatile` or else the compiler can CSE two runs of the same inline asm statement with the same inputs and assume it will get the same result. (In this case the inputs are nonexistent). *That's* why @RicardoAmaral was probably getting zero. Also no reason to do the shift/or inside the asm, just use `"=a"` and `"=d"` outputs. Or better, use `__rdtsc()` from `<x86intrin.h>`, although that maybe didn't exist in 2010. See [Get CPU cycle count?](https://stackoverflow.com/a/51907627), mysticial's answer on the same question has working asm that doesn't suck. – Peter Cordes Aug 18 '18 at 12:53

As for VMWare, take a look at the time keeping spec (PDF Link), as well as this thread. Depending on the guest OS, TSC reads are either:

  • Passed directly to the real hardware (PV guest)
  • Counted only while the VM is executing on the host processor (Windows / etc)

Note the qualifier in #2: while the VM is executing on the host processor. The same would go for Xen as well, if I recall correctly. In essence, you can expect the code to work as expected on a paravirtualized guest. If emulated, it's entirely unreasonable to expect hardware-like consistency.

Tim Post

You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)

This explains why you're getting exactly zero: the compiler optimized end-start to 0 at compile time, through CSE (common-subexpression elimination).

See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and @Mysticial's answer there has working GNU C inline asm, which I'll quote here:

// prefer using the __rdtsc() intrinsic instead of inline asm at all.
uint64_t rdtsc(){
    unsigned int lo,hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

This works correctly and efficiently for 32 and 64-bit code.

Peter Cordes

hmmm I'm not positive but I suspect the problem may be inside this line:

result = (double) hi * (1 << 30) * 4 + lo;

I'm suspicious whether you can safely carry out such huge multiplications in an `unsigned`; isn't that often a 32-bit number? Just the fact that you couldn't safely multiply by 2^32 and had to append an extra `* 4` to the 2^30 already hints at this possibility. You might need to convert each sub-component, hi and lo, to a double (instead of a single cast at the very end) and do the multiplication using the two doubles.

eerok512
  • Order of evaluation is left-to-right. It's effectively `( ((double)hi) * (1<<30) ) * 4.0 + (double)lo`. But yeah, `((uint64_t)hi<<32) + lo` would be the normal thing. Or the usual trick would be `1ULL<<32` to left-shift a 64-bit integer. – Peter Cordes Aug 18 '18 at 12:56