
A while ago, I asked a question on stack overflow and was shown how to execute the rdtsc opcode in C++. I recently created a benchmark function using rdtsc as follows:

inline unsigned long long rdtsc() {
  unsigned int lo, hi;
  asm volatile (
     "cpuid \n"
     "rdtsc" 
   : "=a"(lo), "=d"(hi) /* outputs */
   : "a"(0)             /* inputs */
   : "%ebx", "%ecx");     /* clobbers*/
  return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

typedef uint64_t (*FuncOneInt)(uint32_t n);
/**
 * Time a function that takes an integer parameter and returns a 64-bit number.
 * Since this times in clock cycles, we don't have to run the function a huge
 * number of times and divide; we can literally count clocks.
 * Don't forget that everything takes time, including getting into and out of
 * the function.  You may want to time an empty function: the time for the
 * computation itself can be computed as the time of the function of interest
 * minus the time of the empty one.
 */
void clockBench(const char* msg, uint32_t n, FuncOneInt f) {
    uint64_t t0 = rdtsc();
    uint64_t r = f(n);
    uint64_t t1 = rdtsc();
    std::cout << msg << "n=" << n << "\telapsed=" << (t1-t0) << '\n';
}

I therefore assumed that if I benchmarked a function, I would get (roughly) the number of clock cycles it took to execute. I also assumed that if I wanted to remove the overhead of getting into and out of the function, I should benchmark an empty function and subtract its time from that of a function containing the desired code.

Here is a sample:

uint64_t empty(uint32_t n) {
    return 0;
}

uint64_t sum1Ton(uint32_t n) {
    uint64_t s = 0;
    for (int i = 1; i <= n; i++)
        s += i;
    return s;
}
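
(For reference, the timings below come from calls of roughly this shape. The actual driver isn't shown in the question, so this snippet is a reconstruction from the output labels, with the headers the snippets above rely on.)

#include <cstdint>
#include <iostream>

// Hypothetical driver, reconstructed from the output labels shown below;
// the original main() is not part of the question.
int main() {
    clockBench("empty ", 100, empty);
    clockBench("Sum 1 to ", 100, sum1Ton);
    return 0;
}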

The code is compiled using

g++ -g -O2

I could understand some error due to an interrupt or some other condition, but given that these routines are short and n is chosen to be small, I expected to see close to the real numbers. To my surprise, this is the output from two successive runs:

empty n=100 elapsed=438
Sum 1 to n=100  elapsed=887

empty n=100 elapsed=357
Sum 1 to n=100  elapsed=347

Consistently the empty function shows that it takes way more than it should.

After all, there are only a few instructions involved in getting in and out of the function. The real work is done in the loop. Never mind the fact that the variance is huge. In the second run, the empty function claims to be taking 357 clock cycles and the sum takes less, which is ridiculous.

What is happening?

Dov
  • What compiler options did you use? The loop should optimize into a multiply (Gauss's formula) with optimization enabled, for gcc and/or clang. Are you only ever timing the first call to this function, where there's probably an I-cache miss? It's probably in the same line as the sum function. Also, you aren't doing anything like `lfence` *after* `rdtsc` to stop the "work" from running before the clock starts. See [clflush to invalidate cache line via C function](https://stackoverflow.com/a/51830976) for an example. – Peter Cordes Sep 23 '18 at 12:15
  • Did you control for Turbo frequency at all? `rdtsc` doesn't time core clock cycles, just wall-clock-equivalent reference cycles on modern CPUs. [Get CPU cycle count?](https://stackoverflow.com/a/51907627). – Peter Cordes Sep 23 '18 at 12:18
  • I didn't know about lfence, nor did I control for clock – Dov Sep 23 '18 at 12:20
  • I have never seen this compile into Gauss' formula, and I have run this sort of test a lot with -O2 in C++ classes. g++ will recognize when a loop is unnecessary and remove it, so I have always been able to fool the optimizer with summation. – Dov Sep 23 '18 at 12:41
  • Having called the function once before the benchmark, it should be in cache. I will eventually change to the intrinsics, and get lfence that way, but if what you say is true, the clock should read nearer zero. How could I possibly get an answer of hundreds of clock cycles for the empty function? – Dov Sep 23 '18 at 12:44
  • https://godbolt.org/z/fvLNdv shows clang using Gauss's formula. But `gcc -O3` doesn't spot it, and just auto-vectorizes the additions. – Peter Cordes Sep 23 '18 at 13:00
  • There are so many foot-guns possible when measuring small code sequences, I highly recommend to use an existing benchmark framework like Google Benchmark. Is your goal to actually measure this method accurately, or to learn more about writing reliable benchmarks without using a third party component? – BeeOnRope Sep 23 '18 at 21:20

1 Answer


Consistently the empty function shows that it takes way more than it should.

You have cpuid inside the timed interval. cpuid on Intel Sandybridge-family CPUs takes 100 to 250 core clock cycles (depending on the inputs, which you neglected to set), according to Agner Fog's testing. (https://agner.org/optimize/).

But you're not measuring core clock cycles, you're measuring RDTSC reference cycles. When the core is running below its rated frequency, each reference cycle is significantly shorter than a core cycle, so the same work shows up as many more reference cycles. (e.g. my Skylake i7-6700k idles at 800MHz, but the reference clock freq is 4008 MHz.) See [Get CPU cycle count?](https://stackoverflow.com/a/51907627) for my attempt at a canonical answer on rdtsc.

Warm up the CPU first, or run a pause busy loop on another core to keep it pegged at max (assuming it's a desktop / laptop dual or quad core, where all core freqs are locked together.)
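
For example, a warm-up along these lines could run before the timed region (a rough sketch, not from the original answer; `warmUpCpu` and the iteration count are made up, and `_mm_pause` needs an x86 intrinsics header):

#include <immintrin.h>   // _mm_pause

// Spin before timing so the core can ramp up to its maximum (turbo) frequency.
// The iteration count is arbitrary and machine-dependent; tune as needed.
static void warmUpCpu() {
    for (unsigned long long i = 0; i < 100000000ULL; ++i)
        _mm_pause();     // gentle busy-wait that still keeps the core active
}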


Never mind the fact that the variance is huge. In the second run, the empty function claims to be taking 357 clock cycles and the sum takes less, which is ridiculous.

Is that effect also consistent?

Maybe your CPU ramped up to full speed during / after printing the 3rd line of messages, making the last timed region run a lot faster? (Why does this delay-loop start to run faster after several iterations with no sleep?).

IDK how much effect different garbage in eax and ecx before cpuid could have. Replace it with lfence to eliminate that and use a much lower overhead way to serialize rdtsc.
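
A lower-overhead reading might look something like the following, sketched with the GCC/clang intrinsics `__rdtsc()` and `_mm_lfence()` from `<x86intrin.h>` (an illustration of the suggestion, not code taken from the question or this answer):

#include <x86intrin.h>   // __rdtsc, _mm_lfence (GCC/clang)
#include <cstdint>

// lfence waits for earlier instructions to complete locally before later ones
// start executing, so it keeps rdtsc from reordering around the timed work
// without the 100+ cycle cost of cpuid.
inline uint64_t rdtscSerialized() {
    _mm_lfence();            // don't start the clock before prior work finishes
    uint64_t tsc = __rdtsc();
    _mm_lfence();            // don't let the timed work start before the clock is read
    return tsc;
}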

Peter Cordes
  • lfence and warming worked pretty well. It seems fairly consistent now, but I will keep looking – Dov Sep 30 '18 at 08:50