
I am trying to compare the times measured by the C++11 std::chrono::high_resolution_clock and the rdtsc_clock defined below. From high_resolution_clock I get results like 11000, 3000, 1000, 0. From rdtsc_clock I get 134, 15, 91, etc. Why do their results look so different? My gut feeling is that rdtsc_clock is giving the more accurate results, am I right?

    template<std::intmax_t clock_freq>
    struct rdtsc_clock {
        typedef unsigned long long rep;
        typedef std::ratio<1, clock_freq> period;
        typedef std::chrono::duration<rep, period> duration;
        typedef std::chrono::time_point<rdtsc_clock> time_point;
        static const bool is_steady = true;

        static time_point now() noexcept
        {

            unsigned lo, hi;
            asm volatile("rdtsc" : "=a" (lo), "=d" (hi));

            return time_point(duration(static_cast<rep>(hi) << 32 | lo));
        }
    };

The timing code:

    typedef std::chrono::high_resolution_clock Clock;
    //typedef rdtsc_clock<3300000000> Clock;
    typedef std::chrono::nanoseconds nanoseconds;
    typedef std::chrono::duration<double, typename Clock::period> Cycle;
    for(int n = 0; n < 10; n++){
        auto t1 = Clock::now();
        // code being timed
        auto t2 = Clock::now();
        printf("%lld ns\n", static_cast<long long>(
            std::chrono::duration_cast<nanoseconds>(Cycle(t2 - t1)).count()));
    }
Bryan Fok

3 Answers


Issues with RDTSC usage

If you read the documentation on RDTSC, you'll see that it doesn't guarantee that instructions appearing after the RDTSC instruction aren't executed in the pipeline before the RDTSC itself runs (nor that earlier instructions don't run afterwards). The usual advice is to issue a CPUID instruction immediately before and/or after the RDTSC to create such "sequence points". Obviously this costs performance, and is more acceptable for some kinds of measurement than others (where average throughput figures are of more interest than individual samples). You can expect the Standard library implementation to be much more careful about this, which may be one reason its measurements are far higher.
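
By way of illustration, a minimal sketch of that CPUID-fenced read (assuming GCC/Clang inline assembly on x86-64; `rdtsc_serialized` is just an example name, not part of the question's code):

    // Read the TSC behind a CPUID fence: CPUID is a serializing instruction,
    // so earlier instructions must retire before RDTSC executes.
    static inline unsigned long long rdtsc_serialized()
    {
        unsigned lo, hi;
        asm volatile("xor %%eax, %%eax\n\t"
                     "cpuid\n\t"
                     "rdtsc"
                     : "=a" (lo), "=d" (hi)
                     :
                     : "%rbx", "%rcx");   // cpuid also clobbers ebx/ecx
        return static_cast<unsigned long long>(hi) << 32 | lo;
    }

Doing the same around the second sample brackets the measured region more tightly, at the cost of the extra overhead mentioned above.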

Cross-Core Issues (not relevant per your comment)

Each CPU core maintains its own TSC register... if you just start taking samples on a thread that's not bound to a core, or on multiple threads not bound to the same core, you may see "weird" jumps in values. Some companies (e.g. Microsoft) insist that the Hardware Abstraction Layer (HAL) is responsible for getting the registers as close to in-sync as possible, but many (even brand-new, high-end) PCs simply fail to do this.

You can get around this by binding to a core, or by doing a calibration step that measures the cross-core deltas (with some calibration error margin), then adjusting later samples based on the core they were sampled from (which is itself painful to determine on most CPUs - you'll need to spin taking samples between CPUID instructions or something similar).
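
As a sketch of the first option, this is one way to bind the calling thread to a single core on Linux (glibc-specific API; `pin_to_core` is a made-up helper name):

    #include <pthread.h>
    #include <sched.h>

    // Bind the calling thread to one core so every TSC sample is read
    // from the same core's counter.
    bool pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }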

Tony Delroy
  • process is explicitly 'taskset' to a core, and the code is single thread – Bryan Fok Nov 01 '13 at 02:09
  • @BryanFok: some other issues added above – Tony Delroy Nov 01 '13 at 02:37
  • Another issue with rdtsc is that it measures cycles, and modern CPUs dynamically adjust their clock speed to minimize power usage and control core temperature; depending on the processor, this may change the rate at which the counter updates, so it has no straightforward relationship to time. – mattnewport Nov 01 '13 at 02:58
  • @mattnewport I was also worried about the adjustment you mentioned. Looks like my concern is correct; see my other post: http://stackoverflow.com/questions/19719317/measuring-the-cpu-frequrency-scaling-effect?rq=1 – Bryan Fok Nov 01 '13 at 03:12
  • @TonyD Thanks Tony for pointing out the CPUID instruction. For reference, I couldn't see a difference (in the calculated mean) after adding it. Of course I was running a large number of loops. – Bryan Fok Nov 01 '13 at 03:44
  • @mattnewport: quite right, and very relevant in the portable device space – Tony Delroy Nov 01 '13 at 03:53
  • In fact, I was able to measure that the time doubled after adding the CPUID instruction. – Bryan Fok Nov 04 '13 at 01:07
  • If that still doesn't account for the magnitude of differences you're observing, I suggest having a look at the disassembly for both approaches... you may find other differences. – Tony Delroy Nov 05 '13 at 14:48
  • http://en.wikipedia.org/wiki/Time_Stamp_Counter mentions there's now a `RDTSCP` which does the same but can't be reordered at the CPU level, but it's new and not in many new CPUs yet – Mooing Duck Jan 31 '14 at 17:06
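
For reference, the RDTSCP mentioned in the last comment is exposed as a compiler intrinsic; a minimal sketch, assuming GCC/Clang and <x86intrin.h> (`rdtscp_sample` is a made-up name):

    #include <x86intrin.h>   // __rdtscp

    // RDTSCP waits for prior instructions to retire before reading the TSC,
    // and also returns IA32_TSC_AUX, which the OS typically loads with the
    // core id - handy for spotting cross-core samples.
    static inline unsigned long long rdtscp_sample(unsigned& core_id)
    {
        return __rdtscp(&core_id);
    }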

I think you are not comparing the same thing. On my Mac this example works; rdtsc and std::chrono give the same number of cycles, if it helps:

    #include <iostream>
    #include <vector>
    #include <numeric>
    #include <chrono>

    static __inline__ unsigned long long rdtsc(void){
        unsigned hi, lo;
        __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
    }

    static int sink = 0;

    int main(){
        typedef std::ratio<1, 2400000000> period; // My mac @ 2.4 GHz
        unsigned long long int a, b;
        for (auto size = 1ull; size < 1000000000ull; size *= 100) {
            // record start time
            auto start = std::chrono::high_resolution_clock::now();
            a = rdtsc();
            // do some work
            std::vector<int> v(size, 42);
            sink = std::accumulate(v.begin(), v.end(), 0u); // make sure it's a side effect
            // record end time
            b = rdtsc();
            auto end = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double, period> diff = end - start;
            std::cout << "Time to fill and iterate a vector of "
                      << size << " ints : " << diff.count() << " [cycle]"
                      << ", old style: " << b - a << " [cycle] \n";
        }
    }


    Time to fill and iterate a vector of 1 ints : 13965.6 [cycle], old style: 13731 [cycle]
    Time to fill and iterate a vector of 100 ints : 1065.6 [cycle], old style: 969 [cycle]
    Time to fill and iterate a vector of 10000 ints : 68076 [cycle], old style: 67899 [cycle]
    Time to fill and iterate a vector of 1000000 ints : 5.4853e+06 [cycle], old style: 5483487 [cycle]
    Time to fill and iterate a vector of 100000000 ints : 6.57399e+08 [cycle], old style: 657395277 [cycle]
Timocafé

I'm finding pretty big differences from Timocafé's example code on my own Mac laptop:

    clang++ 9.1.0 -O3
    Time to fill and iterate a vector of 1 ints : 27650.4 [cycle], old style: 35464 [cycle]
    Time to fill and iterate a vector of 100 ints : 763.2 [cycle], old style: 939 [cycle]
    Time to fill and iterate a vector of 10000 ints : 90712.8 [cycle], old style: 117181 [cycle]
    Time to fill and iterate a vector of 1000000 ints : 4.79993e+06 [cycle], old style: 6199891 [cycle]
    Time to fill and iterate a vector of 100000000 ints : 4.80331e+08 [cycle], old style: 620426953 [cycle]

    g++ 5.5 -O3
    Time to fill and iterate a vector of 1 ints : 2400 [cycle], old style: 1324 [cycle]
    Time to fill and iterate a vector of 100 ints : 0 [cycle], old style: 944 [cycle]
    Time to fill and iterate a vector of 10000 ints : 96000 [cycle], old style: 125444 [cycle]
    Time to fill and iterate a vector of 1000000 ints : 5.4648e+06 [cycle], old style: 7059362 [cycle]
    Time to fill and iterate a vector of 100000000 ints : 5.05517e+08 [cycle], old style: 652940006 [cycle]

Things like a 0 time are bothersome; that indicates the compiler reordered things around the high_resolution_clock calls. At least with our assembly-code rdtsc timers we can use volatile to get the behavior we desire. If I put the rdtsc calls inside the high_resolution_clock calls then I get a monotonic clock, which tells me it is the volatile in our assembly that is preserving ordering (see the sketch below). Also, the time_point seems to be in different units and accuracy across the two compilers. Ugh.
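
A minimal sketch of the nesting described above, reusing the rdtsc() helper and variables from Timocafé's answer:

    auto start = std::chrono::high_resolution_clock::now();
    a = rdtsc();          // volatile asm: compiler can't hoist the work above this
    // ... work being timed ...
    b = rdtsc();          // ...nor sink it below this
    auto end = std::chrono::high_resolution_clock::now();
    // With the rdtsc reads nested inside the chrono reads, the chrono delta
    // stays non-zero/monotonic (per the observation above), which points at
    // the volatile asm as what prevents reordering around the measurement points.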

Meta.x.gdb