
I want to time a function call with rdtsc. So I measured it in two ways as follows.

  1. Call it in a loop, accumulate each rdtsc difference inside the loop, and divide by the number of calls (call this N).
  2. Call it in a loop, take the rdtsc difference across the whole loop, and divide by N.

But I see a couple of inconsistent behaviors.

  1. When I increase N, the times reported by both method 1 and method 2 decrease rather monotonically. For method 2 this is understandable, since it amortizes the loop control overhead, but I am not sure how that applies to method 1.
  2. Actually, for method 2, each time I increase N the value I get seems to be just the N=1 value divided by the new N. Inspecting the gdb disassembly made me realize this is a compiler optimization at -O2, where the loop is skipped entirely in the second case. So I retried with -O0, where the gdb disassembly shows the loop actually being present in the second case as well. (A sketch of one way to keep such a loop alive at -O2 follows this list.)
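
A minimal sketch of that workaround, assuming GCC-style inline asm: an empty `asm volatile` with a `"memory"` clobber acts as a compiler barrier the optimizer cannot remove, so the loop body survives at -O2:

    for (i = 0; i < N; i++) {
      bar();
      /* Empty asm with a "memory" clobber: the compiler must assume memory
         may change here, so it cannot elide this iteration or the loop. */
      asm volatile("" ::: "memory");
    }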

Code is given below.

    #include <stdio.h>
    #include <inttypes.h>
    #include <stdlib.h>

    typedef unsigned long long ticks;

    static __inline__ ticks getticks(void) {
      unsigned a, d; 
      asm volatile("rdtsc" : "=a" (a), "=d" (d)); 
      return ((ticks)a) | (((ticks)d) << 32); 
    }

    __attribute__ ((noinline))
    void bar() {

    }

    int main(int argc, char** argv) {
      long long N = 1000000;
      if (argc > 1) N = atoi(argv[1]); /* note: atoi truncates values above INT_MAX */
      long long i;
      long long bar_total = 0;

      ticks start = 0, end = 0;

      /* Method 1: time each call individually and accumulate the deltas. */
      for (i = 0; i < N; i++) {
        start = getticks();
        bar();
        end = getticks();
        bar_total += (end - start);
      }

      fprintf(stdout, "Total invocations : %lld\n", N);
      fprintf(stdout, "[regular] bar overhead : %lf\n", ((double)bar_total / N));

      /* Method 2: time the whole loop once and divide by N. */
      start = getticks();
      for (i = 0; i < N; i++) {
        bar();
      }
      end = getticks();

      bar_total = (end - start);

      fprintf(stdout, "[Loop] bar overhead : %lf\n", ((double)bar_total / N));

      return 0;
    }

Any idea what's going on here? I can post the gdb disassembly if needed as well. I used the rdtsc implementation from http://dasher.wustl.edu/tinker/distribution/fftw/kernel/cycle.h

Edit: I am going to have to retract my second statement that at -O0 the time drops directly proportional to N in the second case. I guess some mistake I made during the build caused an older binary to persist. Anyhow, the value still goes down somewhat, along with the figure for method 1. Here are some numbers for different N values.

    taskset -c 2 ./example.exe 1
    Total invocations : 1
    [regular] bar overhead : 108.000000
    [Loop] bar overhead : 138.000000

    taskset -c 2 ./example.exe 10
    Total invocations : 10
    [regular] bar overhead : 52.900000
    [Loop] bar overhead : 40.700000

    taskset -c 2 ./example.exe 100
    Total invocations : 100
    [regular] bar overhead : 46.780000
    [Loop] bar overhead : 15.570000

    taskset -c 2 ./example.exe 1000
    Total invocations : 1000
    [regular] bar overhead : 46.069000
    [Loop] bar overhead : 13.669000

    taskset -c 2 ./example.exe 100000
    Total invocations : 10000
    [regular] bar overhead : 46.010100
    [Loop] bar overhead : 13.444900

    taskset -c 2 ./example.exe 100000000
    Total invocations : 100000000
    [regular] bar overhead : 26.970272
    [Loop] bar overhead : 5.201252

    taskset -c 2 ./example.exe 1000000000
    Total invocations : 1000000000
    [regular] bar overhead : 18.853279
    [Loop] bar overhead : 5.218234

    taskset -c 2 ./example.exe 10000000000
    Total invocations : 1410065408
    [regular] bar overhead : 18.540719
    [Loop] bar overhead : 5.216395

I see two new behaviors now.

  1. Method 1 converges more slowly than method 2. But I am still puzzled why the values differ so drastically across N settings. Perhaps I am making some basic mistake here that I don't see at the moment.
  2. Method 1's value is actually larger than method 2's by some margin. I expected it to be on par or slightly smaller than method 2's value, since it doesn't contain the loop control overhead.

Questions

So, in summary, my questions are:

  1. Why do the values given by both methods change so drastically as N increases? Especially for method 1, which doesn't account for loop control overhead.

  2. Why is the second method's result less than the first method's, when the first method excludes the loop control overhead from its calculation?

Edit 2

Regarding the suggested rdtscp solution.

Being uninitiated in inline assembly, I did the following.

    static __inline__ ticks getstart(void) {
      unsigned cycles_high = 0, cycles_low = 0;
      asm volatile ("CPUID\n\t"
                    "RDTSC\n\t"
                    "mov %%edx, %0\n\t"
                    "mov %%eax, %1\n\t" : "=r" (cycles_high), "=r" (cycles_low)
                    :: "%rax", "%rbx", "%rcx", "%rdx");
      return ((ticks)cycles_high) | (((ticks)cycles_low) << 32);
    }

    static __inline__ ticks getend(void) {
      unsigned cycles_high = 0, cycles_low = 0;
      asm volatile ("RDTSCP\n\t"
                    "mov %%edx, %0\n\t"
                    "mov %%eax, %1\n\t"
                    "CPUID\n\t" : "=r" (cycles_high), "=r" (cycles_low)
                    :: "%rax", "%rbx", "%rcx", "%rdx");
      return ((ticks)cycles_high) | (((ticks)cycles_low) << 32);
    }

and used the above methods before and after the function call. But now I get nonsensical results like the following.

    Total invocations : 1000000
    [regular] bar overhead : 304743228324.708374
    [Loop] bar overhead : 33145641307.734016

What's the catch? I wanted to factor these out as inlined functions since I use them in multiple places.

Answer: the solution is in the comments below; I had shifted the wrong half when combining the two 32-bit values (the high word from edx must be shifted, not the low word).
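
For reference, here is a corrected pair per osgx's comment below; only the return lines change, matching the Intel paper's combine:

    static __inline__ ticks getstart(void) {
      unsigned cycles_high = 0, cycles_low = 0;
      asm volatile ("CPUID\n\t"
                    "RDTSC\n\t"
                    "mov %%edx, %0\n\t"
                    "mov %%eax, %1\n\t" : "=r" (cycles_high), "=r" (cycles_low)
                    :: "%rax", "%rbx", "%rcx", "%rdx");
      /* EDX holds the high 32 bits of the TSC, EAX the low 32 bits. */
      return (((ticks)cycles_high) << 32) | (ticks)cycles_low;
    }

    static __inline__ ticks getend(void) {
      unsigned cycles_high = 0, cycles_low = 0;
      asm volatile ("RDTSCP\n\t"
                    "mov %%edx, %0\n\t"
                    "mov %%eax, %1\n\t"
                    "CPUID\n\t" : "=r" (cycles_high), "=r" (cycles_low)
                    :: "%rax", "%rbx", "%rcx", "%rdx");
      return (((ticks)cycles_high) << 32) | (ticks)cycles_low;
    }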

chamibuddhika
  • Can you post program output for several N? – osgx Mar 22 '15 at 01:23
  • Added some outputs for different N values. Please also see my retraction. – chamibuddhika Mar 22 '15 at 01:38
  • Thank you. What is your CPU? And what is the question? You should modify your rdtsc function to include `cpuid` or other serializing instruction - check https://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf page 3 "3.1. Out-of-Order-Execution"; or switch to `rdtscp`: http://stackoverflow.com/questions/12631856/difference-between-rdtscp-rdtsc-memory-and-cpuid-rdtsc?rq=1 – osgx Mar 22 '15 at 01:39
  • uname -a gives Linux 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux. From /proc/cpuinfo it is an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz with 48 cores. – chamibuddhika Mar 22 '15 at 01:41
  • Thanks. I added a new summary with questions. I am not familiar with the specifics of rdtsc implementations. Do you think that would explain the above behaviors? I will try using rdtscp as well. – chamibuddhika Mar 22 '15 at 01:49

2 Answers


You use the plain rdtsc instruction, which may not work correctly on out-of-order CPUs, like Xeons and Cores. You should add a serializing instruction or switch to the rdtscp instruction:

http://en.wikipedia.org/wiki/Time_Stamp_Counter

Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.[3] This problem can be solved by executing a serializing instruction, such as CPUID, to force every preceding instruction to complete before allowing the program to continue, or by using the RDTSCP instruction, which is a serializing variant of the RDTSC instruction.

Intel has a recent manual on using rdtsc/rdtscp - How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures (ia-32-ia-64-benchmark-code-execution-paper.pdf, 324264-001, 2010). They recommend cpuid+rdtsc for the start timer and rdtscp for the end timer:

The solution to the problem presented in Section 0 is to add a CPUID instruction just after the RDTSCP and the two mov instructions (to store in memory the value of edx and eax). The implementation is as follows:

asm volatile ("CPUID\n\t"
 "RDTSC\n\t"
 "mov %%edx, %0\n\t"
 "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
"%rax", "%rbx", "%rcx", "%rdx");
/***********************************/
/*call the function to measure here*/
/***********************************/
asm volatile("RDTSCP\n\t"
 "mov %%edx, %0\n\t"
 "mov %%eax, %1\n\t"
 "CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1)::
"%rax", "%rbx", "%rcx", "%rdx");

start = ( ((uint64_t)cycles_high << 32) | cycles_low );
end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 );

In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. Nevertheless, this call does not affect the measurement since it comes before the RDTSC (i.e., before the timestamp register is read). The first RDTSC then reads the timestamp register and the value is stored in memory. Then the code that we want to measure is executed. If the code is a call to a function, it is recommended to declare such function as “inline” so that from an assembly perspective there is no overhead in calling the function itself. The RDTSCP instruction reads the timestamp register for the second time and guarantees that the execution of all the code we wanted to measure is completed.

Your example is not quite correct; you try to measure the empty function bar(), but it is so short that in method 1 (for() { rdtsc; bar(); rdtsc; }) you are mostly measuring the rdtsc overhead itself. According to Agner Fog's table for Haswell - http://www.agner.org/optimize/instruction_tables.pdf page 191 (the long table "Intel Haswell List of instruction timings and μop breakdown", at its very end) - RDTSC is 15 uops (no fusion possible) with a latency of 24 ticks; on the older Sandy Bridge microarchitecture, RDTSCP is 23 uops with 36 ticks latency, versus 21 uops and 28 ticks for rdtsc. So you can't use plain rdtsc (or rdtscp) to directly measure such short code.
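
A minimal sketch of calibrating that overhead so it can be subtracted from short measured intervals, reusing the question's `ticks` and `getticks` (the helper name is illustrative, not from the Intel paper):

    /* Time an empty interval many times and keep the minimum:
       an estimate of the fixed cost of the timing harness itself. */
    static ticks measure_overhead(int reps) {
      ticks best = (ticks)-1;
      int r;
      for (r = 0; r < reps; r++) {
        ticks s = getticks();
        ticks e = getticks();   /* nothing measured in between */
        if (e - s < best)
          best = e - s;
      }
      return best;              /* subtract this from measured deltas */
    }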

osgx
  • Interesting. I will try this out and report back. Btw, just reading the assembly, wouldn't the two moves just after the initial rdtscp be included in the timing as well? – chamibuddhika Mar 22 '15 at 01:58
  • chamibuddhika, they will be started, and they will use two or three execution ports of the CPU; but on a wide superscalar out-of-order CPU the moves will barely affect your result - they don't prevent your instructions from being started by the CPU. – osgx Mar 22 '15 at 02:00
  • Please see my second edit. Thanks for the informative inputs. – chamibuddhika Mar 22 '15 at 02:16
  • chamibuddhika, the wrong part was shifted - check the pdf http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf `start = ( ((uint64_t)cycles_high << 32) | cycles_low ); end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 );` – osgx Mar 22 '15 at 02:18
  • Argh... Right. Now it's better. But the previous behavior doesn't seem to have changed. I tried on a different machine as well. May be I am doing something really stupid here. – chamibuddhika Mar 22 '15 at 02:33
  • chamibuddhika, I think you should add real code to `bar`; there is some measuring overhead for executing rdtsc/rdtscp (20-30 cycles; there are two rdtscs, and one of them will be included in the interval), and an empty function is too small and too fast to be measured by method 1. Check the update of my answer starting from "Your example is not quite correct;...". PS: there are public microarchitecture simulators, for example IACA from Intel - http://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it - it can show micro-operations, latency and execution ports for linear code (not for calls/jmps) – osgx Mar 22 '15 at 02:42
  • chamibuddhika, and you should disable dynamic frequency changing... tsc is reported in some basic units, not in ticks; for large N your CPU will increase its frequency, but the tsc units will not change. Run `taskset -c 2 perf stat ./example.exe 100000000` to get the real frequency (the mean over the program's running time, computed from the ratio of CPU cycles to real time; correct for not-too-short CPU-bound tasks). – osgx Mar 22 '15 at 02:45
  • Hmm, I was just reading the Wikipedia entry on this. Shouldn't constant tsc make the results less skewed rather than more inconsistent? I just checked; constant_tsc seems to be supported by the CPU. – chamibuddhika Mar 22 '15 at 04:18
  • I am going to accept this answer since it has been very useful. But I am still puzzled by the fact that constant_tsc gives skewed results, as per my earlier comment. – chamibuddhika Mar 22 '15 at 16:19
  • It is not constant_tsc that skews the results; only a too-short (incorrectly designed) test does. Constant tsc means the TSC rate is not tied to the current CPU frequency: the scaling factor is the same constant for every possible frequency - when Turbo Boost 2.0 is activated, your CPU goes from 2.3 GHz to 3.1 GHz, and constant tsc means the tsc increments at some constant rate not equal to the CPU frequency (it can tick at 2.3 GHz or at 3.0 GHz - I have no idea which). – osgx Mar 22 '15 at 21:51
  • In the code you posted, what guarantees that the code between CPUID and RDTSCP won't be reordered? I.e., how do we know that RDTSC is going to be executed before the function-to-measure and not in the middle or after it? – ilstam Feb 07 '20 at 18:06
  • @ilstam, use rdtscp if supported: https://www.felixcloutier.com/x86/rdtscp optionally with additional *fence instructions. cpuid is serializing, and the architectural rdtsc/rdtscp will have no dependencies to wait on, so it will be decoded and started early (it is the first decoded instruction after the serializing one). It will execute in dozens of ticks on many uarches, because it is microcoded: https://www.agner.org/optimize/instruction_tables.pdf + https://software.intel.com/en-us/forums/watercooler-catchall/topic/698640. So it is not reliable for measuring 10 ticks, but OK for measuring 10000 ticks. Or use intel-pt (simple-pt) – osgx Feb 07 '20 at 20:10

Have you tried clock_gettime(CLOCK_MONOTONIC, &tp)? It should be quite close to reading the cycle counter by hand; also keep in mind that the cycle counter may not be synchronized between CPU cores.
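
A minimal, self-contained sketch of that approach, assuming an empty bar as in the question (on older glibc you may need to link with -lrt):

    #include <stdio.h>
    #include <time.h>

    __attribute__ ((noinline))
    void bar(void) {}

    int main(void) {
      struct timespec t0, t1;
      long long i, n = 100000000LL, ns;

      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (i = 0; i < n; i++)
        bar();
      clock_gettime(CLOCK_MONOTONIC, &t1);

      /* Elapsed wall time in nanoseconds, then the per-call average. */
      ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
      printf("per-call: %f ns\n", (double)ns / n);
      return 0;
    }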

Jacob
  • No I haven't. I will try it as well. I was using taskset to pin the process to prevent such a case from happening, so I guess that is accounted for. – chamibuddhika Mar 22 '15 at 01:54
  • [I think](http://stackoverflow.com/questions/3388134/rdtsc-accuracy-across-cpu-cores) that the tsc will not be synchronized only on some older NUMA (multi-socket) motherboards; Linux tries to synchronize them: http://lxr.free-electrons.com/source/arch/x86/kernel/tsc_sync.c. And the tsc counts not CPU cycles but some monotonic timer ('Invariant TSC'). PS: what about `CLOCK_MONOTONIC_RAW`? – osgx Mar 22 '15 at 01:58