I want to time a function call with rdtsc. So I measured it in two ways as follows.
- Call it in a loop. Aggregate each rdtsc difference within the loop and divide by number of calls. (Let's say this is N)
- Call it in a loop. Get the rdtsc difference of the loop itself and divide by N.
But I see couple of inconsistent behaviors.
- When I increase N the times get reduced rather monotonically in both method 1 and 2. For method 2 it is understandable in that it would amortize the loop control overhead. But I am not sure how it is so for method 1.
- Actually for method 2 each time when I increase the N the value I get for N=1 seems to be just divided by the new N each time. Inspecting gdb disassembly made me realize it is some compiler optimization at -O2 where loop is skipped at second case. So I retried with -O0, where the gdb disassembly show the actual loop being there for the second case as well.
Code is given below.
#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>
typedef unsigned long long ticks;
static __inline__ ticks getticks(void) {
unsigned a, d;
asm volatile("rdtsc" : "=a" (a), "=d" (d));
return ((ticks)a) | (((ticks)d) << 32);
}
__attribute__ ((noinline))
void bar() {
}
int main(int argc, char** argv) {
long long N = 1000000;
N = atoi(argv[1]);
int i;
long long bar_total = 0;
ticks start = 0, end = 0;
for (i = 0; i < N; i++) {
start = getticks();
bar();
end = getticks();
bar_total += (end - start);
}
fprintf(stdout, "Total invocations : %lld\n", N);
fprintf(stdout, "[regular] bar overhead : %lf\n", ((double)bar_total/ N));
start = getticks();
for (i = 0; i < N; i++) {
bar();
}
end = getticks();
bar_total = (end - start);
fprintf(stdout, "[Loop] bar overhead : %lf\n", ((double)bar_total/ N));
return 0;
}
Any idea what's going on here? I can put the gdb disassembly if needed as well. I used the rdtsc implementation from http://dasher.wustl.edu/tinker/distribution/fftw/kernel/cycle.h
Edit: I am going to have to retract my second statement that at -O0 the time gets dropped directly proportional to N in the second case. I guess it's some mistake I made during the build causing some older version to persist. Any how it still goes down somewhat along with figure for method 1. Here are some numbers for different N values.
taskset -c 2 ./example.exe 1
Total invocations : 1
[regular] bar overhead : 108.000000
[Loop] bar overhead : 138.000000
taskset -c 2 ./example.exe 10
Total invocations : 10
[regular] bar overhead : 52.900000
[Loop] bar overhead : 40.700000
taskset -c 2 ./example.exe 100
Total invocations : 100
[regular] bar overhead : 46.780000
[Loop] bar overhead : 15.570000
taskset -c 2 ./example.exe 1000
Total invocations : 1000
[regular] bar overhead : 46.069000
[Loop] bar overhead : 13.669000
taskset -c 2 ./example.exe 100000
Total invocations : 10000
[regular] bar overhead : 46.010100
[Loop] bar overhead : 13.444900
taskset -c 2 ./example.exe 100000000
Total invocations : 100000000
[regular] bar overhead : 26.970272
[Loop] bar overhead : 5.201252
taskset -c 2 ./example.exe 1000000000
Total invocations : 1000000000
[regular] bar overhead : 18.853279
[Loop] bar overhead : 5.218234
taskset -c 2 ./example.exe 10000000000
Total invocations : 1410065408
[regular] bar overhead : 18.540719
[Loop] bar overhead : 5.216395
I see two new behaviors now.
- Method 1 converges slower than the method 2. But still I am puzzling over why there is such a drastic difference in values for different N settings. Perhaps I am doing some basic mistake here which I don't see at the moment.
- Method 1 value is actually larger than method 2 by some margin. I expected it be on par or slightly smaller than the method 2 value since it doesn't contain loop control overhead.
Questions
So in summary my questions are
Why are the values given by both methods change so drastically when increasing the N? Specially for method 1 which doesn't account for loop control overhead.
Why is second method result is less than the first method's when first method excludes the loop control overhead in the calculations?
Edit 2
Regarding the suggested rdtscp solution.
Being uninitiated about the inline assembly I did the following.
static __inline__ ticks getstart(void) {
unsigned cycles_high = 0, cycles_low = 0;
asm volatile ("CPUID\n\t"
"RDTSC\n\t"
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
"%rax", "%rbx", "%rcx", "%rdx");
return ((ticks)cycles_high) | (((ticks)cycles_low) << 32);
}
static __inline__ ticks getend(void) {
unsigned cycles_high = 0, cycles_low = 0;
asm volatile("RDTSCP\n\t"
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t"
"CPUID\n\t": "=r" (cycles_high), "=r" (cycles_low)::
"%rax", "%rbx", "%rcx", "%rdx");
return ((ticks)cycles_high) | (((ticks)cycles_low) << 32);
}
and used above methods before and after the function call. But now I get non sensical results like follows.
Total invocations : 1000000
[regular] bar overhead : 304743228324.708374
[Loop] bar overhead : 33145641307.734016
What's the catch? I wanted to factor out those as inlined methods since I see use of it in multiple places.
A. Solution in the comments.