I am trying to reproduce the Intel white paper "How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures". The white paper provides a kernel module that accurately measures the execution time of a piece of code by disabling preemption and interrupts and by serializing the RDTSC/RDTSCP reads.
However, when I run the benchmark code, I do not get the low variance reported in the white paper, so I cannot reproduce its results, and I cannot figure out what is wrong.
The core of the kernel module is just a couple of lines:
unsigned long flags;        /* raw_local_irq_save() expects unsigned long */
unsigned cycles_low, cycles_high, cycles_low1, cycles_high1;

preempt_disable();          /* keep the scheduler from preempting or migrating us */
raw_local_irq_save(flags);  /* mask interrupts around the measured section */

asm volatile(
    "CPUID\n\t"             /* serializing: all earlier instructions retire before RDTSC */
    "RDTSC\n\t"             /* start timestamp: EDX = high 32 bits, EAX = low 32 bits */
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    : "=r"(cycles_high), "=r"(cycles_low) :: "%rax", "%rbx", "%rcx", "%rdx");

/* call the function to measure here */

asm volatile(
    "RDTSCP\n\t"            /* end timestamp: waits for all earlier instructions to complete */
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "CPUID\n\t"             /* serializing: keeps later instructions from reordering into the measurement */
    : "=r"(cycles_high1), "=r"(cycles_low1) :: "%rax", "%rbx", "%rcx", "%rdx");

raw_local_irq_restore(flags);
preempt_enable();
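For context, here is a minimal sketch of the statistics the harness computes around those two asm blocks. This is my reconstruction, not the white paper's exact code: SIZE_OF_STAT, samples, tsc64, and report are hypothetical names, and the variance is computed as the usual E[t^2] - E[t]^2. Each sample is the 64-bit difference end - start, where each endpoint is assembled from the EDX:EAX halves captured above:

#include <stdint.h>
#include <stdio.h>

#define SIZE_OF_STAT 100000            /* samples per loop_size (assumed) */

static uint64_t samples[SIZE_OF_STAT]; /* each entry: end - start, in cycles */

/* Combine the two 32-bit RDTSC halves into one 64-bit timestamp,
 * e.g. tsc64(cycles_high, cycles_low). */
static inline uint64_t tsc64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

static void report(int loop_size)
{
    uint64_t min = UINT64_MAX, max = 0, sum = 0, sumsq = 0;
    int i;

    for (i = 0; i < SIZE_OF_STAT; i++) {
        uint64_t t = samples[i];
        if (t < min) min = t;
        if (t > max) max = t;
        sum   += t;
        sumsq += t * t;
    }

    /* variance = E[t^2] - E[t]^2 (integer arithmetic, good enough here);
     * max_deviation = max - min */
    uint64_t mean = sum / SIZE_OF_STAT;
    uint64_t var  = sumsq / SIZE_OF_STAT - mean * mean;

    printf("loop_size:%d >>>> variance(cycles): %llu; max_deviation: %llu; min time: %llu\n",
           loop_size,
           (unsigned long long)var,
           (unsigned long long)(max - min),
           (unsigned long long)min);
}

int main(void)
{
    int i;

    /* Fake data just to exercise report(); in the kernel module each
     * sample would come from the two asm blocks shown above. */
    for (i = 0; i < SIZE_OF_STAT; i++)
        samples[i] = tsc64(0, 2216 + (i % 3));
    report(995);
    return 0;
}

As far as I can tell, the white paper's module does the same arithmetic over a two-dimensional array of samples (ensembles of runs) and additionally reports the variance of the per-ensemble variances; the sketch above only shows the per-loop arithmetic.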
These asm blocks are copied directly from the white paper, with its suggested optimizations applied. According to the white paper, the expected output is:
loop_size:995 >>>> variance(cycles): 0; max_deviation: 0 ;min time: 2216
loop_size:996 >>>> variance(cycles): 28; max_deviation: 4 ;min time: 2216
loop_size:997 >>>> variance(cycles): 0; max_deviation: 112 ;min time: 2216
loop_size:998 >>>> variance(cycles): 28; max_deviation: 116 ;min time: 2220
loop_size:999 >>>> variance(cycles): 0; max_deviation: 0 ;min time: 2224
total number of spurious min values = 0
total variance = 1
absolute max deviation = 220
variance of variances = 2
variance of minimum values = 335757
However, what I get is:
[1418048.049032] loop_size:42 >>>> variance(cycles): 104027;max_deviation: 92312 ;min time: 17
[1418048.049222] loop_size:43 >>>> variance(cycles): 18694;max_deviation: 43238 ;min time: 17
[1418048.049413] loop_size:44 >>>> variance(cycles): 1;max_deviation: 60 ;min time: 17
[1418048.049602] loop_size:45 >>>> variance(cycles): 1;max_deviation: 106 ;min time: 17
[1418048.049792] loop_size:46 >>>> variance(cycles): 69198;max_deviation: 83188 ;min time: 17
[1418048.049985] loop_size:47 >>>> variance(cycles): 1;max_deviation: 60 ;min time: 17
[1418048.050179] loop_size:48 >>>> variance(cycles): 1;max_deviation: 61 ;min time: 17
[1418048.050373] loop_size:49 >>>> variance(cycles): 1;max_deviation: 58 ;min time: 17
[1418048.050374] total number of spurious min values = 2
[1418048.050374] total variance = 28714
[1418048.050375] absolute max deviation = 101796
[1418048.050375] variance of variances = 1308070648
This is a much higher max_deviation and variance(cycles) than the white paper reports. (Please ignore the different min time values: the white paper is presumably measuring a real function, whereas my code does not actually benchmark anything between the two reads.)
Is there anything I missed from the white paper? Or is the white paper out of date, and are there additional techniques I should use on modern x86 CPUs? How can I measure the execution time of a piece of code with the highest possible precision on modern Intel x86 CPUs?
P.S. The code I run is available here.