
I am trying to reproduce the results from the white paper *How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures*. This white paper provides a kernel module that accurately measures the execution time of a piece of code by disabling preemption, using RDTSC, etc.

However, when running the benchmark code I cannot get the low variance reported in the white paper; the technique from the white paper does not seem to work for me, and I cannot figure out what is wrong.

The core of the kernel module is just a few lines:

unsigned long flags;        /* raw_local_irq_save() expects an unsigned long */
preempt_disable();          /* keep the scheduler from preempting/migrating us */
raw_local_irq_save(flags);  /* keep interrupts from inflating the sample */

/* CPUID is a serializing instruction: nothing before it can still be
   in flight when RDTSC reads the start timestamp. */
asm volatile(
    "CPUID\n\t"
    "RDTSC\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    : "=r"(cycles_high), "=r"(cycles_low)::"%rax", "%rbx", "%rcx", "%rdx");
/* call the function to measure here */
/* RDTSCP waits until all previous instructions have executed before
   reading the end timestamp; the CPUID after it keeps later code from
   being reordered into the timed region. */
asm volatile(
    "RDTSCP\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "CPUID\n\t"
    : "=r"(cycles_high1), "=r"(cycles_low1)::"%rax", "%rbx", "%rcx", "%rdx");
raw_local_irq_restore(flags);
preempt_enable();
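
For context, the white paper then combines the two 32-bit halves into 64-bit timestamps. Here is a minimal sketch of that step, reusing the variable names above; the `elapsed` variable and the error message are my assumptions, not the paper's exact listing:

u64 start, end, elapsed;

start = ((u64)cycles_high  << 32) | cycles_low;   /* EDX:EAX at entry */
end   = ((u64)cycles_high1 << 32) | cycles_low1;  /* EDX:EAX at exit  */

if (end < start)
    pr_err("TSC went backwards\n");
else
    elapsed = end - start;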

The measurement code is copied directly from the white paper, with the suggested optimizations applied. According to the white paper, the expected output should look like this:

loop_size:995 >>>> variance(cycles): 0; max_deviation: 0 ;min time: 2216
loop_size:996 >>>> variance(cycles): 28; max_deviation: 4 ;min time: 2216
loop_size:997 >>>> variance(cycles): 0; max_deviation: 112 ;min time: 2216
loop_size:998 >>>> variance(cycles): 28; max_deviation: 116 ;min time: 2220
loop_size:999 >>>> variance(cycles): 0; max_deviation: 0 ;min time: 2224
total number of spurious min values = 0
total variance = 1
absolute max deviation = 220
variance of variances = 2
variance of minimum values = 335757

However, what I get is:

[1418048.049032] loop_size:42 >>>> variance(cycles): 104027;max_deviation: 92312 ;min time: 17
[1418048.049222] loop_size:43 >>>> variance(cycles): 18694;max_deviation: 43238 ;min time: 17
[1418048.049413] loop_size:44 >>>> variance(cycles): 1;max_deviation: 60 ;min time: 17
[1418048.049602] loop_size:45 >>>> variance(cycles): 1;max_deviation: 106 ;min time: 17
[1418048.049792] loop_size:46 >>>> variance(cycles): 69198;max_deviation: 83188 ;min time: 17
[1418048.049985] loop_size:47 >>>> variance(cycles): 1;max_deviation: 60 ;min time: 17
[1418048.050179] loop_size:48 >>>> variance(cycles): 1;max_deviation: 61 ;min time: 17
[1418048.050373] loop_size:49 >>>> variance(cycles): 1;max_deviation: 58 ;min time: 17
[1418048.050374] total number of spurious min values = 2
[1418048.050374] total variance = 28714
[1418048.050375] absolute max deviation = 101796
[1418048.050375] variance of variances = 1308070648

This is a much higher max_deviation and variance(cycles) than the white paper reports. (Please ignore the different min time: the white paper is presumably benchmarking some real code, while my timed region is empty.)

Is there anything I missed from the white paper? Or is the white paper out of date, so that I am missing some technique needed on modern x86 CPUs? How can I measure the execution time of a piece of code with the highest possible precision on a modern Intel x86 CPU?

P.S. The code I ran is here.

  • BTW, `lfence` is a sufficient barrier. Also note that CPUID's cycle cost can sometimes depend on incoming EAX and/or ECX values, so that's another reason to avoid it. That and/or CPU frequency warm-up might possibly explain things. (Also the inline asm statements can just use `"=a"` and `"=d"` outputs, instead of `mov` to `"=r"` outputs. The first asm statement could/should already be doing this with CPUID.) – Peter Cordes Jan 16 '21 at 16:01
  • What CPU model and kernel version are you testing with? – Peter Cordes Jan 16 '21 at 16:04
  • See also [How to get the CPU cycle count in x86_64 from C++?](https://stackoverflow.com/q/13772567) for more about how RDTSC (and the time-stamp-counter it reads) actually work. – Peter Cordes Jan 16 '21 at 16:06
  • For information on how to measure execution times on recent x86 CPUs, have a look at https://github.com/andreas-abel/nanoBench and https://arxiv.org/pdf/1911.03282.pdf – Andreas Abel Jan 16 '21 at 19:16
  • Although it is probably not possible to predict all the corner cases, I have had very good results from the methodology described at http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/ – John D McCalpin Jan 20 '21 at 15:52
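
A minimal sketch of the `lfence`-plus-direct-register approach suggested in the first comment above (an illustration of that suggestion, not code from the white paper or the question):

#include <stdint.h>

/* lfence does not let later instructions begin executing until all
   earlier instructions have completed locally, so it brackets rdtsc
   without CPUID's variable cost. The "=a"/"=d" constraints read
   EAX/EDX directly, with no mov needed. */
static inline uint64_t rdtsc_fenced(void)
{
    uint32_t lo, hi;
    asm volatile("lfence\n\t"
                 "rdtsc\n\t"
                 "lfence"
                 : "=a"(lo), "=d"(hi)
                 :
                 : "memory");
    return ((uint64_t)hi << 32) | lo;
}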

1 Answer

Most Intel processors have a constant TSC, which means the TSC ticks at a fixed rate regardless of the core frequency, so the core frequency and the TSC frequency may differ. If an operation takes a fixed number of core cycles to complete, it may take very different numbers of TSC cycles depending on the core frequency during each run. For example, an operation that takes 100 core cycles lasts about 83 ns at 1.2 GHz but only about 33 ns at 3 GHz, i.e., roughly 250 vs. 100 ticks of a 3 GHz TSC. When max_deviation is large, it indicates that the core frequency changed significantly during the execution of that iteration. The solution is to fix the core frequency to the maximum non-turbo frequency of your processor. For more information on constant TSC, see: Can constant non-invariant tsc change frequency across cpu states?.
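
As a quick way to confirm what your TSC does, here is a minimal user-space sketch (my illustration, using GCC's `cpuid.h`) that reads the invariant-TSC bit, CPUID.80000007H:EDX[8]; when set, the TSC ticks at a constant rate independent of the core frequency:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Extended leaf 0x80000007, EDX bit 8 = invariant TSC. */
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        printf("invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
    else
        puts("CPUID leaf 0x80000007 not supported");
    return 0;
}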

> Please ignore the different min time: the white paper is presumably benchmarking some real code, while my timed region is empty.

The minimum values depend on the microarchitecture, the core frequency (which can change dynamically), and the TSC frequency (a fixed value close to the base frequency). The authors of the white paper only say they are on a Core i7 processor; in 2010, that means either a Nehalem or a Westmere processor.

The measurements you've copied from the paper are from Section 3.3.2, titled "Resolution with the Alternative Method." The alternative method uses `mov cr0, rax` for serialization instead of `rdtscp`, but your code is from Section 3.2.2.
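
For illustration, a write to a control register is architecturally serializing, so a kernel-mode barrier along these lines can stand in for CPUID (a sketch of the idea, not the paper's exact listing):

/* Kernel mode only: moves to control registers are serializing.
   Reading CR0 and writing the same value back serializes the
   pipeline without CPUID's variable cost. */
asm volatile(
    "mov %%cr0, %%rax\n\t"
    "mov %%rax, %%cr0\n\t"
    ::: "%rax");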

Note that `if ((end - start) < 0) {...}` is never true when `end` and `start` are unsigned integers, because the result of the subtraction is unsigned and the constant `0` is also converted to an unsigned type. Change it to `if (end < start) {...}`.
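
A minimal sketch demonstrating the pitfall (hypothetical values):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t start = 100, end = 90;   /* hypothetical: end < start */

    /* end - start wraps to a huge unsigned value, and the 0 is
       converted to uint64_t, so this condition is always false. */
    if ((end - start) < 0)
        puts("never reached");

    if (end < start)                  /* the correct check */
        puts("end < start detected");

    return 0;
}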

Hadi Brais

  • Oh right, halting the core clock to change CPU frequency can explain a huge time spike even without interrupts. It takes multiple microseconds of no core cycles for that to happen, so for an empty timed region it's a much bigger effect even than starting at idle frequency. [Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC](https://stackoverflow.com/q/45472147) – Peter Cordes Jan 19 '21 at 07:05
  • Another solution would be to measure core cycles instead of reference cycles using performance counters. – Andreas Abel Jan 19 '21 at 14:11
  • `max_deviation` can also be higher than expected due to interference from hyperthreading. The solution for this would be to disable hyperthreading. – Andreas Abel Jan 19 '21 at 14:14