I'm following this whitepaper by Intel to benchmark code execution. It uses cpuid to fence the reads of the timestamp counter, which seems to work all right.

I'm more interested in the functions preempt_disable() and local_irq_save(), which are used to prevent any interference while measuring.

When I run a benchmark like this while measuring nothing, I get an average of 24 cycles. However, around 10 of 100'000 measurements take a "long" time, i.e. tens of thousands of cycles. What is the root cause of these spikes, and how can I get rid of them?
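
To put numbers on the spikes, the samples can be summarized with a small helper like the one below. This is only a sketch and not the code from the kernel module: `struct cycle_stats`, `summarize`, and the `threshold` parameter are hypothetical names introduced here, and the threshold that separates a "normal" sample from a spike is an assumption you would pick yourself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical summary of one batch of cycle measurements. */
struct cycle_stats {
    uint64_t min;
    uint64_t max;
    uint64_t median;
    double   mean;
    size_t   outliers;   /* samples strictly above the threshold */
};

/* qsort comparator for uint64_t that avoids overflow from subtraction. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts `samples` in place and returns summary statistics.
 * `threshold` is an assumed cutoff above which a sample counts as a spike. */
static struct cycle_stats summarize(uint64_t *samples, size_t n,
                                    uint64_t threshold)
{
    struct cycle_stats s = {0};
    double sum = 0.0;

    assert(n > 0);
    qsort(samples, n, sizeof *samples, cmp_u64);

    s.min = samples[0];
    s.max = samples[n - 1];
    s.median = samples[n / 2];   /* upper median when n is even */

    for (size_t i = 0; i < n; i++) {
        sum += (double)samples[i];
        if (samples[i] > threshold)
            s.outliers++;
    }
    s.mean = sum / (double)n;
    return s;
}
```

Calling `summarize(samples, n, 1000)` on the recorded deltas would report how many measurements exceeded 1000 ticks, which makes it easier to compare runs before and after a configuration change than looking at the mean alone.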

I wrote a minimal kernel module to do the measurements; the source code can be found on GitHub.

I'm running Ubuntu 18.04.5 LTS with a 4.15.0-118-generic kernel on an Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz. I also ran the benchmark on an i9-9900 and got the same spikes there.

mohabbati
  • Not the answer, but `lfence` is recommended instead of `cpuid`; it still serializes instruction execution but is much cheaper, doesn't step on registers, and has no input registers that can affect how many cycles it takes. (It also doesn't drain the store buffer; if you want that you can use a locked instruction or mfence.) [Is LFENCE serializing on AMD processors?](https://stackoverflow.com/q/51844886) - yes, so this is safe on both AMD and Intel. See also [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627) – Peter Cordes Dec 21 '20 at 18:54
  • Could the slowdowns be when the CPU decides to change frequency? The clock stops while doing that (but the TSC of course keeps ticking). [Lost Cycles on Intel? An inconsistency between rdtsc and CPU\_CLK\_UNHALTED.REF\_TSC](https://stackoverflow.com/q/45472147) estimates how long that takes. – Peter Cordes Dec 21 '20 at 19:01
  • Hi @PeterCordes, thank you very much for the comments! Some very interesting reads; I'll go through them properly tomorrow. I gave it a quick shot while on the road, just disabling Turbo Boost, and the results look promising. Before, I got this: _Min: 22, Max: 40290, Median: 24 cycles, Mean: 24.1 cycles (std dev: 78.5)_. I am now getting much better results: _Min: 30, Max: 318, Median: 32, Mean: 32.4 (std dev: 3.6)_ – IRatherStayPrivate Dec 21 '20 at 20:09
  • Note that RDTSC doesn't even count core clock cycles; for that you need `rdpmc` (after programming a counter to count `cpu_clk_unhalted.thread` or a similar event, e.g. using `perf` / PAPI, although IDK how easy that is inside the kernel). With turbo disabled, TSC frequency is often similar to "sticker" frequency, but these days usually *not* exact, and can differ wildly on some CPUs. E.g. my i7-6700k has a 4008MHz TSC, vs. 4.0 GHz rated non-turbo frequency. – Peter Cordes Dec 21 '20 at 20:15

0 Answers