Is it preferable to use the total time taken for a canonical workload as a benchmark or count the cycles/time taken by the individual operations?

Question

I'm designing a benchmark for a critical system operation. Ideally the benchmark can be used to detect performance regressions. I'm debating between using the total time for a large workload passed into the operation or counting the cycles taken by the operation as the measurement criterion for the benchmark.

The time to run each iteration of the operation in question is fast perhaps 300-500 nanoseconds.

score 1 · Accepted Answer · answered Oct 15 '18 at 15:24

A total time is much easier to measure accurately / reliably, and the measurement overhead is irrelevant. It's what I'd recommend, as long as you're sure you can stop your compiler from optimizing across iterations of whatever you're measuring. (Check the generated asm if necessary).

If you think your runtime might be data-dependent and want to look into variation across iterations, then you might consider recording timestamps somehow. But 300 ns is only ~1k clock cycles on a 3.3GHz CPU, and recording a timestamp takes some time. So you definitely need to worry about measurement overhead.

Assuming you're on x86, raw rdtsc around each operation is pretty lightweight, but out-of-order execution can reorder the timestamps with the work. Get CPU cycle count?, and clflush to invalidate cache line via C function.

An lfence; rdtsc; lfence to stop the timing from reordering with each iteration of the workload will block out-of-order execution of the steps of the workload, distorting things. (The out-of-order execution window on Skylake is a ROB size of 224 uops. At 4 per clock that's a small fraction of 1k clock cycles, but in lower-throughput code with stalls for cache misses there could be significant overlap between independent iterations.)

Any standard timing functions like C++ std::chrono will normally call library functions that ultimately use rdtsc, but with many extra instructions. Or worse, will make an actual system call taking well over a hundred clock cycles to enter/leave the kernel, and more with Meltdown+Spectre mitigation enabled.

However, one thing that might work is using Intel-PT (https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing) to record timestamps on taken branches. Without blocking out-of-order exec at all, you can still get timestamps on when the loop branch in your repeat loop executed. This may well be independent of your workload and able to run soon after its issued into the out-of-order part of the core, but that can only happen a limited distance ahead of the oldest not-yet-retired instruction.

Is it preferable to use the total time taken for a canonical workload as a benchmark or count the cycles/time taken by the individual operations?

1 Answers1