Linux perf stat has a -r repeat_count option. Its output only gives you the mean and standard deviation for each HW/software event, not min/max as well.
It doesn't discard the first run as a warm-up or anything either, but it's somewhat useful in a lot of cases.
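If you do want min/max or a discarded warm-up run, one option is to time the runs yourself. The following is only a minimal sketch, not something perf provides: it measures wall-clock time (including process-creation overhead) rather than hardware event counts, and the names time_runs and summarize are made up for this example.

```python
import statistics
import subprocess
import time


def time_runs(cmd, runs=5, warmup=1):
    """Run cmd (a list of argv strings) warmup + runs times.

    Returns wall-clock times in seconds for the runs after the warm-up.
    """
    samples = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        elapsed = time.perf_counter() - start
        if i >= warmup:  # discard the first `warmup` runs
            samples.append(elapsed)
    return samples


def summarize(samples):
    """Return min, max, mean and sample standard deviation of the timings."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        # stdev needs at least two samples; report 0.0 for a single run
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Same busy-loop as the perf stat example; assumes awk is on PATH.
    times = time_runs(["awk", "BEGIN{for(i=0;i<1000000;i++){}}"], runs=5, warmup=1)
    for name, value in summarize(times).items():
        print(f"{name:>5}: {value * 1e3:.3f} ms")
```

For anything where the event counts matter (cycles, instructions, branch-misses), perf stat itself is still the tool to use; this only fills in the min/max and warm-up gap for elapsed time.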
Scroll to the right for the stddev results, like ( +- 0.13% ) for cycles. There's less variance in cycles than in task-clock, probably because CPU frequency was not fixed. (I intentionally picked a quite short run time, although with Skylake hardware P-state management and EPP=performance, the CPU should ramp up to max turbo quite quickly, even relative to a 34 ms run time. But for a CPU-bound task that's not memory-bound at all, the interpreter loop runs at a constant number of clock cycles per iteration, modulo only branch mispredictions and interrupts, so the cycle count barely depends on clock frequency while task-clock does. --all-user counts CPU events like instructions and cycles only in user-space, not inside interrupt handlers, system calls, or page-fault handlers.)
$ perf stat --all-user -r5 awk 'BEGIN{for(i=0;i<1000000;i++){}}'
Performance counter stats for 'awk BEGIN{for(i=0;i<1000000;i++){}}' (5 runs):
34.10 msec task-clock # 0.984 CPUs utilized ( +- 0.40% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
178 page-faults # 5.180 K/sec ( +- 0.42% )
139,277,791 cycles # 4.053 GHz ( +- 0.13% )
360,590,762 instructions # 2.58 insn per cycle ( +- 0.00% )
97,439,689 branches # 2.835 G/sec ( +- 0.00% )
16,416 branch-misses # 0.02% of all branches ( +- 8.14% )
0.034664 +- 0.000143 seconds time elapsed ( +- 0.41% )
awk here is just a busy-loop to give us something to measure. If you're using this to microbenchmark a loop or function, construct it to have minimal startup overhead as a fraction of total run time, so perf stat event counts for the whole run mostly reflect the code you wanted to time. Often this means building a repeat-loop into your own program, to loop over the initialized data multiple times.
See also Idiomatic way of performance evaluation? - timing very short things is hard due to measurement overhead. Carefully constructing a repeat loop that tells you something interesting about the throughput or latency of your code under test is important.
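As a rough illustration of the repeat-loop idea, here is a sketch of a program structured so the one-time setup is small compared to the repeated work. Everything in it is hypothetical (the work function, the data size, the file name bench.py); in a real case the loop body would be your code under test, and for a compiled language the same structure applies with much less interpreter startup overhead than Python has.

```python
import sys


def work(data):
    """Hypothetical code under test: sum of squares over the data."""
    total = 0
    for x in data:
        total += x * x
    return total


def main(repeats):
    data = list(range(10_000))  # one-time setup, not what we want to measure
    checksum = 0
    for _ in range(repeats):    # repeat loop: amortizes startup and setup cost
        checksum += work(data)
    return checksum


if __name__ == "__main__":
    # Repeat count comes from the command line so it can be varied per run.
    repeats = int(sys.argv[1]) if len(sys.argv) > 1 else 1000
    print(main(repeats))        # print the result so the work is actually used
```

Saved as bench.py, this could be run as perf stat --all-user -r5 python3 bench.py 1000. Running it again with a repeat count of 2000 and subtracting the event counts gives an estimate for 1000 iterations of the loop with the startup and setup cost cancelled out.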
Run-to-run variation is common, but back-to-back runs like this often vary less within the group than runs separated by the half second it takes to press up-arrow and return. Perhaps that has something to do with transparent hugepage availability, or choice of alignment? I usually see this with small microbenchmarks, so it's not a matter of the executable getting evicted from the pagecache.
(The +- range printed by perf is, I think, just one standard deviation based on the small sample size, not the full range it saw.)