I'm trying to optimize some code, using criterion to compare, for example, the effect of adding an INLINE pragma to a function. But I'm finding that the results are not consistent between recompiles/runs.
I need to know either how to get results that are consistent across runs, so that I can compare them, or how to assess whether a benchmark is reliable or not, i.e. (I guess) how to interpret the reported details about variance, the "cost of a clock call", etc.
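For concreteness, this is the kind of change I'm trying to measure; the function below is made up purely for illustration (my real code is the actors insert/query benchmark shown further down), and the only difference between the two builds I want to compare is the pragma:

import Data.List (foldl')

-- Made-up function standing in for the one I'm experimenting on; the only
-- thing that changes between the two compiles is whether this pragma is present.
{-# INLINE sumSquares #-}
sumSquares :: Int -> Int
sumSquares n = foldl' (+) 0 (map (\x -> x * x) [1 .. n])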
Details on my particular case
This is orthogonal to my main questions above, but a couple of things might be causing the inconsistency in my particular case:
- I'm trying to benchmark IO actions using whnfIO, because the method using whnf in this example didn't work (a rough sketch of my setup follows this list)
- my code uses concurrency
- I've got a lot of tabs and crap open
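Roughly, the benchmark harness looks like the sketch below. Criterion's defaultMain, bgroup, bench, and whnfIO are the real API; insertThenQuery is just a placeholder for my actual concurrent insert-and-query action:

import Criterion.Main (defaultMain, bgroup, bench, whnfIO)

-- Placeholder for my real concurrent code: insert N items, then run M queries.
insertThenQuery :: Int -> Int -> IO Int
insertThenQuery inserts queries = return (inserts + queries)

main :: IO ()
main = defaultMain
  [ bgroup "actors"
      [ bench "insert 1000, query 1000"   $ whnfIO (insertThenQuery 1000 1000)
      , bench "insert 1000, query 100000" $ whnfIO (insertThenQuery 1000 100000)
      ]
  ]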
Example output
Both of these runs are from the same code, compiled in exactly the same way. I did the first run shown directly below, made a change and ran another benchmark, then reverted and ran the first code again, compiling each time with:
ghc --make -fforce-recomp -threaded -O2 Benchmark.hs
First run:
estimating clock resolution...
mean is 16.97297 us (40001 iterations)
found 6222 outliers among 39999 samples (15.6%)
6055 (15.1%) high severe
estimating cost of a clock call...
mean is 1.838749 us (49 iterations)
found 8 outliers among 49 samples (16.3%)
3 (6.1%) high mild
5 (10.2%) high severe
benchmarking actors/insert 1000, query 1000
collecting 100 samples, 1 iterations each, in estimated 12.66122 s
mean: 110.8566 ms, lb 108.4353 ms, ub 113.6627 ms, ci 0.950
std dev: 13.41726 ms, lb 11.58487 ms, ub 16.25262 ms, ci 0.950
found 2 outliers among 100 samples (2.0%)
2 (2.0%) high mild
variance introduced by outliers: 85.211%
variance is severely inflated by outliers
benchmarking actors/insert 1000, query 100000
collecting 100 samples, 1 iterations each, in estimated 945.5325 s
mean: 9.319406 s, lb 9.152310 s, ub 9.412688 s, ci 0.950
std dev: 624.8493 ms, lb 385.4364 ms, ub 956.7049 ms, ci 0.950
found 6 outliers among 100 samples (6.0%)
3 (3.0%) low severe
1 (1.0%) high severe
variance introduced by outliers: 62.576%
variance is severely inflated by outliers
Second run, ~3x slower:
estimating clock resolution...
mean is 51.46815 us (10001 iterations)
found 203 outliers among 9999 samples (2.0%)
117 (1.2%) high severe
estimating cost of a clock call...
mean is 4.615408 us (18 iterations)
found 4 outliers among 18 samples (22.2%)
4 (22.2%) high severe
benchmarking actors/insert 1000, query 1000
collecting 100 samples, 1 iterations each, in estimated 38.39478 s
mean: 302.4651 ms, lb 295.9046 ms, ub 309.5958 ms, ci 0.950
std dev: 35.12913 ms, lb 31.35431 ms, ub 42.20590 ms, ci 0.950
found 1 outliers among 100 samples (1.0%)
variance introduced by outliers: 84.163%
variance is severely inflated by outliers
benchmarking actors/insert 1000, query 100000
collecting 100 samples, 1 iterations each, in estimated 2644.987 s
mean: 27.71277 s, lb 26.95914 s, ub 28.97871 s, ci 0.950
std dev: 4.893489 s, lb 3.373838 s, ub 7.302145 s, ci 0.950
found 21 outliers among 100 samples (21.0%)
4 (4.0%) low severe
3 (3.0%) low mild
3 (3.0%) high mild
11 (11.0%) high severe
variance introduced by outliers: 92.567%
variance is severely inflated by outliers
I notice that if I scale the means by the "estimated cost of a clock call", the two runs come out fairly close. Is that what I should do to get a real number for comparison?
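To spell out the arithmetic I have in mind (the figures are copied from the two runs above; whether this normalization is actually legitimate is part of what I'm asking):

-- Mean benchmark time divided by the estimated cost of a clock call, per run.
main :: IO ()
main = mapM_ print
  [ ("run 1, insert 1000, query 1000",   110.8566e-3 / 1.838749e-6)  -- ~6.0e4
  , ("run 2, insert 1000, query 1000",   302.4651e-3 / 4.615408e-6)  -- ~6.6e4
  , ("run 1, insert 1000, query 100000", 9.319406    / 1.838749e-6)  -- ~5.1e6
  , ("run 2, insert 1000, query 100000", 27.71277    / 4.615408e-6)  -- ~6.0e6
  ]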