I'm trying to optimize some code, using criterion to compare, for example, the effect of adding an INLINE pragma to a function. But I'm finding that the results are not consistent between recompiles/runs.
I need to know either how to get results that are consistent across runs, so that I can compare them, or how to assess whether a benchmark is reliable or not, i.e. (I guess) how to interpret the reported details about variance, the "cost of a clock call", etc.
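For concreteness, this is the kind of change I'm trying to measure; the function below is made up purely for illustration (my real code is the actors insert/query benchmark shown further down), and the only difference between the two builds I want to compare is the pragma:

import Data.List (foldl')

-- Made-up function standing in for the one I'm experimenting on; the only
-- thing that changes between the two compiles is whether this pragma is present.
{-# INLINE sumSquares #-}
sumSquares :: Int -> Int
sumSquares n = foldl' (+) 0 (map (\x -> x * x) [1 .. n])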
Details on my particular case
This is orthogonal to my main questions above, but a couple of things might be causing the inconsistency in my particular case:
- I'm trying to benchmark IO actions using whnfIO, because the method using whnf in this example didn't work (a rough sketch of my setup follows this list)
- my code uses concurrency
- I've got a lot of tabs and crap open
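Roughly, the benchmark harness looks like the sketch below. Criterion's defaultMain, bgroup, bench, and whnfIO are the real API; insertThenQuery is just a placeholder for my actual concurrent insert-and-query action:

import Criterion.Main (defaultMain, bgroup, bench, whnfIO)

-- Placeholder for my real concurrent code: insert N items, then run M queries.
insertThenQuery :: Int -> Int -> IO Int
insertThenQuery inserts queries = return (inserts + queries)

main :: IO ()
main = defaultMain
  [ bgroup "actors"
      [ bench "insert 1000, query 1000"   $ whnfIO (insertThenQuery 1000 1000)
      , bench "insert 1000, query 100000" $ whnfIO (insertThenQuery 1000 100000)
      ]
  ]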
Example output
Both of these runs are from the same code, compiled in exactly the same way. I did the first run shown directly below, made a change and ran another benchmark, then reverted and ran the first code again, compiling each time with:
ghc --make -fforce-recomp -threaded -O2 Benchmark.hs
First run:
estimating clock resolution...
mean is 16.97297 us (40001 iterations)
found 6222 outliers among 39999 samples (15.6%)
6055 (15.1%) high severe
estimating cost of a clock call...
mean is 1.838749 us (49 iterations)
found 8 outliers among 49 samples (16.3%)
3 (6.1%) high mild
5 (10.2%) high severe
benchmarking actors/insert 1000, query 1000
collecting 100 samples, 1 iterations each, in estimated 12.66122 s
mean: 110.8566 ms, lb 108.4353 ms, ub 113.6627 ms, ci 0.950
std dev: 13.41726 ms, lb 11.58487 ms, ub 16.25262 ms, ci 0.950
found 2 outliers among 100 samples (2.0%)
2 (2.0%) high mild
variance introduced by outliers: 85.211%
variance is severely inflated by outliers
benchmarking actors/insert 1000, query 100000
collecting 100 samples, 1 iterations each, in estimated 945.5325 s
mean: 9.319406 s, lb 9.152310 s, ub 9.412688 s, ci 0.950
std dev: 624.8493 ms, lb 385.4364 ms, ub 956.7049 ms, ci 0.950
found 6 outliers among 100 samples (6.0%)
3 (3.0%) low severe
1 (1.0%) high severe
variance introduced by outliers: 62.576%
variance is severely inflated by outliers
Second run, ~3x slower:
estimating clock resolution...
mean is 51.46815 us (10001 iterations)
found 203 outliers among 9999 samples (2.0%)
117 (1.2%) high severe
estimating cost of a clock call...
mean is 4.615408 us (18 iterations)
found 4 outliers among 18 samples (22.2%)
4 (22.2%) high severe
benchmarking actors/insert 1000, query 1000
collecting 100 samples, 1 iterations each, in estimated 38.39478 s
mean: 302.4651 ms, lb 295.9046 ms, ub 309.5958 ms, ci 0.950
std dev: 35.12913 ms, lb 31.35431 ms, ub 42.20590 ms, ci 0.950
found 1 outliers among 100 samples (1.0%)
variance introduced by outliers: 84.163%
variance is severely inflated by outliers
benchmarking actors/insert 1000, query 100000
collecting 100 samples, 1 iterations each, in estimated 2644.987 s
mean: 27.71277 s, lb 26.95914 s, ub 28.97871 s, ci 0.950
std dev: 4.893489 s, lb 3.373838 s, ub 7.302145 s, ci 0.950
found 21 outliers among 100 samples (21.0%)
4 (4.0%) low severe
3 (3.0%) low mild
3 (3.0%) high mild
11 (11.0%) high severe
variance introduced by outliers: 92.567%
variance is severely inflated by outliers
I notice that if I scale the means by the "estimated cost of a clock call", the two runs come out fairly close. Is that what I should do to get a real number for comparison?
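To spell out the arithmetic I have in mind (the figures are copied from the two runs above; whether this normalization is actually legitimate is part of what I'm asking):

-- Mean benchmark time divided by the estimated cost of a clock call, per run.
main :: IO ()
main = mapM_ print
  [ ("run 1, insert 1000, query 1000",   110.8566e-3 / 1.838749e-6)  -- ~6.0e4
  , ("run 2, insert 1000, query 1000",   302.4651e-3 / 4.615408e-6)  -- ~6.6e4
  , ("run 1, insert 1000, query 100000", 9.319406    / 1.838749e-6)  -- ~5.1e6
  , ("run 2, insert 1000, query 100000", 27.71277    / 4.615408e-6)  -- ~6.0e6
  ]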