
Suppose you're trying to optimize a function and using some benchmarking framework (like Google Benchmark) for measurement. You run the benchmarks on the original function 3 times and see average wall clock time/CPU times of 100 ms, 110 ms, 90 ms. Then you run the benchmarks on the "optimized" function 3 times and see 80 ms, 95 ms, 105 ms. (I made these numbers up). Do you conclude that your optimizations were successful?

Another problem I often run into is that I'll go do something else, run the benchmarks again later in the day, and get numbers that differ from the earlier ones by more than the delta between the original and optimized versions (say, 80 ms, 85 ms, 75 ms for the original function).

I know there are statistical methods to determine whether the improvement is "significant". Do software engineers actually use these formal calculations in practice?

I'm looking for some kind of process to follow when optimizing code.

bun9

2 Answers


Do you conclude that your optimizations were successful?

No. Three runs are not enough, especially given the huge variation and the fact that the timings of the two groups overlap once merged and sorted.

For timings this small, the first run should be discarded and at least dozens of runs should be performed. I would personally use at least hundreds of runs.

Do software engineers actually use these formal calculations in practice?

Very few developers do advanced statistical analysis. Something very formal is often not needed when the gap between the before/after timings of the target optimization is huge and the variation within each group is small.

For example, if your program is twice as fast as before with a min-max variation of <5%, then you can quite safely say that the optimization is successful. That said, this can occasionally be wrong due to unexpected external factors (though it is very rare when the gap is so big).

If the result is not obvious, then you need some basic statistics: compute the mean, median and standard deviation of the timings, remove the first run, interleave the runs and use many of them (at least dozens). The distribution of the timings almost always follows a normal distribution due to the central limit theorem. It is sometimes a mixture distribution due to threshold effects (e.g. caching). If you see some outliers in the timings, plotting the values makes this easy to spot.
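
For example, a minimal sketch of this bookkeeping in Python (the timing lists below are made-up placeholders; in practice you would export the raw per-run timings from your benchmarking tool):

```python
import statistics

# Made-up numbers for illustration -- replace with your own raw timings (in seconds).
original  = [0.100, 0.110, 0.090, 0.095, 0.102, 0.098]
optimized = [0.120, 0.080, 0.095, 0.105, 0.083, 0.088]

def summarize(label, samples):
    samples = samples[1:]                      # drop the first (warm-up) run
    mean = statistics.mean(samples)
    median = statistics.median(samples)
    sd = statistics.stdev(samples)
    print(f"{label}: mean={mean*1e3:.1f} ms  median={median*1e3:.1f} ms  sd={sd*1e3:.1f} ms")

summarize("original ", original)
summarize("optimized", optimized)
```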

If there are threshold effects, then you need to apply a more advanced statistical analysis, but this is complex to do and generally not an expected behaviour. It is usually a sign that the benchmark is biased, that there is a bug, or that there is a complex effect you will have to consider during the analysis of the results anyway. Thus, I strongly advise you to fix/mitigate the problem before analysing the results in that case.

Assuming the timings follow a normal distribution, you can just check that the median is close to the mean and that the standard deviation is small compared to the gap between the means of the two groups.
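
One informal way to encode that check (reusing the same illustrative timing lists, warm-up runs already removed; the 3x threshold is an arbitrary illustration, not a hard rule):

```python
import statistics

original  = [0.110, 0.090, 0.095, 0.102, 0.098]   # seconds, warm-up run removed
optimized = [0.080, 0.095, 0.105, 0.083, 0.088]

gap = statistics.mean(original) - statistics.mean(optimized)
spread = max(statistics.stdev(original), statistics.stdev(optimized))

# Crude rule of thumb: the gap between the means should dwarf the
# within-group spread before you claim a win.
print(f"gap = {gap*1e3:.1f} ms, spread = {spread*1e3:.1f} ms")
if gap > 3 * spread:
    print("Looks like a real improvement.")
else:
    print("Too close to call -- gather more runs or use a formal test.")
```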

A more formal way to do that is to compute a Student t-test and its associated p-value, and check the significance of the p-value (e.g. <5%). If there are more than two groups, an ANOVA can be used. If you are unsure about the distribution, you can apply non-parametric statistical tests such as the Wilcoxon and Kruskal-Wallis tests (note that the statistical power of these tests is not the same). In practice, doing such a formal analysis is time-consuming and generally not much more useful than a naive basic check (using the mean and standard deviation), unless your modification impacts a lot of users or you plan to write research papers.
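
As a sketch of what such a test looks like in practice (assuming SciPy is installed and the two lists hold the raw timings with warm-up runs removed):

```python
from scipy import stats

original  = [0.110, 0.090, 0.095, 0.102, 0.098, 0.104]   # seconds
optimized = [0.095, 0.105, 0.083, 0.088, 0.091, 0.086]

# Welch's t-test: does not assume equal variances, but does assume the
# timings within each group are roughly normally distributed.
t_stat, p_t = stats.ttest_ind(original, optimized, equal_var=False)

# Mann-Whitney U (the unpaired Wilcoxon rank-sum test): no normality
# assumption, at the cost of some statistical power.
u_stat, p_u = stats.mannwhitneyu(original, optimized, alternative="two-sided")

# For more than two groups, stats.f_oneway (ANOVA) or stats.kruskal
# (Kruskal-Wallis) can be used instead.
print(f"Welch t-test p-value:  {p_t:.4f}")
print(f"Mann-Whitney p-value:  {p_u:.4f}")
# A p-value below your chosen threshold (e.g. 0.05) suggests the observed
# difference is unlikely to be due to chance alone.
```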

Keep in mind that using a good statistical analysis does not prevent biased benchmarks. You need to minimize the external factors that can cause biased results. One frequent bias is frequency scaling: the first benchmark can be faster than the second because of turbo boost, or slower because the processor takes some time to reach a high frequency. Caches also play a huge role in benchmark biases. There are many other factors that can cause biases in practice, such as compiler/runtime versions, environment variables, configuration files, OS/driver updates, memory alignment, OS paging (especially on NUMA systems), the hardware (e.g. thermal throttling), software bugs (it is not rare to find bugs by analysing strange performance behaviours), etc.

As a result, it is critical to make benchmarks as reproducible as possible: fix versions, report the environment parameters, and possibly run the benchmarks in a sandbox if you are paranoid and it does not affect the timings too much. Software like Nix or Spack helps with packaging, and containers like LXD or Docker can help provide a more reproducible environment.

Many big software teams use automated benchmarking to check for performance regressions. Tools can run the benchmarks properly and do the statistical analysis for you on a regular basis. A good example is the NumPy team, which uses a package called Airspeed Velocity (see the results). The PyPy team also designed their own benchmarking tool. The Linux kernel also has benchmarking suites to check for regressions (e.g. PTS), and many companies focused on performance have such automated benchmarking tools (often home-made). There are many existing tools for that.
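
To give a rough idea of what such a tool looks like from the user's side, here is a sketch of an Airspeed Velocity benchmark file (the file name and the sorting workload are purely illustrative; asv's convention, as far as I recall, is that functions and methods prefixed with `time_` are discovered and timed automatically, with `setup` run beforehand):

```python
# benchmarks/bench_sorting.py -- illustrative asv benchmark suite
import numpy as np

class SortSuite:
    def setup(self):
        # Fixed seed so every run sorts the same data.
        rng = np.random.default_rng(0)
        self.data = rng.random(100_000)

    def time_numpy_sort(self):
        np.sort(self.data)

    def time_python_sorted(self):
        sorted(self.data.tolist())
```

asv then runs these benchmarks across commits and tracks the results over time, which is exactly the kind of regression checking described above.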

For more information about this topic, take a look at the great Performance Matters presentation by Emery Berger.

Jérôme Richard
  • *Assuming the timings follow a normal distribution* – which is often a bad assumption. Another strategy is to take the best time over multiple tests, if you expect the variation not to be data-dependent but instead caused by varying degrees of resource competition from other processes (e.g. on a cloud VM). Of course, some code may be more sensitive to disturbance than other code (e.g. it needs a larger footprint to stay in cache, and loses a lot when something disturbs it). Things that are expected to encounter page faults and the like in real use are a harder problem. – Peter Cordes May 19 '22 at 17:32
  • Also re: common pitfalls in benchmarking, see [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes May 19 '22 at 17:32
  • The best time is not always a good strategy when the distribution is a mixture with a huge gap, as it only considers the component with the best timings and completely ignores the others. This is a problem since the optimization can affect only the slowest components (e.g. a network optimization in a system that has caches, or a LUT-based optimization in an application where the LUT is not always in cache due to effects like cache thrashing). This is why at least the mean+median+sd of the timings matter (or similar). Having the best time in addition is still a good idea (but not only that). – Jérôme Richard May 19 '22 at 17:53
  • IMO there is no simple metric when the distribution is as complex as a mixture with gaps between components. This is especially true when the two groups have different distributions. In that case, using the p-value of the Wilcoxon test is certainly better than any simple naive metric. A plot is IMO even better since it does not hide the complexity of the results and introduces fewer biases, though it is not a good basis for a decision on its own (i.e. "eyeballing" stats). Using a scientific approach combined with monitoring specific hardware counters is certainly even better. – Jérôme Richard May 19 '22 at 18:03
  • Sure, you could easily discover that one method is better under good conditions, where it has a large cache all to itself and/or low-latency communication between threads since they all ran on the same socket or CCX. But maybe another method is better under worse conditions, not needing as large a cache footprint but doing more computation or something. – Peter Cordes May 19 '22 at 18:15
  • Even further in that direction are pitfalls of microbenchmarking like training the branch predictors just for one thing in a repeat loop, when really it only runs between other work in the real code. And microbenchmarking can make big unroll factors look good, too, without I-cache pressure from having to run the rest of a large program. Also LUTs staying hot in cache. It's easy to end up optimizing for the huge-problem case at the expense of quick calls for short problems if you're only microbenchmarking in a process by itself, not looking at overall cache misses in situ. – Peter Cordes May 19 '22 at 18:19
  • Indeed, I fully agree. Sadly, this is quite frequent in practice, resulting in more complex/bigger code and a huge waste of time for nothing. It is actually often even worse, since future readers of the code will think the code has been carefully optimized and will often not even try to check whether the optimizations are actually useful! – Jérôme Richard May 19 '22 at 18:48
  • Thanks for the input, Jérôme and Peter. What should you do when you suspect a microbenchmark is producing inaccurate results because of the nature of the microbenchmark environment (less I-cache pressure, better trained branch predictor, etc)? For example, one pitfall from the link above is not warming up caches, but maybe in production, this function is mostly run on cold caches, so you do want to benchmark using cold caches on purpose? – bun9 May 20 '22 at 02:26
  • If you instead benchmark in a "production" environment, there is probably too much noise to draw any conclusions, but if you stick to the microbenchmark, the results are inaccurate anyway. Maybe you introduce some "effects" in the microbenchmark, like cache flushing, to mimic a more realistic environment, but how often do you flush? How much of the cache is flushed? Etc. Seems to be no efficient way to determine these in large distributed systems where there's a million things going on at any moment. – bun9 May 20 '22 at 02:27
  • @bun9 If you suspect a benchmark is biased, then you should investigate whether this is actually the case. The best solution I know of is checking *hardware counters*. If the result is negative, you can be more confident about your benchmark. If the result is positive, you have learned about a bias to avoid and how to check for it. You generally win in both cases. This is how the scientific method works: you gain confidence by failing to prove your method wrong, not by confirming what you think you know. – Jérôme Richard May 21 '22 at 11:32
  • @bun9 In my domain (high-performance computing), we learn not to do micro-benchmarking unless we really want to study a *specific* case (understand a weird effect), but instead to benchmark *mini-applications*. Benchmarking a full application is indeed often not a good idea (flexibility, reproducibility and heavy noise are often a problem). A mini-app designed to reproduce a simplified kernel of the whole application is often more reliable than a micro-benchmark. This saves you from spending a lot of time understanding weird effects that only happen in your micro-benchmarks. – Jérôme Richard May 21 '22 at 11:41
  • @bun9 You generally do not need a production environment. In fact, it will likely not be flexible enough and be too complex to fully understand/master. However, you need to run your benchmark in a similar environment (e.g. same processor architecture, similar number of cores, similar RAM/network speed, same major software versions, etc.). The benchmarking environment should be simple and controlled. The key in benchmarking is to reduce *confounding factors*, which are far more problematic than mere noise, as they bias your conclusion whereas noise does not (assuming the above rules are fulfilled). – Jérôme Richard May 21 '22 at 11:50
  • @bun9 Note that benchmarking black boxes results in a lot of bias and thus wrong conclusions. You need to extract useful metrics for the thing you want to study, not just analyse the wall-clock time for some input. The metrics depend on what makes your application slow. For example, it can be the number of cache misses, RFOs, memory throughput, etc. This requires a good understanding of the target application and the target system, but it is AFAIK the only way to get relatively accurate results. There is no free lunch. – Jérôme Richard May 21 '22 at 12:03

Rule of Thumb

  1. Minimum(!) of each series => 90ms vs 80ms
  2. Estimate noise => ~ 10ms
  3. Pessimism => It probably didn't get any slower.

Not happy yet?

  1. Take more measurements. (~13 runs each)

  2. Interleave the runs. (Don't measure 13x A followed by 13x B.)

    Ideally you always randomize whether you run A or B next (scientific: randomized trial), but it's probably overkill. Any source of error should affect each variant with the same probability. (Like the CPU building up heat over time, or a background task starting after run 11.)

  3. Go back to step 1.

Still not happy? Time to admit that you've been nerd-sniped. The difference, if it exists, is so small that you can't even measure it. Pick the more readable variant and move on. (Or alternatively, lock your CPU frequency, isolate a core just for the test, quiet down your system...)
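
As a sketch of the interleaved, randomized procedure above (the two workloads below are placeholders standing in for your real A and B variants):

```python
import random
import time

def variant_a():
    # Placeholder for the original code.
    return sum(i * i for i in range(100_000))

def variant_b():
    # Placeholder for the "optimized" code.
    total = 0
    for i in range(100_000):
        total += i * i
    return total

def measure(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

variants = {"A": variant_a, "B": variant_b}
runs = {name: [] for name in variants}

# ~13 runs of each, in random order, so that slow drifts (CPU heating up,
# a background task starting halfway through) hit both variants equally.
schedule = [name for name in variants for _ in range(13)]
random.shuffle(schedule)
for name in schedule:
    runs[name].append(measure(variants[name]))

for name, times in runs.items():
    print(f"{name}: min = {min(times) * 1e3:.2f} ms over {len(times)} runs")
```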

Explanation

  1. Minimum: Many people (and even tools) take the average, but the minimum is statistically more stable. There is a lower limit to how fast your benchmark can run on given hardware, but no upper limit to how much it can be slowed down by other programs. Also, taking the minimum automatically drops the initial "warm-up" run.

  2. Noise: Apply common sense and just glance over the numbers. If you do look at the standard deviation, treat it with great skepticism! A single outlier will influence it so much that it becomes nearly useless. (The distribution is usually not normal.)

  3. Pessimism: You were really clever to find this optimization, so you really want the optimized version to be faster! If it looks better just by chance, you will believe it. (You knew it!) So if you care about being correct, you must counter this tendency.

Disclaimer

Those are just basic guidelines. Worst-case latency is relevant in some applications (smooth animations or motor control), but it will be harder to measure. It's easy (and fun!) to optimize something that doesn't matter in practice. Instead of wondering whether your 1% gain is statistically significant, try something else. Measure the full program including OS overhead. Comment out code, or run work twice, just to check whether optimizing it might be worth it.

maxy
  • Won't interleaving the runs cause every run to be cold? (Assuming the optimized version is sufficiently different). That's why I've not been interleaving on purpose. – bun9 May 20 '22 at 02:05
  • Yes, it's possible. You could also interleave A,A,A,B,B,B,A,A,A,... or add a non-measured warmup phase. The point is that you don't get fooled by anything else affecting your PC's performance over seconds or days. – maxy May 20 '22 at 07:14
  • @bun9: If they take the same input data format, then no, you can run them on the same data so it stays hot. If you mean branch predictors, then yes, but that's realistic unless your real use-case involves calling the thing you're benchmarking repeatedly in a tight loop, then going back to other work for a while. (And if you mean I-cache and uop-cache, usually both versions of something you're microbenchmarking will fit into cache at once, at least I-cache, and their hot loops will thus quickly get back into the uop cache.) – Peter Cordes May 20 '22 at 12:52
  • @bun9: See also [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) re: pitfalls of running one then the other, given the possibility of insufficient warm-up of everything. Also see Jérôme's answer here re: cases where just taking the minimum time isn't appropriate, e.g. when something just barely fits in cache, any disturbance can make it a lot slower, and real-world use cases often won't be the best case. Or any other case where the best case depends on more than just not being interrupted and the rest of the system being idle. – Peter Cordes May 20 '22 at 12:55