How to accurately measure performance of sorting algorithms

Question

I have a bunch of sorting algorithms in C I wish to benchmark. I am concerned regarding good methodology for doing so. Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method. How do I minimize the effect of said variables on the benchmark's results?

To give you a few examples, I've considered multiple implementations in two different languages to adjust for the first two variables. Moreover, I could compile the code with different compilers using fairly mundane (and specified) arguments. Now, I'm going to be running the test on my machine, which features turbo boost and whatnot and often boosts a core running stuff to the moon. Of course I will be disabling that, doing multiple runs, and likely taking their mean completion time to adjust for that as well. Regarding the input data, I will be taking different array sizes, from very small to relatively large. I do not know what the increments should ideally be, or what the range of the elements should be. Also, I presume duplicate elements should be allowed.

I know that theoretical analysis of algorithms accounts for all of these methods, but it is crucial that I complement my study with actual benchmarks. How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected? I'm comfortable with the technologies I'm working with, less so with strict methodology for studying a topic. Thank you.

Peter Cordes · Accepted Answer · 2021-06-30T17:50:31.490

You can't benchmark abstract algorithms, only specific implementations of them, compiled with specific compilers running on specific machines.

Choose a couple of different relevant compilers and machines (e.g. a Haswell, Ice Lake, and/or Zen2, and an Apple M1 if you can get your hands on one, and/or an AArch64 cloud server) and measure your real implementations. If you care about in-order CPUs like ARM Cortex-A53, measure on one of those, too. (Simulation with gem5 or similar performance simulators might be worth trying. Also possibly relevant are low-power microarchitectures like Intel Silvermont, whose out-of-order window is much smaller but which also have a shorter pipeline and therefore a smaller branch-mispredict penalty.)

If some algorithm allows a useful micro-optimization in the source, or that a compiler finds, that's a real advantage of that algorithm.

Compile with options you'd use in practice for the use-cases you care about, like clang -O3 -march=native, or just -O2.

Benchmarking on cloud servers makes it hard / impossible to get an idle system, unless you pay a lot for a huge instance, but modern AArch64 servers are relevant and may have different ratios of memory bandwidth vs. branch mispredict costs vs. cache sizes and bandwidths.

(You might well find that the same code is the fastest sorting implementation on all or most of the systems you test on.)

Re: sizes: yes, a variety of sizes would be good.

You'll normally want to test with random data, perhaps always generated from the same PRNG seed so you're sorting the same data every time.
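A minimal sketch of reproducible input generation, assuming `int` elements. It uses a small self-contained generator (splitmix64) instead of `rand()`, so the same seed gives the same data regardless of the C library; the function names `splitmix64_next` and `fill_random` are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of splitmix64. Any decent PRNG would do; the point is that
   the seed alone determines the whole sequence. */
uint64_t splitmix64_next(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Fill arr[0..n) with pseudo-random non-negative ints derived only
   from `seed` (hypothetical helper, assumes int is at least 32 bits). */
void fill_random(int *arr, size_t n, uint64_t seed)
{
    uint64_t state = seed;
    for (size_t i = 0; i < n; i++) {
        /* Keep the top 31 bits so the value always fits in an int. */
        arr[i] = (int)(splitmix64_next(&state) >> 33);
    }
}
```

Calling `fill_random(arr, n, 42)` before every run of every algorithm means they all sort identical data. Reducing the values modulo a small number would be one way to get many duplicates, if you want to test that case.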

You may also want to test some unusual cases like already-sorted or almost-sorted, because algorithms that are extra fast for those cases are useful.
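One possible way to build those inputs, as a sketch. "Almost sorted" is taken here to mean "sorted, then a few random pairs swapped", which is only one of several reasonable definitions; `fill_sorted` and `fill_almost_sorted` are hypothetical helper names.

```c
#include <stddef.h>

/* Fill arr[0..n) with 0, 1, ..., n-1 (already sorted, no duplicates). */
void fill_sorted(int *arr, size_t n)
{
    for (size_t i = 0; i < n; i++)
        arr[i] = (int)i;
}

/* Start from sorted data, then swap `swaps` pseudo-randomly chosen
   pairs. A small 64-bit LCG picks the positions, so the result
   depends only on n, swaps and seed. */
void fill_almost_sorted(int *arr, size_t n, size_t swaps,
                        unsigned long long seed)
{
    fill_sorted(arr, n);
    if (n < 2)
        return;

    unsigned long long state = seed;
    for (size_t s = 0; s < swaps; s++) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        size_t i = (size_t)((state >> 33) % n);
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        size_t j = (size_t)((state >> 33) % n);

        int tmp = arr[i];
        arr[i] = arr[j];
        arr[j] = tmp;
    }
}
```

The number of swaps is a knob you choose; a handful of swaps in a large array approximates "sorted data with a few recent edits".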

If you care about sorting things other than integers, you might want to test with structs of different sizes, with an int key as a member. Or a comparison function that does some amount of work, if you want to explore how sorts do with a compare function that isn't as simple as just one compare machine instruction.
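As a rough illustration of both ideas, here is a hypothetical record type with an `int` key plus a payload, and two `qsort`-style comparison functions: a cheap one and one that does extra work on ties. The struct layout, `PAYLOAD_BYTES` and the function names are assumptions made for the example.

```c
#include <string.h>

/* Change PAYLOAD_BYTES to test how element size affects each sort. */
#define PAYLOAD_BYTES 60

struct record {
    int key;
    unsigned char payload[PAYLOAD_BYTES];
};

/* Cheap comparison: key only. Written without subtraction so it
   cannot overflow for large-magnitude keys. */
int compare_by_key(const void *a, const void *b)
{
    const struct record *ra = a;
    const struct record *rb = b;
    return (ra->key > rb->key) - (ra->key < rb->key);
}

/* More expensive comparison: key first, then payload bytes on ties.
   With many duplicate keys this does noticeably more work per call. */
int compare_key_then_payload(const void *a, const void *b)
{
    const struct record *ra = a;
    const struct record *rb = b;
    int c = compare_by_key(a, b);
    if (c != 0)
        return c;
    return memcmp(ra->payload, rb->payload, PAYLOAD_BYTES);
}
```

Either function can be passed to the standard `qsort`, e.g. `qsort(records, count, sizeof records[0], compare_by_key)`, or to your own implementations if they take the same kind of callback.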

As always with microbenchmarking, there are many pitfalls around warm-up of arrays (page faults), CPU frequency, and more; see the Stack Overflow Q&A "Idiomatic way of performance evaluation?".
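A sketch of a timing helper that tries to avoid the two pitfalls named above, assuming a POSIX system with `clock_gettime`. The `sort_fn` signature, `time_sort` and the `qsort_ints` stand-in are all hypothetical; substitute your own sort functions.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime in <time.h> */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature for the sorts under test. */
typedef void (*sort_fn)(int *arr, size_t n);

/* Reading the result into a volatile keeps the compiler from
   treating the sorted buffer as dead. */
static volatile int benchmark_sink;

static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Stand-in sort so the sketch is complete; replace with your own. */
void qsort_ints(int *arr, size_t n)
{
    qsort(arr, n, sizeof *arr, compare_ints);
}

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Returns the seconds taken by one timed sort of a private copy of
   input[0..n), or -1.0 if allocation fails. */
double time_sort(sort_fn sort, const int *input, size_t n, int warmup_runs)
{
    assert(sort != NULL && input != NULL && n > 0 && warmup_runs >= 0);
    int *work = malloc(n * sizeof *work);
    if (work == NULL)
        return -1.0;

    /* Untimed runs: fault in the buffer's pages and give the CPU
       time to ramp up its clock before anything is measured. */
    for (int i = 0; i < warmup_runs; i++) {
        memcpy(work, input, n * sizeof *work);
        sort(work, n);
    }

    /* Restore the unsorted data outside the timed region. */
    memcpy(work, input, n * sizeof *work);

    double start = now_seconds();
    sort(work, n);
    double elapsed = now_seconds() - start;

    benchmark_sink = work[n / 2];
    free(work);
    return elapsed;
}
```

For example, `time_sort(qsort_ints, data, n, 3)` does three untimed runs and then one timed run. Note that warming up on the very same data can also train the branch predictors on it for small arrays, which may or may not be what you want to measure.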

Re: "taking their mean completion time":

You might want to discard high outliers, or take the median, which will have that effect for you. A high outlier usually means "something happened" during that run to disturb it. If you're running the same code on the same data, you can often expect the same performance. (Randomization of code / stack addresses with page granularity usually doesn't affect whether branches alias each other in predictors, or data-cache conflict misses, but if you're re-compiling, tiny changes in one part of the code can change the performance of other code via effects like that.)

If you're trying to see how it would run when it has the machine to itself, you don't want to consider runs where something else interfered. If you're trying to benchmark under "real world" cloud server conditions, or with other threads doing other work in a real program, that's different and you'd need to come up with realistic other loads that use some amount of shared resources like L3 footprint and memory bandwidth.
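A small sketch of taking the median of several run times, assuming the times are already collected in an array of `double`; `median_time` is a name invented for this example.

```c
#include <stddef.h>
#include <stdlib.h>

static int compare_doubles(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts times[0..count) in place and returns the median.
   count must be at least 1. A single disturbed run (a high outlier)
   shifts the mean a lot but barely moves the median. */
double median_time(double *times, size_t count)
{
    qsort(times, count, sizeof *times, compare_doubles);
    if (count % 2 == 1)
        return times[count / 2];
    return 0.5 * (times[count / 2 - 1] + times[count / 2]);
}
```

With run times of 5, 1, 100, 2 and 3 (in any unit), the mean is 22.2 while the median is 3, which is much closer to what an undisturbed run looks like.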

One thing I generally also do (especially for smaller tests) is copy the initialized memory to be sorted into N other copies and sort them one after another, instead of retiming / reinitializing after each iteration. You can't really benchmark the same region multiple times in a loop, because for many algorithms (all?) sorted vs. random vs. other input will perform differently. Generally on Intel, N < LSD threshold (22 ish), and if possible N small enough so that N * region size fits in the same cache level as region size. - Noah, Jul 01 '21 at 20:38
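The approach in this comment could look roughly like the following sketch, again assuming POSIX `clock_gettime`. The names `time_sort_copies`, `sort_fn` and the `qsort_ints` stand-in are hypothetical, and the choice of how many copies to use is left to the caller.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime in <time.h> */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature for the sorts under test. */
typedef void (*sort_fn)(int *arr, size_t n);

static volatile int benchmark_sink;   /* defeats dead-code elimination */

static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Stand-in sort so the sketch is complete; replace with your own. */
void qsort_ints(int *arr, size_t n)
{
    qsort(arr, n, sizeof *arr, compare_ints);
}

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Make `copies` identical unsorted copies of input[0..n), sort them
   back to back inside one timed region, and return the average
   seconds per sort (or -1.0 if allocation fails). The sketch does not
   guard against copies * n overflowing size_t. */
double time_sort_copies(sort_fn sort, const int *input, size_t n,
                        size_t copies)
{
    assert(sort != NULL && input != NULL && n > 0 && copies > 0);
    int *work = malloc(copies * n * sizeof *work);
    if (work == NULL)
        return -1.0;

    /* All copying happens before the timer starts. */
    for (size_t c = 0; c < copies; c++)
        memcpy(work + c * n, input, n * sizeof *work);

    double start = now_seconds();
    for (size_t c = 0; c < copies; c++)
        sort(work + c * n, n);
    double elapsed = now_seconds() - start;

    benchmark_sink = work[(copies - 1) * n];
    free(work);
    return elapsed / (double)copies;
}
```

Timing many small sorts in one region also makes the timer overhead and resolution matter less than timing each tiny sort separately.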

score 2 · Answer 2 · edited Jun 30 '21 at 21:18

"Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method."

Let's look at this from a very different perspective - how to present information to humans.

With 2 variables you get a nice 2-dimensional grid of results, maybe like this:

         A = 1        A = 2
B = 1    4 seconds    2 seconds
B = 2    6 seconds    3 seconds

This is easy to display and easy for humans to understand and draw conclusions from (e.g. from my silly example table it's trivial to make 2 very different observations: "A=2 is twice as fast as A=1 (regardless of B)" and "B=1 is faster than B=2 (regardless of A)").

With 3 variables you get a 3-dimensional grid of results, and with N variables you get an N-dimensional grid of results. Humans struggle with "3-dimensional data on a 2-dimensional screen", and more dimensions become a disaster. You can mitigate this a little by "peeling off" a dimension (e.g. instead of trying to present a 3D grid of results you could show multiple 2D grids), but that doesn't help humans much.

Your primary goal is to reduce the number of variables.

To reduce the number of variables:

a) Determine how important each variable is for what you intend to observe (e.g. "which algorithm" will be extremely important and "which language" will be less important).

b) Merge variables based on importance and "logical grouping". For example, you might get three "lower importance" variables (language, compiler, compiler options) and merge them into a single "language+compiler+options" variable.

Note that it's very easy to overlook a variable. For example, you might benchmark "algorithm 1" on one computer and benchmark "algorithm 2" on an almost identical computer, but overlook the fact that (even though both benchmarks used identical languages, compilers, compiler options and CPUs) one computer has faster RAM chips, and overlook "RAM speed" as a possible variable.

Your secondary goal is to reduce the number of values each variable can have.

You don't want massive table/s with 12345678 million rows; and you don't want to spend the rest of your life benchmarking to generate such a large table.

To reduce the number of values each variable can have:

a) Figure out which values matter most

b) Select the right number of values in order of importance (and ignore/skip all other values)

For example, if you merged three "lower importance" variables (language, compiler, compiler options) into a single variable; then you might decide that 2 possibilities ("C compiled by GCC with -O3" and "C++ compiled by MSVC with -Ox") are important enough to worry about (for what you're intending to observe) and all of the other possibilities get ignored.
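To make the cost of extra variables and values concrete, here is a tiny sketch that counts how many result cells a benchmark grid has: the product of the number of values chosen for each variable. The function name `grid_cells` and the counts in the note below are made up for illustration.

```c
#include <stddef.h>

/* Number of cells in the results grid: the product of how many
   values were chosen for each variable. An empty list of variables
   gives a single cell. */
size_t grid_cells(const size_t *values_per_variable, size_t variable_count)
{
    size_t cells = 1;
    for (size_t i = 0; i < variable_count; i++)
        cells *= values_per_variable[i];
    return cells;
}
```

With, say, 5 algorithms, 2 "language+compiler+options" combinations, 8 array sizes and 3 kinds of input, that is 5 * 2 * 8 * 3 = 240 cells, each of which needs several runs. Adding one more variable with just 2 values doubles it.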

"How do I minimize the effect of said variables on the benchmark's results?"

"How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected?"

By identifying the variables (as part of the primary goal) and explicitly deciding which values the variables may have (as part of the secondary goal).

You've already been doing this. What I've described is a formal method of doing what people would unconsciously/instinctively do anyway. For one example, you have identified that "turbo boost" is a variable, and you've decided that "turbo boost disabled" is the only value for that variable you care about (but do note that this may have consequences - e.g. consider "single-threaded merge sort without the turbo boost it'd likely get in practice" vs. "parallel merge sort that isn't as influenced by turning turbo boost off").

My hope is that by describing the formal method you gain confidence in the unconscious/instinctive decisions you're already making, and realize that you were very much on the right path before you asked the question.
