15

I have a gnarly piece of code whose time-efficiency I would like to measure. Since estimating this complexity from the code itself is hard, I want to place it in a loop and time the results. Once enough data-points (size -> time) are gathered, I can see which curve fits best.

Repeating operations a number of times with random input data of a given size can smooth out fluctuations due to the OS deciding to multitask at bad moments, yielding more precise times. Increasing the size of the problem provides more points, ideally well spaced.

My test code works fine (an initial, non-timed warm-up loop to absorb load times; then, starting from a size of 10 and scaling up to 1000000 by increasing the size 10% per step, repeating runs at each size until 5s have elapsed or 5 full runs have finished). However, I arrived at these numbers by guesswork.
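For concreteness, the harness looks roughly like this (a Python sketch, not the actual test code; `run_once` is a hypothetical stand-in for the code under test, and the constants mirror the guessed values above):

```python
import random
import time

def run_once(size):
    """Hypothetical stand-in for the code under test; here it just sorts a
    random list so the sketch is runnable end to end."""
    sorted(random.random() for _ in range(size))

def gather_timings(start=10, stop=1_000_000, growth=1.10,
                   max_runs=5, budget_s=5.0):
    """Collect (size, mean seconds per run) pairs with the scheme described above."""
    run_once(start)                                # untimed warm-up run
    points, size = [], start
    while size <= stop:
        runs, elapsed = 0, 0.0
        while runs < max_runs and elapsed < budget_s:
            t0 = time.perf_counter()
            run_once(size)
            elapsed += time.perf_counter() - t0
            runs += 1
        points.append((size, elapsed / runs))
        size = max(size + 1, int(size * growth))   # scale up by ~10%, never stall
    return points
```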

**Is there an accepted, "scientific" way to scale repetitions and problem size to achieve faster, more accurate time-vs-size plots?** Is there code out there (or libraries) that can scaffold all the boring bits, and which I should have been aware of before rolling my own? In particular, it seems to me that, when bumps in the timings are found, more measurements could be warranted -- while relatively smooth readings could simply be considered "good enough".

Edit

I am aware of the classical method of calculating big-O complexity. It works fine for self-contained algorithms with a nice representative operation (say, "comparisons" or "swaps"). It does not work as advertised when those conditions are not met (example: the compile-time C++ template instantiation costs of LLVM, which is large and complex, and where I do not know what the relevant representative operation would be). That is why I am treating it as a black box, and trying to measure times from the outside instead of by code inspection.

tucuxi
  • Do you mean something like [this](http://stackoverflow.com/questions/22804177/algorithmic-complexity-of-naive-code-for-processing-all-consecutive-subsequences/22811511#22811511)? – Mohamed Ennahdi El Idrissi May 29 '14 at 22:14
  • @MohamedEnnahdiElIdrissi not really - the algorithm is much more involved and distributed among a large number of files, so that simple inspection won't cut it. I am trying to understand the compile-time complexity of instantiating a particularly complex template in C++11 – tucuxi May 30 '14 at 08:09
  • Related question: http://stackoverflow.com/questions/795827/testing-the-performance-of-a-c-app – seaotternerd May 03 '15 at 20:03
  • @seaotternerd thanks - I am aware of the unix `time` utility and the use of timers in C/C++, and that using the most precise one available is a good idea. However, this does not answer the question of "how to scale repetitions and problem sizes to achieve faster, more accurate time-vs-size plots". – tucuxi May 04 '15 at 12:57
  • @tucuxi You should explore the Linux profiling tool [Valgrind/Callgrind](http://www.valgrind.org). It's an absolutely awesome tool in its own right for hotspot detection, and the metric it measures (# of instructions fetched, in Valgrind parlance `Ir`) is a very good measure of work complexity. Run your app for various `n`'s under Callgrind, record the number of instructions fetched, and plot them! Bonus: Look at the _exquisitely detailed, per-function cost breakdowns_ of the entire program, and plot the behaviour of every portion of your program! – Iwillnotexist Idonotexist May 06 '15 at 11:52
  • @IwillnotexistIdonotexist Nice - I've used it extensively for leak detection, but had never thought of using it to collect runtime data. Still, the question remains: how many times and at which input sizes should I sample to build a quick&accurate model of time-vs-input size behaviour? – tucuxi May 06 '15 at 12:43
  • @tucuxi Try a logarithmic scale. `n=1, 2, 4, 8, 16, 32`. As for # of times to run, assuming your software is deterministic in both operation (no "true" randomness, or time bombs) and relative timing (no threads), running just once suffices, because the same number of instructions will be executed every time. If you have timing nondeterminism but lock contention is relatively light, then running a very limited # of times will give you a feel for the variance. The real problem comes when you access truly-random sources or act on timestamps. Then, only your intuition can help. – Iwillnotexist Idonotexist May 06 '15 at 14:31

5 Answers

6

Measuring the time complexity can be very difficult (if it is possible at all), and I have never seen this done in algorithm papers. If you cannot calculate the time complexity from (pseudo-)code or the algorithm description, then maybe you can use a heuristic to simplify the analysis.

Maybe you can also calculate the complexity of some parts of the algorithm and ignore other parts if they obviously have a much smaller complexity.

If nothing helps, the normal way would be to show how the algorithm scales on a machine, just as you describe. But there are many things that affect the results. To name some of them:

  • Memory types: If your input is small enough to fit into the L1 cache, your algorithm runs very fast, simply because the memory is fast. If your input gets bigger, so that it no longer fits into the L1 cache, it is stored in the L2 cache, and if it gets even bigger it is stored in RAM. Each time, your program slows down by a huge factor (in addition to the factor from the growing input). The worst case is when the input gets so big that the algorithm has to store some of it on your hard disk.
  • Multitasking: If your OS decides to hand the CPU over to another program, your algorithm seems to slow down. This is also hard to handle.
  • Hardware: In big-O, every operation counts as 1 unit of time. If your algorithm performs a lot of operations that your CPU is optimized for, this also affects your measurement.
  • Software: Software can affect your measurement the same way hardware does. E.g., if you do a lot of big-integer operations, you can massively speed up the program by using a library such as GMP.
  • Warm-up: Before you start measuring, you have to warm up the CPU first. Run the algorithm on a bigger input first (without measuring it).
  • Input cases: You can only run your program on some chosen or randomly generated input cases of a specific length. In most cases it is hard to tell (or just not possible) whether an input causes a shorter or longer run time, so maybe you test the wrong examples. If you use random inputs, you get more varied results.

All in all: I think you can only get an idea of how your algorithm scales; you cannot derive an exact upper bound on the complexity by measuring the run time. Maybe this works for really small examples, but for bigger ones you will not get reliable results.

The best you can do would be:

  • Write down the exact hardware and software of the computer you use for measurement.
  • Repeat the tests multiple times (in different orders).
  • If you change hardware or software, you should start from the beginning.
  • Only use inputs that are all stored in the same memory type, so skip all cases that fit into the cache.

This way you can see if changes have improved the algorithm or not and others can verify your results.

About the input:

  • You should use worst-case inputs if possible. If you cannot tell whether an input is a worst case or not, you should use many different cases or random inputs (if possible).
  • You have to run tests (for each input length) until the average of the run times stabilizes (one possible stopping rule is sketched below).
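One hedged way to make "until the average stabilizes" concrete (a Python sketch; `run_once`, `rel_err`, `min_runs` and `max_runs` are illustrative names and values, not part of the answer) is to stop once the standard error of the mean drops below a chosen fraction of the mean:

```python
import statistics
import time

def average_until_stable(run_once, size, rel_err=0.02, min_runs=5, max_runs=1000):
    """Keep timing runs at a fixed input length until the standard error of
    the mean falls below rel_err * mean (or max_runs is reached). The
    thresholds are arbitrary illustrations, not recommended values."""
    times = []
    while len(times) < max_runs:
        t0 = time.perf_counter()
        run_once(size)
        times.append(time.perf_counter() - t0)
        if len(times) >= min_runs:
            mean = statistics.fmean(times)
            sem = statistics.stdev(times) / len(times) ** 0.5
            if sem < rel_err * mean:
                break
    return statistics.fmean(times)
```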
AbcAeffchen
  • I am mostly interested in "how many times and for what input sizes should timing tests be performed in order to gain a useful idea of run-time complexity". Ideally size & repetitions would respond to uncertainty detected in measurements (smooth, low-variance = no need to insist; spikes, high-variance = more details required). While I am aware that the exact machine & environment will influence times, and that tests should be repeated, and that I/O times may prove important -- I still consider the main question to be unanswered. – tucuxi May 04 '15 at 12:52
  • @tucuxi I know that I did not answer your main question. Sorry for that. There are so many problems that you should try to use some theory and heuristics to calculate the complexity, or at least some (maybe bad) upper and lower bounds. Then you can (in addition) say that the code runs at about some other, measured complexity. What you do is basically statistics. The number of test runs for each input length highly depends on the algorithm. You will have to try some different numbers of repeats (the more the better), and you should attach the exact number (used at the end) to your final results. – AbcAeffchen May 04 '15 at 13:15
  • Given a black box, with a positive-but-noisy relation between sample size and calculation time, how would you test it to characterize this relation as quickly and accurately as possible? It is surely possible to time it badly (I have!). What would be the *good* way to do it? I *expect* the correct actions to depend on the measurements, and am sure that someone has described how to vary measurements depending on observed outputs. I strongly suspect that this is routine for experimental sciences such as physics or chemistry. – tucuxi May 04 '15 at 13:29
  • @tucuxi I'm working on a randomized algorithm using lattices. I cannot prove the complexity of this algorithm, but I (we) hope that it runs in polynomial time. We ran some single tests and noticed that, for short inputs, the running time varies a lot, so we ran more tests until adding another result no longer changed the average much. For long inputs the algorithm is quite stable, so we run far fewer tests. I'm not a physicist, so I don't know how they do it. I hope you get some more answers that are more helpful than mine. – AbcAeffchen May 04 '15 at 13:40
  • Thanks @AbcAeffchen - I do appreciate your looking into the question. – tucuxi May 04 '15 at 14:37
2

I'm not aware of any software for this, or previous work done on it. And, fundamentally, I don't think you can get answers of the form "O(whatever)" that are trustworthy. Your measurements are noisy, you might be trying to distinguish n*log(n) operations from n*sqrt(n) operations, and unlike a nice clean mathematical analysis, all of the dropped constants are still floating around messing with you.

That said, the process I would go through if I wanted to come up with a best estimate:

  1. Making sure I record as much information as possible through the whole process, I'd run the thing I want to measure on as many inputs (and sizes) as I could before I got bored. Probably overnight. Repeated measurements for each input and size.
  2. Shovel the input size to time data into a trial copy of Eureqa and see what pops out.
  3. If I'm not satisfied, get more data, continue to shovel it into Eureqa and see if the situation is improving.
  4. Assuming Eureqa doesn't give an answer I like before I get bored of it consuming all of my CPU time and power, I'd switch over to Bayesian methods.
  5. Using something like pymc I'd attempt to model the data using a bunch of likely looking complexity functions (n, n^2, n^3, n*log(n), n^2*log(n), n^2*log(n)^2, etc.); a simpler stand-in for this step is sketched after the list.
  6. Compare the DIC (smaller is better) of each model, looking for the best few.
  7. Plot the best few, look for spots where data and model disagree.
  8. Collect more data near disagreements. Recompute the models.
  9. Repeat 5-8 until bored.
  10. Finally, collect some new data points at larger input sizes, see which model(s) best predict those data points.
  11. Choose to believe that one of them is true enough.
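As a rough, non-Bayesian stand-in for steps 5-6 (ordinary least squares and AIC in place of pymc and DIC; the candidate list and all names are purely illustrative), one could rank candidate complexity shapes like this:

```python
import numpy as np

# Candidate complexity shapes to fit; extend the list as needed.
CANDIDATES = {
    "n":         lambda n: n,
    "n log n":   lambda n: n * np.log(n),
    "n^2":       lambda n: n ** 2,
    "n^2 log n": lambda n: n ** 2 * np.log(n),
    "n^3":       lambda n: n ** 3,
}

def rank_models(sizes, times):
    """Fit time ~ a*f(n) + b by least squares for every candidate f and return
    the candidates sorted by AIC (smaller is better, as with DIC)."""
    n = np.asarray(sizes, dtype=float)
    t = np.asarray(times, dtype=float)
    ranking = []
    for name, f in CANDIDATES.items():
        X = np.column_stack([f(n), np.ones_like(n)])
        coef, *_ = np.linalg.lstsq(X, t, rcond=None)
        rss = float(np.sum((t - X @ coef) ** 2))
        aic = len(t) * np.log(rss / len(t)) + 2 * X.shape[1]  # Gaussian-error AIC
        ranking.append((aic, name))
    return sorted(ranking)
```

Steps 7-8 would then amount to plotting the residuals of the best few fits and collecting extra data near the sizes where the residuals are largest.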
Jay Kominek
  • While you provide insightful ideas on how to validate models once the data is gathered, only step 8 partially addresses the main question: how often and where to sample based on sampling outcomes. – tucuxi May 06 '15 at 08:18
2

First things first, I don't know of an accepted, "scientific" way to scale repetitions and problem size to achieve faster, more accurate time-vs-size plots, so I cannot say anything on the matter.

Other than that, for a better measurement of the time complexity I would suggest measuring the average execution time for a fixed size and comparing it with the average execution time measured in the previous cycle. After that, you increase the size of the input data and repeat the measurement.

This is similar to one of the methods used in Numerical Analysis to estimate errors of numerical methods. You just adapt it to estimate the average error in the execution time of the implementation of your algorithm.

So, to cut it short (a sketch of this loop follows the list):

  1. Pick a size for the input data.
  2. Run the implementation of the algorithm.
  3. Calculate the average execution time.
  4. If you have a previous calculation of the average execution time, compare it with the current one; otherwise repeat from point 2.
  5. If the difference is greater than or equal to a certain (previously defined) threshold, repeat from point 2.
  6. Increase the size of the input data.
  7. Repeat from point 2.
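A minimal Python sketch of this loop (with hypothetical `run_once`, `threshold` and `batch` standing in for the "previously defined" quantities):

```python
import time

def estimate_curve(run_once, sizes, threshold=0.01, batch=10):
    """For each input size, keep adding batches of timed runs until two
    consecutive averages differ by less than `threshold` seconds, then record
    the average and move on to the next size (points 1-7 above)."""
    curve = []
    for size in sizes:
        samples, prev_avg = [], None
        while True:
            for _ in range(batch):
                t0 = time.perf_counter()
                run_once(size)
                samples.append(time.perf_counter() - t0)
            avg = sum(samples) / len(samples)
            if prev_avg is not None and abs(avg - prev_avg) < threshold:
                break
            prev_avg = avg
        curve.append((size, avg))
    return curve
```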

Let me know if something is unclear.

Gentian Kasa
  • Your proposed algorithm seems more fragile than Jay's: it would treat each size independently of the others, and would not attempt to gather more samples at sizes where sudden spikes are observed within the overall plot. – tucuxi May 06 '15 at 13:55
  • The size is increased only when we have a satisfactory measurement of the average time. For example: say we have a set of 1000 items and a difference threshold of 1 second; we repeat the measurement until `|current average execution time - previous average execution time| <= 1 second`. After that we increase the number of items in the input set and measure the average time again. – Gentian Kasa May 06 '15 at 14:04
  • Awarding you the bounty for "best workable answer to the question yet" (Jay's is much more vague on details), although I still suspect there could be better answers out there. – tucuxi May 06 '15 at 14:10
  • I'm sure there are. I would ask a favour, if that's OK with you: since I'm personally interested in algorithms and complexity, and this is a problem with no known solution (as far as I know), would you be so kind as to drop a message if you find a better solution? I'd be really grateful. – Gentian Kasa May 06 '15 at 14:18
  • Mark the star next to the problem title - this will notify you whenever its status changes – tucuxi May 06 '15 at 14:20
1

Suppose you run the following in a loop. At iteration i = 0, 1, 2, ..., for some fixed n_0 > 0 and very large n, you sample the function at 2i + n_0 equally spaced (up to rounding) points in the range 1, ..., n. You then do either one of the following, or a combination of both:

  1. Train a spline using the even points and test it on the odd points (and also vice versa). Decide the iteration is enough if the l2 error is below some threshold.

  2. Train a spline using all the points, and test it on the values at, say 2n. Again, decide the iteration is enough if the l2 error is below some threshold.

Point 1 emphasizes the interpolation error, and point 2 emphasizes the extrapolation error. Realistically speaking, I think you will at best be able to identify functions described by a spline.
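A rough sketch of option 1 (assuming SciPy's `UnivariateSpline`; `measure(n)` is a hypothetical function returning the measured time for input size `n`, and at least ~8 sample points are needed for a cubic fit):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def interpolation_error(measure, n_max, n_points):
    """Sample `n_points` roughly equally spaced sizes in 1..n_max, fit an
    interpolating cubic spline on the even-indexed points and return the l2
    error of its predictions on the odd-indexed points."""
    xs = np.unique(np.linspace(1, n_max, n_points).round().astype(int))
    ys = np.array([measure(int(n)) for n in xs])
    train_x, train_y = xs[::2], ys[::2]
    test_x, test_y = xs[1::2], ys[1::2]
    spline = UnivariateSpline(train_x, train_y, k=3, s=0)  # pure interpolation
    return float(np.sqrt(np.mean((spline(test_x) - test_y) ** 2)))
```

The outer loop would then call this with n_points = 2i + n_0 for i = 0, 1, 2, ... and stop once the returned error falls below the chosen threshold; the "vice versa" and extrapolation variants follow the same pattern.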


Depending on the fitting method you use, you might need to fit some meta-parameters for the spline method. In this case, you might need to use more than ~2i samples per iteration, as you might need to use some of them for parameter-tuning cross validation.

Ami Tavory
  • This is the first answer that actually addresses the bolded part of the question. I would, however, prefer more details and ideally references regarding this approach. Is it your idea, or have you seen this used elsewhere? – tucuxi May 06 '15 at 12:40
  • As an aside, I do not plan to "identify" the functions as belonging to a given family; I expect to "characterize", with a degree of confidence, the shape of the curve. – tucuxi May 06 '15 at 14:03
  • Regarding your first point: no, I did not use it directly for this type of problem specifically, nor am I aware of others who have. I have done similar stuff in other domains, so the question is whether this would work well here. (Unsurprisingly,) this is what I personally would try to do first in this case, though. – Ami Tavory May 06 '15 at 14:26
0

Use the "ratio method" if you are trying to get a black-box estimate of the complexity. For instance: if you sit in a tight loop doing a fixed length job, like inserting a random record into a database, you record a timestamp at the end of each iteration. The timestamps will start to be farther apart as more data goes in. So, then graph the time difference between contiguous timestamps.

If you divide that graph by lg[n] and it continues to rise, then it's worse than lg[n]. Try dividing by: lg[n], n, n*lg[n], n*n, etc. When you divide by a function that is too high an estimate, the plot will trend to zero. When you divide by a function that is too low, the plot will continue to climb. When you have a good estimate, there is a point in your data set at which you can place an upper and lower bound between which the graph wanders for as far out as you care to check.
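A minimal sketch of the ratio method (Python; the list of reference functions is illustrative, and the actual plotting is left out):

```python
import numpy as np

# Reference functions to divide the measured times by; extend as needed.
REFS = {
    "lg n":   lambda n: np.log2(n),
    "n":      lambda n: n,
    "n lg n": lambda n: n * np.log2(n),
    "n^2":    lambda n: n ** 2,
}

def ratio_curves(sizes, times):
    """Return time / f(n) for each reference f.  A ratio that keeps climbing
    means f is too low an estimate, one that trends to zero means f is too
    high, and a curve that stays within a fixed band suggests a good fit."""
    n = np.asarray(sizes, dtype=float)
    t = np.asarray(times, dtype=float)
    return {name: t / f(n) for name, f in REFS.items()}
```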

Rob