15

Sorry, it's a long one, but I'm just explaining my train of thought as I analyze this. Questions at the end.

I have a basic understanding of what goes into measuring the running time of code: you run it multiple times to get an average, both to account for variation between runs and to capture the runs where the cache was used effectively.

In an attempt to measure running times for someone, I went through multiple revisions and eventually ended up with this code, which captured the results I intended without giving misleading numbers:

// implementation C
static void Test<T>(string testName, Func<T> test, int iterations = 1000000)
{
    Console.WriteLine(testName);
    Console.WriteLine("Iterations: {0}", iterations);
    var results = Enumerable.Repeat(0, iterations).Select(i => new System.Diagnostics.Stopwatch()).ToList(); // one Stopwatch per iteration
    var timer = System.Diagnostics.Stopwatch.StartNew(); // times the whole loop
    for (int i = 0; i < results.Count; i++)
    {
        results[i].Start(); // times this iteration only
        test();
        results[i].Stop();
    }
    timer.Stop();
    Console.WriteLine("Time(ms): {0,3}/{1,10}/{2,8} ({3,10})", results.Min(t => t.ElapsedMilliseconds), results.Average(t => t.ElapsedMilliseconds), results.Max(t => t.ElapsedMilliseconds), timer.ElapsedMilliseconds);
    Console.WriteLine("Ticks:    {0,3}/{1,10}/{2,8} ({3,10})", results.Min(t => t.ElapsedTicks), results.Average(t => t.ElapsedTicks), results.Max(t => t.ElapsedTicks), timer.ElapsedTicks);
    Console.WriteLine();
}

Most of the code I've seen that measures running times is usually in this form:

// approach 1 pseudocode
start timer;
loop N times:
    run testing code (directly or via function);
stop timer;
report results;

This seemed good to me since it gives the total running time, from which I can easily work out the average running time, and the tight loop should have good cache locality.
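
In C#, a minimal sketch of approach 1 might look like this (the method name and output format are just illustrative):

// approach 1 sketch: a single Stopwatch around the whole loop;
// only the total (and the derived average) is available, no per-iteration data
static void TestTotal<T>(string testName, Func<T> test, int iterations = 1000000)
{
    var timer = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        test();
    }
    timer.Stop();
    Console.WriteLine("{0}: total {1} ms, average {2} ms", testName,
        timer.ElapsedMilliseconds, (double)timer.ElapsedMilliseconds / iterations);
}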

But one set of values I thought was important to have was the minimum and maximum iteration running times. These cannot be calculated using the above form, so when I wrote my testing code, I wrote it in this form:

// approach 2 pseudocode
loop N times:
    start timer;
    run testing code (directly or via function);
    stop timer;
    store results;
report results;

This is good because I could then find the minimum and maximum as well as the average times, which were the numbers I was interested in. Only now have I realized that this could potentially skew the results: since the loop isn't very tight, the cache could be affected, giving me less-than-optimal numbers.


The way I wrote the test code (using LINQ) added extra overhead, which I knew about but ignored since I was only measuring the code under test, not the overhead. Here was my first version:

// implementation A
static void Test<T>(string testName, Func<T> test, int iterations = 1000000)
{
    Console.WriteLine(testName);
    var results = Enumerable.Repeat(0, iterations).Select(i =>
    {
        var timer = System.Diagnostics.Stopwatch.StartNew();
        test();
        timer.Stop();
        return timer;
    }).ToList();
    Console.WriteLine("Time(ms): {0,3}/{1,10}/{2,8}", results.Min(t => t.ElapsedMilliseconds), results.Average(t => t.ElapsedMilliseconds), results.Max(t => t.ElapsedMilliseconds));
    Console.WriteLine("Ticks:    {0,3}/{1,10}/{2,8}", results.Min(t => t.ElapsedTicks), results.Average(t => t.ElapsedTicks), results.Max(t => t.ElapsedTicks));
    Console.WriteLine();
}

Here I thought this was fine since I'm only measuring the time it took to run the test function; the overhead associated with LINQ is not included in the running times. To reduce the overhead of creating the timer objects within the loop, I made the following modification:

// implementation B
static void Test<T>(string testName, Func<T> test, int iterations = 1000000)
{
    Console.WriteLine(testName);
    Console.WriteLine("Iterations: {0}", iterations);
    var results = Enumerable.Repeat(0, iterations).Select(i => new System.Diagnostics.Stopwatch()).ToList();
    results.ForEach(t =>
    {
        t.Start();
        test();
        t.Stop();
    });
    Console.WriteLine("Time(ms): {0,3}/{1,10}/{2,8} ({3,10})", results.Min(t => t.ElapsedMilliseconds), results.Average(t => t.ElapsedMilliseconds), results.Max(t => t.ElapsedMilliseconds), results.Sum(t => t.ElapsedMilliseconds));
    Console.WriteLine("Ticks:    {0,3}/{1,10}/{2,8} ({3,10})", results.Min(t => t.ElapsedTicks), results.Average(t => t.ElapsedTicks), results.Max(t => t.ElapsedTicks), results.Sum(t => t.ElapsedTicks));
    Console.WriteLine();
}

This improved the overall times but caused a minor problem. I reported the total running time by summing each iteration's time, but that gave misleading numbers since the individual times were short and the sum didn't reflect the actual running time (which was usually much longer). I now needed to measure the time of the entire loop as well, so I moved away from LINQ and ended up with the code I have at the top. This hybrid gets the times I think are important with minimal overhead, AFAIK (starting and stopping the timer just queries the high-resolution timer). Also, any context switching that occurs is unimportant to me as it's part of normal execution anyway.

At one point, I forced the thread to yield within the loop to make sure that the OS is given a chance to switch threads at a convenient time (in case the test code is CPU bound and doesn't block at all). I'm not too concerned about other running processes changing the cache for the worse, since I would be running these tests alone anyway. However, I came to the conclusion that for this particular case it was unnecessary, though I might incorporate it into the final version if it proves beneficial in general, perhaps as an alternate algorithm for certain code.
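
For reference, the yield variant looked roughly like this (a sketch only; the placement of the yield outside the timed region is the important part):

for (int i = 0; i < results.Count; i++)
{
    results[i].Start();
    test();
    results[i].Stop();
    System.Threading.Thread.Yield(); // give up the rest of the time slice between timed iterations
}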


Now my questions:

  • Did I make some right choices? Some wrong ones?
  • Did I make wrong assumptions about the goals in my thought process?
  • Would the minimum or maximum running times really be useful information to have or is it a lost cause?
  • If so, which approach would be better in general? The time running in a loop (approach 1)? Or the time running just the code in question (approach 2)?
  • Would my hybrid approach be ok to use in general?
  • Should I yield (for reasons explained in the last paragraph) or is that more harm to the times than necessary?
  • Is there a more preferred way to do this that I did not mention?

Just to be clear, I'm not looking for an all-purpose, use-anywhere, accurate timer. I just want to know what algorithm I should use when I want a quick-to-implement, reasonably accurate timer to measure code when a library or other third-party tool is not available.

I'm inclined to write all my test code in this form should there be no objections:

// final implementation
static void Test<T>(string testName, Func<T> test, int iterations = 1000000)
{
    // print header
    var results = Enumerable.Repeat(0, iterations).Select(i => new System.Diagnostics.Stopwatch()).ToList();
    for (int i = 0; i < 100; i++) // warm up the cache
    {
        test();
    }
    var timer = System.Diagnostics.Stopwatch.StartNew(); // time whole process
    for (int i = 0; i < results.Count; i++)
    {
        results[i].Start(); // time individual process
        test();
        results[i].Stop();
    }
    timer.Stop();
    // report results
}

For the bounty, I would ideally like to have all the above questions answered. I'm hoping for a good explanation of whether the thoughts that influenced the code here were well justified (and possibly how to improve it if it's suboptimal), or, if I was wrong on a point, an explanation of why it's wrong and/or unnecessary and, if applicable, a better alternative.

To summarize the important questions and my thoughts for the decisions made:

  1. Is getting the running time of each individual iteration generally a good thing to have?
    With the times for each individual iteration, I can calculate additional statistical information like the minimum and maximum running times as well as the standard deviation, so I can see whether factors such as caching or other unknowns may be skewing the results. This led to my "hybrid" version (see the sketch after this list).
  2. Is having a small loop of runs before the actual timing starts good too?
    From my response to Sam Saffron's thought on the loop, this is to increase the likelihood that constantly accessed memory will be cached. That way I'm measuring the times only for when everything is cached, rather than some of the cases where memory access isn't cached.
  3. Would a forced Thread.Yield() within the loop help or hurt the timings of CPU bound test cases?
    If the process is CPU bound, the OS scheduler would lower the priority of this task, potentially increasing times due to lack of time on the CPU. If it is not CPU bound, I would omit the yielding.
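
As an illustration of point 1, here is a minimal sketch of the extra statistics that become available once each iteration is timed individually (assuming results is the List<Stopwatch> filled in by implementation C above):

// sketch only: extra statistics from the per-iteration timings of implementation C
var ticks = results.Select(t => (double)t.ElapsedTicks).ToList();
double mean = ticks.Average();
double stdDev = Math.Sqrt(ticks.Sum(t => (t - mean) * (t - mean)) / (ticks.Count - 1));
Console.WriteLine("Min/Mean/Max ticks: {0}/{1:0.##}/{2}, StdDev: {3:0.##}",
    ticks.Min(), mean, ticks.Max(), stdDev);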

Based on the answers here, I'll be writing my test functions using the final implementation without the individual timings for the general case. If I would like to have other statistical data, I would reintroduce it back into the test function as well as apply the other things mentioned here.

Jeff Mercado

8 Answers

9

My first thought is that a loop as simple as

for (int i = 0; i < x; i++)
{
    timer.Start();
    test();
    timer.Stop();
}

is kinda silly compared to:

timer.Start();
for (int i = 0; i < x; i++)
    test();
timer.Stop();

the reason is that (1) this kind of "for" loop has a very tiny overhead, so small that it's not worth worrying about even if test() only takes a microsecond, and (2) timer.Start() and timer.Stop() have their own overhead, which is likely to affect the results more than the for loop. That said, I took a peek at Stopwatch in Reflector and noticed that Start() and Stop() are fairly cheap (calling Elapsed* properties is likely more expensive, considering the math involved.)

Make sure the IsHighResolution property of Stopwatch is true. If it's false, Stopwatch uses DateTime.UtcNow, which I believe is only updated every 15-16 ms.
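
A quick sketch to check this on your machine:

// sketch: verify the timer resolution before trusting the numbers
Console.WriteLine("High resolution: {0}", System.Diagnostics.Stopwatch.IsHighResolution);
Console.WriteLine("Frequency: {0} ticks/sec", System.Diagnostics.Stopwatch.Frequency);
Console.WriteLine("Tick length: {0:0.###} ns", 1e9 / System.Diagnostics.Stopwatch.Frequency);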

1. Is getting the running time of each individual iteration generally a good thing to have?

It is not usually necessary to measure the runtime of each individual iteration, but it is useful to find out how much the performance varies between different iterations. To this end, you can compute the min/max (or k outliers) and standard deviation. Only the "median" statistic requires you to record every iteration.

If you find that the standard deviation is large, you might then have reason to record every iteration, in order to explore why the time keeps changing.

Some people have written small frameworks to help you do performance benchmarks. For example, CodeTimers. If you are testing something that is so tiny and simple that the overhead of the benchmark library matters, consider running the operation in a for-loop inside the lambda that the benchmark library calls. If the operation is so tiny that the overhead of a for-loop matters (e.g. measuring the speed of multiplication), then use manual loop unrolling. But if you use loop unrolling, remember that most real-world apps don't use manual loop unrolling, so your benchmark results may overstate the real-world performance.
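
For example, a very cheap operation could be repeated inside the lambda so the measured work dwarfs the per-call overhead (a sketch using the Test method from the question; the inner count of 1000 is arbitrary, and the reported times should be divided by it):

// sketch: amortize benchmark overhead by looping inside the measured delegate
Test("integer multiply x1000", () =>
{
    long acc = 1;
    for (int i = 1; i <= 1000; i++)
        acc *= i;
    return acc; // return the result so the work isn't optimized away
});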

For myself I wrote a little class for gathering min, max, mean, and standard deviation, which could be used for benchmarks or other statistics:

using System.Diagnostics; // needed for Debug.Assert below

// A lightweight class to help you compute the minimum, maximum, average
// and standard deviation of a set of values. Call Clear(), then Add(each
// value); you can compute the average and standard deviation at any time by 
// calling Avg() and StdDeviation().
class Statistic
{
    public double Min;
    public double Max;
    public double Count;
    public double SumTotal;
    public double SumOfSquares;

    public void Clear()
    {
        SumOfSquares = Min = Max = Count = SumTotal = 0;
    }
    public void Add(double nextValue)
    {
        Debug.Assert(!double.IsNaN(nextValue));
        if (Count > 0)
        {
            if (Min > nextValue)
                Min = nextValue;
            if (Max < nextValue)
                Max = nextValue;
            SumTotal += nextValue;
            SumOfSquares += nextValue * nextValue;
            Count++;
        }
        else
        {
            Min = Max = SumTotal = nextValue;
            SumOfSquares = nextValue * nextValue;
            Count = 1;
        }
    }
    public double Avg()
    {
        return SumTotal / Count;
    }
    public double Variance()
    {
        return (SumOfSquares * Count - SumTotal * SumTotal) / (Count * (Count - 1));
    }
    public double StdDeviation()
    {
        return Math.Sqrt(Variance());
    }
    public Statistic Clone()
    {
        return (Statistic)MemberwiseClone();
    }
};
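
For illustration, a usage sketch (assuming it runs inside a harness like the Test methods earlier, where test and iterations are in scope):

// sketch: feeding per-iteration timings (in Stopwatch ticks) into Statistic
var stat = new Statistic();
stat.Clear();
for (int i = 0; i < iterations; i++)
{
    long start = System.Diagnostics.Stopwatch.GetTimestamp();
    test();
    stat.Add(System.Diagnostics.Stopwatch.GetTimestamp() - start);
}
Console.WriteLine("ticks: min {0}, avg {1:0.##}, max {2}, stddev {3:0.##}",
    stat.Min, stat.Avg(), stat.Max, stat.StdDeviation());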

2. Is having a small loop of runs before the actual timing starts good too?

Which iterations you measure depends on whether you care most about startup time, steady-state time or total runtime. In general, it may be useful to record one or more runs separately as "startup" runs. You can expect the first iteration (and sometimes more than one) to run more slowly. As an extreme example, my GoInterfaces library consistently takes about 140 milliseconds to produce its first output, then it does 9 more in about 15 ms.

Depending on what the benchmark measures, you may find that if you run the benchmark right after rebooting, the first iteration (or first few iterations) will run very slowly. Then, if you run the benchmark a second time, the first iteration will be faster.

3. Would a forced Thread.Yield() within the loop help or hurt the timings of CPU bound test cases?

I'm not sure. It may clear the processor caches (L1, L2, TLB), which would not only slow down your benchmark overall but also lower the measured speeds. Your results will be more "artificial", not reflecting as well what you would get in the real world. Perhaps a better approach is to avoid running other tasks at the same time as your benchmark.
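
Related to that last point, one common measure (a sketch of general practice, not something this answer prescribes) is to raise the benchmark's scheduling priority so other work is less likely to preempt it; use it with care, since a high-priority CPU-bound loop can starve the rest of the machine:

// sketch: reduce scheduler interference instead of yielding
System.Diagnostics.Process.GetCurrentProcess().PriorityClass = System.Diagnostics.ProcessPriorityClass.High;
System.Threading.Thread.CurrentThread.Priority = System.Threading.ThreadPriority.Highest;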

Qwertie
  • Thanks for your answer, and for specifically addressing the important points. – Jeff Mercado Nov 10 '10 at 04:52
  • p.s., The timing of individual iterations of the loop wasn't for not timing the overhead of the loop, but for statistical reasons. I agree, that would indeed be silly to do. :) – Jeff Mercado Nov 11 '10 at 06:06
  • Well, I'm a bit surprised and thankful to get the rep boost. I'm not really an expert in performance analysis, I just do it sometimes. I hope you learned whatever it was you wanted to learn :). Yeah, sorry to misconstrue your code, since you were using a different timer in each iteration. – Qwertie Nov 18 '10 at 20:20
  • 1
    One danger in timing a whole repeat loop is compiler optimization across iterations. A compiler might hoist some loop-invariant calculation if it notices, or sink some loop-invariant stores. More aggressive ahead-of-time compilers are more likely to do optimizations like that which are useful in real code but defeat benchmarks. See also the BenchmarkDotNet C# benchmark framework that should help avoid that and other problems mentioned in [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Dec 24 '21 at 12:45
  • 1
    And BTW, yes, a simple counted loop has overhead of less than a clock cycle of throughput cost per iteration on modern CPU that can issue/rename 4 or 5 uops per clock, with `dec/jnz` only taking one (Intel) or two (AMD). So *even if test() only takes a microsecond* is still glacial by comparison, thousands of clock cycles on a multi-GHz machine. If microbenchmarking a single instruction (especially for throughput instead of latency), you'd want to unroll a few times, but you'd need a repeat loop because the lowest-overhead timing on x86 (`rdtsc`) is way more expensive than a loop iteration. – Peter Cordes Dec 24 '21 at 15:40
4

Regardless of the mechanism for timing your function (and the answers here seems fine) there is a very simple trick to eradicate the overhead of the benchmarking-code itself, i.e. the overhead of the loop, timer-readings, and method-call:

Simply call your benchmarking code with an empty Func<T> first, i.e.

static T EmptyFunc<T>() { return default(T); }

This will give you a baseline of the timing overhead, which you can essentially subtract from the later measurements of your actual benchmarked function.

By "essentially" I mean that there are always room for variations when timing some code, due to garbage collection and thread and process scheduling. A pragmatic approach would e.g. be to benchmark the empty function, find the average overhead (total time divided by iterations) and then subtract that number from each timing-result of the real benchmarked function, but don't let it go below 0 which wouldn't make sense.

You will, of course, have to rearrange your benchmarking code a bit. Ideally you'll want to use the exact same code to benchmark the empty function and the real benchmarked function, so I suggest you move the timing loop into another function or at least keep the two loops completely alike. In summary:

  1. benchmark the empty function
  2. calculate the average overhead from the result
  3. benchmark the real test-function
  4. subtract the average overhead from those test results
  5. you're done

By doing this the actual timing mechanism suddenly becomes a lot less important.
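
A sketch of the whole procedure (illustrative names; TimeLoop is a hypothetical helper, shared by both runs, that returns per-iteration Stopwatch ticks):

static T EmptyFunc<T>() { return default(T); }

// hypothetical shared timing loop: identical for the empty baseline and the real test
static long[] TimeLoop<T>(Func<T> test, int iterations)
{
    var ticks = new long[iterations];
    for (int i = 0; i < iterations; i++)
    {
        long start = System.Diagnostics.Stopwatch.GetTimestamp();
        test();
        ticks[i] = System.Diagnostics.Stopwatch.GetTimestamp() - start;
    }
    return ticks;
}

static void TestWithBaseline<T>(string testName, Func<T> test, int iterations = 1000000)
{
    double overhead = TimeLoop<T>(EmptyFunc<T>, iterations).Average();   // steps 1 and 2
    long[] measured = TimeLoop<T>(test, iterations);                     // step 3
    var corrected = measured.Select(t => Math.Max(0.0, t - overhead));   // step 4: never below 0
    Console.WriteLine("{0}: corrected avg ticks {1:0.##}", testName, corrected.Average());
}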

  • I'm not too concerned about the overhead of the loop or function calls themselves. But that's a great way to cancel it out. – Jeff Mercado Nov 14 '10 at 18:23
2

I think your first code sample seems like the best approach.

Your first code sample is small, clean and simple and doesn't use any major abstractions during the test loop which may introduce hidden overhead.

Use of the Stopwatch class is a good thing as it simplifies the code one normally has to write to get high-resolution timings.

One thing you might consider is providing the option to run the test a smaller number of times, untimed, before entering the timing loop to warm up any caches, buffers, connections, handles, sockets, threadpool threads, etc. that the test routine may exercise.

HTH.

Rich Turner
  • Sorry, I should have labeled the pieces of code just in case any one referred to it. Were you referring to implementation C (the first code sample listed in the question) or implementation A (the first code attempt)? – Jeff Mercado Oct 23 '10 at 18:12
  • I have seen the small untimed loop once and thought it was a good idea. I think I'll do that too from now on. – Jeff Mercado Oct 23 '10 at 18:13
1

I tend to agree with @Sam Saffron about using one Stopwatch rather than one per iteration. In your example you are performing 1000000 iterations by default. I don't know what the cost of creating a single Stopwatch is, but you are creating 1000000 of them. Conceivably, that in and of itself could affect your test results. I reworked your "final implementation" a little bit to allow the measurement of each iteration without creating 1000000 Stopwatches. Granted, since I am saving the result of each iteration, I am allocating 1000000 longs, but at first glance it seems like that would have less overall effect than allocating that many Stopwatches. I haven't compared my version to your version to see if mine would yield different results.

static void Test2<T>(string testName, Func<T> test, int iterations = 1000000)
{
  long [] results = new long [iterations];

  // print header 
  for (int i = 0; i < 100; i++) // warm up the cache 
  {
    test();
  }

  var timer = System.Diagnostics.Stopwatch.StartNew(); // time whole process 

  long start;

  for (int i = 0; i < results.Length; i++)
  {
    start = Stopwatch.GetTimestamp();
    test();
    results[i] = Stopwatch.GetTimestamp() - start;
  }

  timer.Stop();

  double ticksPerMillisecond = Stopwatch.Frequency / 1000.0;

  Console.WriteLine("Time(ms): {0,3}/{1,10}/{2,8} ({3,10})", results.Min(t => t / ticksPerMillisecond), results.Average(t => t / ticksPerMillisecond), results.Max(t => t / ticksPerMillisecond), results.Sum(t => t / ticksPerMillisecond));
  Console.WriteLine("Ticks:    {0,3}/{1,10}/{2,8} ({3,10})", results.Min(), results.Average(), results.Max(), results.Sum());

  Console.WriteLine();
}

I am using the Stopwatch's static GetTimestamp method twice in each iteration. The delta between the two will be the amount of time spent in the iteration. Using Stopwatch.Frequency, we can convert the delta values to milliseconds.

Using Timestamp and Frequency to calculate performance is not necessarily as clear as just using a Stopwatch instance directly. But, using a different stopwatch for each iteration is probably not as clear as using a single stopwatch to measure the whole thing.

I don't know that my idea is any better or any worse than yours, but it is slightly different ;-)

I also agree about the warmup loop. Depending on what your test is doing, there could be some fixed startup costs that you don't want to affect the overall results. The startup loop should eliminate that.

There is probably a point at which keeping each individual timing result is counterproductive due to the cost of the storage necessary to hold the whole array of values (or timers). For less memory, but more processing time, you could simply sum the deltas, computing the min and max as you go. That has the potential of throwing off your results, but if you are primarily concerned with the statistics generated from the individual iteration measurements, then you can just do the min and max calculation outside of the time delta check:

static void Test2<T>(string testName, Func<T> test, int iterations = 1000000)
{
  //long [] results = new long [iterations];
  long min = long.MaxValue;
  long max = long.MinValue;

  // print header 
  for (int i = 0; i < 100; i++) // warm up the cache 
  {
    test();
  }

  var timer = System.Diagnostics.Stopwatch.StartNew(); // time whole process 

  long start;
  long delta;
  long sum = 0;

  for (int i = 0; i < iterations; i++)
  {
    start = Stopwatch.GetTimestamp();
    test();
    delta = Stopwatch.GetTimestamp() - start;
    if (delta < min) min = delta;
    if (delta > max) max = delta;
    sum += delta;
  }

  timer.Stop();

  double ticksPerMillisecond = Stopwatch.Frequency / 1000.0;

  Console.WriteLine("Time(ms): {0,3}/{1,10}/{2,8} ({3,10})", min / ticksPerMillisecond, sum / ticksPerMillisecond / iterations, max / ticksPerMillisecond, sum);
  Console.WriteLine("Ticks:    {0,3}/{1,10}/{2,8} ({3,10})", min, sum / iterations, max, sum);

  Console.WriteLine();
}

Looks pretty old school without the Linq operations, but it still gets the job done.

wageoghe
  • Ah, I wasn't sure what he may have meant when he mentioned a single stopwatch. Your example made it clearer to me. The amount of memory used isn't so much of an issue for me. But it totally escaped my mind to just get timestamps, which would be easier than keeping stopwatches. – Jeff Mercado Nov 10 '10 at 04:58
0

The logic in Approach 2 feels 'righter' to me, but I'm just a CS student.

I came across this link that you might find of interest: http://www.yoda.arachsys.com/csharp/benchmark.html

Quinn1000
  • I've always thought so too but I always see it in the form of approach 1 by other developers. So it is either a better approach or they didn't think to try approach 2 (or 3). My observations on the numbers trying these different approaches makes it unclear to me. At least I know I'm not alone on this. :) And thanks for the link. Not really the kind of thing I'm looking for but it's definitely appreciated. And it reminded me of the word I intended to use for a tag: [benchmark]. ;) – Jeff Mercado Oct 23 '10 at 08:14
0

Depending on the running time of the code you're testing, it can be quite difficult to measure the individual runs. If the runtime of the code you're testing is multiple seconds, timing each specific run will most likely not be a problem. If it's in the vicinity of milliseconds, your results will probably vary too much. If you e.g. have a context switch or a read from the swap file at the wrong moment, the runtime of that run will be disproportionate to the average runtime.

Pieter van Ginkel
  • What do you suppose is a good threshold for the minimum running time of the test code? I know the test code should run at least a few milliseconds consistently at the minimum. I had always aimed for at least 10. – Jeff Mercado Oct 23 '10 at 18:16
0

I had a similar question here.

I much prefer the concept of using a single stopwatch, especially if you are micro benchmarking. Your code is not accounting for the GC, which can affect performance.

I think forcing a GC collection prior to the test runs is pretty important; also, I am not sure what the point of a 100-iteration warmup run is.
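
A sketch of what that could look like just before the timing loop (the exact calls are a common pattern, not something this answer prescribes):

// sketch: force a full collection before the timed loop so a pending GC
// is less likely to land in the middle of the measurements
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();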

Sam Saffron
  • A garbage collection at the start sounds like a good idea too. It ensures that there is as much memory available to the calls as possible. I may do that in the future as well, as I don't see any reason not to. – Jeff Mercado Nov 05 '10 at 02:51
  • The motivation behind the short warmup prior to the timing was to increase the likelihood that constantly accessed memory will be in the cache by the time you start timing. That way you're not timing the latency to access memory too much but the optimal "everything is in cache." I know this is very important when writing low level, high performance code, especially for multi-core/multi-threaded code. I now consider myself always writing for multi-core. That is if someone who knows better knows this is not necessary, I could be convinced otherwise, or at least provide a better explanation. – Jeff Mercado Nov 05 '10 at 02:53
0

I would lean toward the last, but I'd consider whether the overhead of starting and stopping a timer could be greater than that of looping itself.

One thing to consider, though, is whether the effect of CPU cache misses is actually a fair thing to try to counter.

Taking advantage of CPU caches is something where one approach may beat another, but in real-world cases there might be a cache miss with each call, so this advantage becomes inconsequential. In that case, the approach that makes less good use of the cache could be the one with better real-world performance.

Array-based versus singly-linked-list-based queues are an example: the former almost always performs better when cache lines don't get refilled in between calls, but suffers more than the latter on resize operations. Hence the latter can win in real-world cases (all the more so as it is easier to write in a lock-free form) even though it will almost always lose in the rapid iterations of timing tests.

For this reason it can also be worth trying some iterations with something that actually forces the cache to be flushed. I can't think what the best way to do that would be right now, so I might come back and add to this if I do.
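
One crude option (a sketch of a possible approach, not something this answer settles on) is to walk a buffer larger than the last-level cache between timed calls so the working set gets evicted:

// sketch: crude cache "flush" by touching a buffer larger than the last-level
// cache; the 64 MB size and 64-byte cache-line stride are assumptions
static readonly byte[] evictionBuffer = new byte[64 * 1024 * 1024];

static void FlushCaches()
{
    for (int i = 0; i < evictionBuffer.Length; i += 64)
        evictionBuffer[i]++;
}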

Jon Hanna
  • The `StopWatch` simply queries the HRT if available (in my case, yes) for both starting and stopping. So there's minimal overhead there. For the general case, I'll assume optimal caching. I'm interested in the algorithm, not how fast my memory is. Though I'll keep that in mind if I get interested in the startup times. – Jeff Mercado Nov 15 '10 at 03:40
  • Well, the algorithm does depend on differing memory speeds (in themselves all algorithms are just math, and 1 + 1 doesn't take any time to equal 2). Assuming optimal caching sets the biases regarding memory in a particular direction, assuming bad caching sets it in another, but both biases depend on relative memory speeds. – Jon Hanna Nov 15 '10 at 08:31