C#: Is this benchmarking class accurate?

Question

I created a simple class to benchmark some methods of mine. But is it accurate? I am kind of new to benchmarking, timing, et cetera, so thought I could ask for some feedback here. Also, if it is good, maybe somebody else can make use of it as well :)

public static class Benchmark
{
    public static IEnumerable<long> This(Action subject)
    {
        var watch = new Stopwatch();
        while (true)
        {
            watch.Reset();
            watch.Start();
            subject();
            watch.Stop();
            yield return watch.ElapsedTicks;
        }
    }
}

You can use it like this:

var avg = Benchmark.This(() => SomeMethod()).Take(500).Average();

Any feedback? Does it look to be pretty stable and accurate, or have I missed something?

+1 very original implementation. I'll be trying it out to gauge if its accurate *enough* for the type of off the cuff testing I do. Thx! — Zach Bonham, Oct 02 '09 at 04:01
Can you not just use `var avg = Benchmark.This(SomeMethod).Take(500).Average();` instead of using a lambda? — Lucas Jones, Oct 02 '09 at 17:21
In this example I actually could, yes. Just wrote it like that to make it clearer that I was sending in an Action that could potentially be a lambda or anything. — Svish, Oct 02 '09 at 18:24
If you are using VS 2010 beta2, and copying Svish's code into a new class created by Project/Add Class : note that a reference to using System.Collections; is not automatically added to the class ... and you will need it. — BillW, Nov 01 '09 at 10:15
You can use this open-source framework: [BenchmarkDotNet](https://github.com/AndreyAkinshin/BenchmarkDotNet). It include Stopwatch using, GC pre-call, warmup, set priority of process, thred, ProcessorAffinity-mask, api for benchmark competitions and nice console output with results. — AndreyAkinshin, Sep 12 '13 at 16:09

score 21 · Accepted Answer · edited Feb 01 '13 at 20:46

It is about as accurate as you can get for a simple benchmark. But there are some factors not under your control:

load on the system from other processes
state of the heap before/during the benchmark

You could do something about that last point, a benchmark is one of the rare situations where calling GC.Collect can be defended. And you might call subject once beforehand to eliminate any JIT issues. But that requires calls to subject to be independent.

public static IEnumerable<TimeSpan> This(Action subject)
{
    subject();     // warm up
    GC.Collect();  // compact Heap
    GC.WaitForPendingFinalizers(); // and wait for the finalizer queue to empty

    var watch = new Stopwatch();
    while (true)
    {
        watch.Reset();
        watch.Start();
        subject();
        watch.Stop();
        yield return watch.Elapsed;  // TimeSpan
    }
}

For bonus, your class should check the System.Diagnostics.Stopwatch.IsHighResolution field. If it is off, you only have a very coarse (20 ms) resolution.

But on an ordinary PC, with many services running in the background, it is never going to be very accurate.

It's not a bad idea to wait for pending finalizers after that collection. Remember, the finalizers run on a different thread; the finalizers for the previous run could be running on another thread while the current run is going. — Eric Lippert, Oct 02 '09 at 17:01
All good ideas. I've implemented them in my class. Only thing I changed was to just return the `TimeSpan` instead of the `ElapsedMilliseconds` or `Ticks` :) — Svish, Oct 02 '09 at 21:37

score 10 · Answer 2 · answered Oct 02 '09 at 17:07

Couple problems here.

First, remember that the first time you run the code, the transitive closure of its method calls will be jitted. That means that the first run is likely to have higher cost than every subsequent run. Depending on whether you are benchmarking "cold" timings or "hot" timings, this could make a difference. I have seen methods where the cost of jitting the method was higher than every other call to it put together!

Second, remember that the garbage collector runs on another thread. If you are making garbage in one run, then the cost of cleaning up that garbage might not be realized until suebsequent runs. You are therefore failing to account for the total cost of one run, by foisting it off onto later runs.

Both of these are indicative of the weakness of all benchmarking: benchmarking is by nature unrealistic, and therefore of limited value. In real-world code, the GC is going to be running, the jitter is going to be running, and so on. It is frequently the case that benchmarked performance is nothing at all like real-world performance because the benchmark does not take into account the variability of real-world costs inherent in a large system. Rather than analyzing perf characteristics in isolation, I prefer to look at the perf characteristics of realistic scenarios actually faced by real customers.

score 7 · Answer 3 · edited Nov 07 '09 at 09:49

7

You should definitely return ElapsedMilliseconds instead of ElapsedTicks. The value returned by ElapsedTicks is dependent upon the Stopwatch frequency, which can be different on different systems. It will not necessarily correspond to the Ticks property of a Timespan or DateTime object.

See http://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch.elapsedticks.aspx.

If you do want the extra resolution of Ticks, you should return watch.Elapsed.Ticks (i.e. Timestamp.Ticks) instead of watch.ElapsedTicks (this might be one of the subtlest potential errors in .Net). From MSDN:

Stopwatch ticks are different from DateTime.Ticks. Each tick in the DateTime.Ticks value represents one 100-nanosecond interval. Each tick in the ElapsedTicks value represents the time interval equal to 1 second divided by the Frequency.

Other than that, I guess your code is fine, although I think you'd be including some of the method-calling overhead in your measurements, which might be significant if the methods themselves take very little time to execute. Also, you probably would want to exclude the first call to the method from your calculated average, but I'm not sure how you'd do that in your class.

One last point, which would probably not be relevant to most uses of this class: Stopwatch runs a bit fast compared to the system time. On my computer, it gets about 5 seconds (that's seconds, not milliseconds) ahead after 24 hours, and on other machines this drift can be even larger. So it's a little misleading to say it's highly accurate, when it's actually just highly granular. For timing short-duration methods, this obviously wouldn't be a significant problem.

And one more last point, which certainly is relevant: I've often noticed while benchmarking that I'll get a bunch of running times that are all clustered within a narrow range of values (e.g. 80, 80, 79, 82 etc.), but occasionally something else will happen in Windows (like opening another program or my anti-virus kicks on or something) and I'll get a value wildly out of whack with the others (e.g. 80, 80, 79, 271, 80 etc.). I think a simple solution to this outlier problem is to use the median of your measurements instead of the mean. I don't know if Linq supports this automatically or not.

edited Nov 07 '09 at 09:49

Ash

60,973
31
151
169

answered Oct 02 '09 at 02:21

MusiGenesis

74,184
40
190
334

You have a point about the method-calling overhead. Can't really get away with that here though... I think... Anyways it shouldn't be much and the methods I am using it on now are either so fast that I don't care if it is inaccurate or not, or so long that the method-calling overhead is tiny in comparison :p – Svish Oct 02 '09 at 02:39
@Svish: the overhead of the method-calling is suprisingly large. To see the difference, trying timing a loop where you multiply two numbers inline vs. passing the two numbers to a method as parameters and multiplying inside that method. – MusiGenesis Oct 02 '09 at 02:48
1

+1 for the catch of ElapsedTicks v. Elapsed.Ticks. Also, have you tried using a percentile to weed out the 'edge' cases? e.g. 90% of iterations were at, or below, a threshold. See http://www.techbookreport.com/tutorials/quantiles.html though I cheat and just use Excel. :) – Zach Bonham Oct 02 '09 at 04:07
@Zach: I learned about ElapsedTicks vs. Elapsed.Ticks the usual way - by getting slapped around on StackOverflow. Personally, I eliminate outliers by employing the "inter-ocular percussion test" (a.k.a. "it hits you right between the eyes"). :) – MusiGenesis Oct 02 '09 at 04:14
`ElapsedTicks` versus `Elapsed.Ticks` is not an error. They're two different measures that happen to have the same name because of a coincidence. Ticks in a `Stopwatch` context is the number of timer intervals, and is dependent on timer resolution. It is obviously a useful measure to have, as it represents the maximum accuracy you can get. `Ticks` in a `DateTime` context is just a number of 100 nanosecond intervals, which is a commonly used way to represent time intervals on computers. – Joren Oct 02 '09 at 09:48
1

@Joren: using ElapsedTicks is not necessarily an error. However, this is a class for benchmarking, so it almost certainly would end up being an error if the user of the class dumped the ElapsedTicks value into a TimeSpan in order to calculate the elapsed duration. I should have said "this might be one of the subtlest *potential* errors in .Net". In fact, I *will* say that. – MusiGenesis Oct 02 '09 at 12:42
Would it maybe make more sense to just return the Elapsed TimeSpan itself? – Svish Oct 02 '09 at 14:34
Oh, and also, very good idea to use a median instead of a mean. Not sure if Linq has it, probably not. But shouldn't be hard to create anyways :) – Svish Oct 02 '09 at 14:35
@Svish: would that work with `yield`? I would just return ElapsedMilliseconds - you're never really going to get sub-millisecond accuracy anyway, no matter how you do this. – MusiGenesis Oct 02 '09 at 14:38
As long as I changed the return type to TimeSpan I don't see a problem with it. TimeSpan is just a regular struct, as far as I know. And about the accuracy, I don't really need need sub-millisecond accuracy. But I want to see the number. Not because I want to know the exact running time, but because I want to compare it to the running time of other methods. I am using it to compare implementations of the same thing, kind of. If that made sense :p – Svish Oct 02 '09 at 18:28
`ElapsedTotalMilliseconds` gives more accurate result than `ElapsedMilliseconds`. Otherwise +1 for *mean*, but I take *mode* usually :) – nawfal Apr 22 '13 at 05:58

score 2 · Answer 4 · answered Oct 02 '09 at 02:20

2

As I am not a C# programmer, I cannot say with any deal of accuracy whether that class is an appropriate implementation for counting how long a function execution takes. However, there are things to keep in mind for repeatability and accuracy.

I am not up on the various ins and outs of the .NET Framework, but depending on how it compiles to native code, it might be possible that any compilation would affect benchmark results. Also, whether or not a function is in the cache can make a difference, too. So you'll want to loop over your function to ensure that there is no hit from compilation and that everything is loaded and ready. Once that's done, you might be able to get started.

Others will probably have better information and knowledge of .NET than I do.

answered Oct 02 '09 at 02:20

ZachS

1,238
7
11

Normally a compiled .Net application isn't a native EXE, but your points are all pretty much valid anyway. An application running in Debug mode in Visual Studio will generally run slower than a .Net EXE started normally, and also functions will generally run slower the first time they're called (especially if they make calls into other Assemblies that have to be loaded), so it makes sense to call a method the first time without including its running time in the average measurement. – MusiGenesis Oct 02 '09 at 02:29
Good idea, should edit it so that it runs the method once before starting to yield results :) – Svish Oct 02 '09 at 02:37

C#: Is this benchmarking class accurate?

4 Answers4

Linked