Tests show "await" is significantly slower, even when object being awaited is already Complete

Question

I wanted to test the overhead ascribed to a program by using await/async.

To test this, I wrote the following test class:

public class Entity : INotifyCompletion {
    private Action continuation;
    private int i;

    public void OnCompleted(Action continuation) {
        this.continuation = continuation;
    }

    public Entity GetAwaiter() {
        return this;
    }

    public Entity GetResult() {
        return this;
    }

    public bool IsCompleted { get { return true; } }

    public void Execute() {
        if (i > 0) Console.WriteLine("What");
    }
}

And then I wrote a test harness. The test harness iterates through TestA and TestB 1600 times, measuring the latter 1500 times only (to allow the JIT to 'warm up'). set is a collection of Entity objects (but the implementation is irrelevant). There are 50,000 entities in the set. The test harness uses the Stopwatch class for testing.

private static void DoTestA() {
    Entity[] objects = set.GetElements();
    Parallel.For(0, objects.Length, async i => {
        Entity e = objects[i];
        if (e == null) return;

        (await e).Execute();
    });
}

private static void DoTestB() {
    Entity[] objects = set.GetElements();
    Parallel.For(0, objects.Length, i => {
        Entity e = objects[i];
        if (e == null) return;

        e.Execute();
    });
}

The two routines are identical, except one is awaiting the entity before calling Execute() (Execute() does nothing useful, it's just some dumb code to make sure the processor is really doing something for each Entity).

After executing my test in release mode targeting AnyCPU, I get the following output:

>>> 1500 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method   Avg.         Min.         Max.         Jitter       Total
A        1,301,465ns  1,232,200ns  2,869,000ns  1,567,534ns  ! 1952.199ms
B        130,053ns    116,000ns    711,200ns    581,146ns    ! 195.081ms

As you can see, the method with the await in it is about 10 times slower.

The thing is, as far as I know, there is nothing 'to' await - GetResult is always true. Does this mean that the state machine is executed even if the awaited 'thing' is already ready?

If so, is there any way around this? I'd like to use the semantics of async/await but this overhead is too high for my application...

EDIT: Adding full benchmark code after requested:

Program.cs

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Diagnostics;
using System.Linq;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace CSharpPerfTest {
    public class Entity : INotifyCompletion {
        private Action continuation;
        private int i;

        public void OnCompleted(Action continuation) {
            this.continuation = continuation;
        }

        public Entity GetAwaiter() {
            return this;
        }

        public Entity GetResult() {
            return this;
        }

        public bool IsCompleted { get { return true; } }

        public void Execute() {
            if (i > 0) Console.WriteLine("What");
        }
    }

    static class Program {
        static ConcurrentSet<Entity> set;
        const int MAX_ELEMENTS = 50000;

        // Called once before all testing begins
        private static void OnceBefore() {
            set = new ConcurrentSet<Entity>();

            Parallel.For(0, MAX_ELEMENTS, i => {
                set.Add(new Entity());
            });
        }

        // Called twice each repetition, once before DoTestA and once before DoTestB
        private static void PreTest() {

        }

        private static void DoTestA() {
            Entity[] objects = set.GetElements();
            Parallel.For(0, objects.Length, async i => {
                Entity e = objects[i];
                if (e == null) return;
                (await e).Execute();
            });
        }

        private static void DoTestB() {
            Entity[] objects = set.GetElements();
            Parallel.For(0, objects.Length, i => {
                Entity e = objects[i];
                if (e == null) return;
                e.Execute();
            });
        }

        private const int REPETITIONS = 1500;
        private const int JIT_WARMUPS = 10;

        #region Test Harness
        private static double[] aTimes = new double[REPETITIONS];
        private static double[] bTimes = new double[REPETITIONS];

        private static void Main(string[] args) {
            Stopwatch stopwatch = new Stopwatch();

            OnceBefore();

            for (int i = JIT_WARMUPS * -1; i < REPETITIONS; ++i) {
                Console.WriteLine("Starting repetition " + i);

                PreTest();
                stopwatch.Restart();
                DoTestA();
                stopwatch.Stop();
                if (i >= 0) aTimes[i] = stopwatch.Elapsed.TotalMilliseconds;

                PreTest();
                stopwatch.Restart();
                DoTestB();
                stopwatch.Stop();
                if (i >= 0) bTimes[i] = stopwatch.Elapsed.TotalMilliseconds;
            }

            DisplayScores();
        }

        private static void DisplayScores() {
            Console.WriteLine();
            Console.WriteLine();

            bool inNanos = false;
            if (aTimes.Average() < 10 || bTimes.Average() < 10) {
                inNanos = true;
                for (int i = 0; i < aTimes.Length; ++i) aTimes[i] *= 1000000;
                for (int i = 0; i < bTimes.Length; ++i) bTimes[i] *= 1000000;
            }

            Console.WriteLine(">>> " + REPETITIONS + " repetitions >>> " + (inNanos ? "IN NANOSECONDS (1000ns = 0.001ms)" : "IN MILLISECONDS (1000ms = 1s)"));
            Console.WriteLine("Method   Avg.         Min.         Max.         Jitter       Total");

            Console.WriteLine(
            "A        "
            + (String.Format("{0:N0}", (long) aTimes.Average()) + (inNanos ? "ns" : "ms")).PadRight(13, ' ')
            + (String.Format("{0:N0}", (long) aTimes.Min()) + (inNanos ? "ns" : "ms")).PadRight(13, ' ')
            + (String.Format("{0:N0}", (long) aTimes.Max()) + (inNanos ? "ns" : "ms")).PadRight(13, ' ')
            + (String.Format("{0:N0}", (long) Math.Max(aTimes.Average() - aTimes.Min(), aTimes.Max() - aTimes.Average())) + (inNanos ? "ns" : "ms")).PadRight(13, ' ')
            + ((long) aTimes.Sum() >= 10000 && inNanos ? "! " + String.Format("{0:f3}", aTimes.Sum() / 1000000) + "ms" : (long) aTimes.Sum() + (inNanos ? "ns" : "ms"))
            );
            Console.WriteLine(
            "B        "
            + (String.Format("{0:N0}", (long) bTimes.Average()) + (inNanos ? "ns" : "ms")).PadRight(13, ' ')
            + (String.Format("{0:N0}", (long) bTimes.Min()) + (inNanos ? "ns" : "ms")).PadRight(13, ' ')
            + (String.Format("{0:N0}", (long) bTimes.Max()) + (inNanos ? "ns" : "ms")).PadRight(13, ' ')
            + (String.Format("{0:N0}", (long) Math.Max(bTimes.Average() - bTimes.Min(), bTimes.Max() - bTimes.Average())) + (inNanos ? "ns" : "ms")).PadRight(13, ' ')
            + ((long) bTimes.Sum() >= 10000 && inNanos ? "! " + String.Format("{0:f3}", bTimes.Sum() / 1000000) + "ms" : (long) bTimes.Sum() + (inNanos ? "ns" : "ms"))
            );

            Console.ReadKey();
        }
        #endregion

    }
}

The async method (the delegate inside `DoTestA`) has already been rewritten to use the state machine. — Lasse V. Karlsen, Apr 04 '14 at 19:30
Also, note that async is primarily used to wait for I/O-bound operations, and in this case the method call essentially does nothing, and so it isn't unthinkable that the overhead added by the state machine will dwarf what the method is actually doing, in this case. — Lasse V. Karlsen, Apr 04 '14 at 19:36
I agree with lasse, the body of your action is so small the overhead costs are dwarfing it. You need a more realistic test harness or if that is similar to your real code then it does not need to be awaitable. — Scott Chamberlain, Apr 04 '14 at 19:47
@LasseV.Karlsen Noted. I hoped that the compiled await code would check GetAwaiter().IsCompleted before anything else. : — Xenoprimate, Apr 04 '14 at 19:47
@Motig it does, but you added the extra 5 or 6 steps for the compiler to do that. So its like you where comparing the functions `if (i > 0) Console.WriteLine("What");` and `if(e.IsCompleted) {} if (i > 0) Console.WriteLine("What");`, of course the 2nd will be slower! The second function has more steps to do the same job. — Scott Chamberlain, Apr 04 '14 at 19:50
@ScottChamberlain Actually the code is similar to my real code. The overhead of 1ms for 50,000 calls is unacceptable - that's all that matters to me here. — Xenoprimate, Apr 04 '14 at 19:53
Start the program without debugger because the debugger disables optimization. Also, remove parallelism. It only adds lots of noise and unreliability. Your use of await does not use threading anyway. Also, try making it a struct. I'm not surprised by 10x difference. I had expected more. Maybe this is because you are testing GetElements as well which is the same for both. This makes the running times more similar. Remove it. — usr, Apr 04 '14 at 22:01
Better yet, post all your benchmarking code. In most performance questions I find minor or major errors there. — usr, Apr 04 '14 at 22:02
+1 to @usr, I'd love to see the benchmark without `Parallel.For` and without creating a new `async` state machine for each iteration. It's useless and unfair otherwise: too many extra allocations, don't blame the compiler for that. You should understand what's going on behind the scene when you use a particular language/runtime feature. — noseratio, Apr 05 '14 at 01:03
@usr & Noseratio: I'll update the question now with the full benchmark code. I did run it without the debugger, however. Also, I am loathe to remove GetElements because otherwise the compiler or JIT may notice that the parallel.for lambda does nothing, and inline it. Finally, I will keep the parallel.for because that's representative of my actual code - but feel free to take it out and try it on your own machine. — Xenoprimate, Apr 05 '14 at 01:36

score 3 · Answer 1 · edited May 23 '17 at 12:05

3

If your function has a response time that 1ms for 50,000 calls is considered significant you should not be awaiting that code and instead be running it synchronously.

Using async code has a small overhead, it has to add a function calls for the state machine that drives it internally. If the work you are making async is also small in comparison to the overhead costs of running the state machine you should make the code you need to rethink if your code should be async.

edited May 23 '17 at 12:05

Community

1
1

answered Apr 04 '14 at 19:55

Scott Chamberlain

124,994
33
282
431

Well, I disagree - I will be running it asynchronously - I just won't be able to use async/await. Thanks for your answer though. – Xenoprimate Apr 04 '14 at 20:05
1

Actually finite state machine is not created (in case of compiler optimatization enabled) when method doesn't have await in it. What creates overhead in that case is Task allocation and garbage collection of it. It can be easily avoided by returning cached Task value (for example Task.FromResult(true) in case of Task returning method). edit: source: http://channel9.msdn.com/Series/Three-Essential-Tips-for-Async/Async-libraries-APIs-should-be-chunky – fex Apr 04 '14 at 20:10

score 2 · Answer 2 · edited May 23 '17 at 11:49

Converted to an answer from the comments: apparently, this is not a clean benchmark test.

If you do not need asynchronous continuation, just don't use it. The code is always faster without it. If you do need it, then expect some overhead. You should understand what's going on behind the scene when you use a particular language/runtime feature.

Even if you remove Parallel.For, convert lambdas into methods and prevent inlining, there still will be some struct copying and await continuation closure allocation, to support the state machine functionality (an example of the generated code).

A more fair benchmark would be to test async/await vs. alternative implementation using callback closures and Task.ContinueWith, on a thread without synchronization context. I would not expect any significant difference in this case.

On a side note, you're passing an async void lambda Action into Parallel.For. You should be aware that the execution control will return to Parallel.For as soon as there's the 1st await inside the lambda, and then it's essentially a fire-and-forget call outside Parallel.For. I really can't think of any useful scenarios for this.

Tests show "await" is significantly slower, even when object being awaited is already Complete

Program.cs

2 Answers2