I've been having trouble with understanding the performance characteristics of using Func<...> throughout my code when using inheritance and generics - which is a combination I find myself using all the time.

Let me start with a minimal test case so we all know what we're talking about, then I'll post the results and then I'm going to explain what I would expect and why...

Minimal test case

using System;

public class GenericsTest2 : GenericsTest<int> 
{
    static void Main(string[] args)
    {
        GenericsTest2 at = new GenericsTest2();

        at.test(at.func);
        at.test(at.Check);
        at.test(at.func2);
        at.test(at.Check2);
        at.test((a) => a.Equals(default(int)));
        Console.ReadLine();
    }

    public GenericsTest2()
    {
        func = func2 = (a) => Check(a);
    }

    protected Func<int, bool> func2;

    public bool Check2(int value)
    {
        return value.Equals(default(int));
    }

    public void test(Func<int, bool> func)
    {
        using (Stopwatch sw = new Stopwatch((ts) => { Console.WriteLine("Took {0:0.00}s", ts.TotalSeconds); }))
        {
            for (int i = 0; i < 100000000; ++i)
            {
                func(i);
            }
        }
    }
}

public class GenericsTest<T>
{
    public bool Check(T value)
    {
        return value.Equals(default(T));
    }

    protected Func<T, bool> func;
}

public class Stopwatch : IDisposable
{
    public Stopwatch(Action<TimeSpan> act)
    {
        this.act = act;
        this.start = DateTime.UtcNow;
    }

    private Action<TimeSpan> act;
    private DateTime start;

    public void Dispose()
    {
        act(DateTime.UtcNow.Subtract(start));
    }
}

The results

Took 2.50s  -> at.test(at.func);
Took 1.97s  -> at.test(at.Check);
Took 2.48s  -> at.test(at.func2);
Took 0.72s  -> at.test(at.Check2);
Took 0.81s  -> at.test((a) => a.Equals(default(int)));

What I would expect and why

I would have expected this code to run at exactly the same speed for all 5 methods; more precisely, even faster than any of them, namely just as fast as:

using (Stopwatch sw = new Stopwatch((ts) => { Console.WriteLine("Took {0:0.00}s", ts.TotalSeconds); }))
{
    for (int i = 0; i < 100000000; ++i)
    {
        bool b = i.Equals(default(int));
    }
}
// this takes 0.32s ?!?

I expected it to take 0.32s because I don't see any reason for the JIT compiler not to inline the code in this particular case.

On closer inspection, I don't understand these performance numbers at all:

  • at.func is passed to the function and cannot be changed during execution. Why isn't this inlined?
  • at.Check is apparently slower than at.Check2 (1.97s vs. 0.72s), while neither can be overridden and the IL of at.Check in the case of class GenericsTest2 is as fixed as a rock
  • I see no reason for Func<int, bool> to be slower when passing an inline Func instead of a method that's converted to a Func
  • And why is the difference between test case 2 and 3 a whopping 0.5s while the difference between case 4 and 5 is 0.1s - aren't they supposed to be the same?

Question

I'd really like to understand this... what is going on here that using a generic base class is a whopping 10x slower than inlining the whole lot?

So, basically the question is: why is this happening and how can I fix it?

UPDATE

Based on all the comments so far (thanks!) I did some more digging.

First off, a new set of results when repeating the tests and making the loop 5x larger and executing them 4 times. I've used the Diagnostics stopwatch and added more tests (added description as well).

(Baseline implementation took 2.61s)

--- Run 0 ---
Took 3.00s for (a) => at.Check2(a)
Took 12.04s for Check3<int>
Took 12.51s for (a) => GenericsTest2.Check(a)
Took 13.74s for at.func
Took 16.07s for GenericsTest2.Check
Took 12.99s for at.func2
Took 1.47s for at.Check2
Took 2.31s for (a) => a.Equals(default(int))
--- Run 1 ---
Took 3.18s for (a) => at.Check2(a)
Took 13.29s for Check3<int>
Took 14.10s for (a) => GenericsTest2.Check(a)
Took 13.54s for at.func
Took 13.48s for GenericsTest2.Check
Took 13.89s for at.func2
Took 1.94s for at.Check2
Took 2.61s for (a) => a.Equals(default(int))
--- Run 2 ---
Took 3.18s for (a) => at.Check2(a)
Took 12.91s for Check3<int>
Took 15.20s for (a) => GenericsTest2.Check(a)
Took 12.90s for at.func
Took 13.79s for GenericsTest2.Check
Took 14.52s for at.func2
Took 2.02s for at.Check2
Took 2.67s for (a) => a.Equals(default(int))
--- Run 3 ---
Took 3.17s for (a) => at.Check2(a)
Took 12.69s for Check3<int>
Took 13.58s for (a) => GenericsTest2.Check(a)
Took 14.27s for at.func
Took 12.82s for GenericsTest2.Check
Took 14.03s for at.func2
Took 1.32s for at.Check2
Took 1.70s for (a) => a.Equals(default(int))

These results show that the moment you start using generics, it gets much slower. Digging a bit more into the IL, I found for the non-generic implementation:

L_0000: ldarga.s 'value'
L_0002: ldc.i4.0 
L_0003: call instance bool [mscorlib]System.Int32::Equals(int32)
L_0008: ret 

and for all the generic implementations:

L_0000: ldarga.s 'value'
L_0002: ldloca.s CS$0$0000
L_0004: initobj !T
L_000a: ldloc.0 
L_000b: box !T
L_0010: constrained. !T
L_0016: callvirt instance bool [mscorlib]System.Object::Equals(object)
L_001b: ret 

While most of this can be optimized, I suppose the callvirt can be a problem here.

In an attempt to make it faster I added the 'T : IEquatable<T>' constraint to the definition of the method. The result is:

L_0011: callvirt instance bool [mscorlib]System.IEquatable`1<!T>::Equals(!0)

While I understand more about the performance now (it probably cannot inline because it creates a vtable lookup), I'm still confused: Why doesn't it simply call T::Equals? After all, I do specify it will be there...

atlaste
  • 12
    I was with you right until you said "debug/release helps"... you can't do meaningful performance analysis in the debugger. The debugger *turns off the optimizer* so that the program gets easier to debug. – Eric Lippert Mar 27 '13 at 21:03
  • 3
    to avoid JIT issues, make at least one call to your delegate in `test` method before making measurements. – L.B Mar 27 '13 at 21:07
  • 1
    I get similar results with a release build run outside Visual Studio - the last two tests are much faster. – Matthew Watson Mar 27 '13 at 21:09
  • @EricLippert I understand that, sorry for the confusion. I obviously executed everything without a debugger - when I'm compiling for the Debug target it's just slower, but the factor remains. As for the Stopwatch - it should now compile so you can see for yourself. – atlaste Mar 27 '13 at 21:09
  • @L.B That doesn't help a thing... If it would it would have surprised me even more - filling in a simple generic like that doesn't cost a second even on the cheapest hardware you can find. So no, that's not it. – atlaste Mar 27 '13 at 21:15
  • Maybe use `System.Diagnostics.Stopwatch` instead of `DateTime` and call your own class differently? – H H Mar 27 '13 at 21:17
  • I changed my version to use a proper Stopwatch, but I get similar results. – Matthew Watson Mar 27 '13 at 21:22
  • @HenkHolterman P.S. these simple stopwatches are not high-precision, but good enough for an accuracy of +/- 100K ticks which is roughly 0.01s. You can reasonably assume the issue is in the call of the func. – atlaste Mar 27 '13 at 21:26
  • It just makes your code harder to read. Also the uniform "Took" isn't very helpful. – H H Mar 27 '13 at 21:29
  • 2
    Have you considered looking at the differences in the generated IL via reflector, dotpeek, ilspy, or ildasm? – JerKimball Mar 27 '13 at 22:18
  • @JerKimball Yes... but that doesn't really tell anything about what's happening in the JIT compiler. As you might imagine, the calls will remain the same. Like most people, I make assumptions on what happens beyond IL based on what I've read about generics and how it operates, but my measurements simply contradicts the behavior I expect. That said, if you can figure it out using tools like that, please share... – atlaste Mar 27 '13 at 22:26
  • Well, naturally the jit will optimize differently based on the il - was wondering if you saw anything... "weird". I'll see if I can dig anything up tonight. – JerKimball Mar 27 '13 at 22:51
  • Regardless of the stop watch you're using, extending the measurement period to at least several seconds, e.g. go for a billion of iterations. Differences less than one second might be due to the "background noise" of your OS or because of on-demand compilation (what L.B pointed out). – chris Mar 28 '13 at 10:58
  • @chris: Fair enough. Tried it by making the test 10x as large. The result is that the difference is still huge (x3). – atlaste Mar 28 '13 at 11:06

2 Answers

Always run micro-benchmarks at least 3 times. The first run triggers the JIT, so its timings can be ruled out. Then check whether the 2nd and 3rd runs agree. This gives:

... run ...
Took 0.79s
Took 0.63s
Took 0.74s
Took 0.24s
Took 0.32s
... run ...
Took 0.73s
Took 0.63s
Took 0.73s
Took 0.24s
Took 0.33s
... run ...
Took 0.74s
Took 0.63s
Took 0.74s
Took 0.25s
Took 0.33s
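
The warm-up-then-repeat approach described above can be sketched roughly as follows. This is only a minimal sketch, not the harness used for the numbers in this answer: the `Bench` class, the `Time` helper, and the `Sink` field are hypothetical names introduced for illustration, and it assumes `System.Diagnostics.Stopwatch` rather than the custom `Stopwatch` class from the question.

```csharp
using System;
using System.Diagnostics;

public static class Bench
{
    // Publicly visible sink so the results of the calls are actually used.
    public static bool Sink;

    // Hypothetical helper (not from the original post): calls the delegate once
    // as a warm-up, then times `iterations` further calls.
    public static TimeSpan Time(Func<int, bool> func, int iterations)
    {
        func(0); // warm-up: one-time JIT cost is paid before the timer starts

        bool acc = false;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; ++i)
        {
            acc ^= func(i);
        }
        sw.Stop();

        Sink = acc;
        return sw.Elapsed;
    }

    public static void Main()
    {
        Func<int, bool> check = a => a.Equals(default(int));

        // Repeat the whole measurement; compare runs 2 and 3 for stability.
        for (int run = 1; run <= 3; ++run)
        {
            Console.WriteLine("... {0}. run ...", run);
            Console.WriteLine("Took {0:0.00}s", Time(check, 100000000).TotalSeconds);
        }
    }
}
```

The timings depend on the machine, the runtime version, and the build configuration, so measure a release build started outside the debugger.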

The line

func = func2 = (a) => Check(a);

adds an additional function call. Replacing it with

func = func2 = this.Check;

gives:

... 1. run ...
Took 0.64s
Took 0.63s
Took 0.63s
Took 0.24s
Took 0.32s
... 2. run ...
Took 0.63s
Took 0.63s
Took 0.63s
Took 0.24s
Took 0.32s
... 3. run ...
Took 0.63s
Took 0.63s
Took 0.63s
Took 0.24s
Took 0.32s

This shows that the (JIT?) effect between the first and second run disappeared once the extra function call was removed. The first three tests are now equal.
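
One way to see the extra hop is to look at which method each delegate actually targets. The sketch below is a simplified, static, non-generic stand-in for the code in the question; the `DelegateTargets` class is a hypothetical name used only for illustration.

```csharp
using System;

public static class DelegateTargets
{
    public static bool Check(int value)
    {
        return value.Equals(default(int));
    }

    public static void Main()
    {
        // Lambda wrapper: the delegate points at a compiler-generated method,
        // which in turn calls Check.
        Func<int, bool> wrapped = a => Check(a);

        // Method group conversion: the delegate points at Check itself.
        Func<int, bool> direct = Check;

        Console.WriteLine(direct.Method.Name);             // prints "Check"
        Console.WriteLine(wrapped.Method.Name == "Check"); // prints "False"
    }
}
```

The wrapped delegate therefore costs one delegate invocation plus one ordinary call per iteration, unless the JIT happens to inline the inner call.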

In tests 4 and 5, the compiler can inline the function argument to void test(Func<>), while in tests 1 to 3 it would be a long way for the compiler to figure out that they are constant. Sometimes there are constraints on the compiler that are not easy to see from a coder's perspective, such as .NET and JIT constraints that stem from the dynamic nature of .NET programs compared to a binary built from C++. Either way, it is the inlining of the function argument that makes the difference here.

Difference between 4 and 5? Test 5 looks like something the compiler could very easily inline as well. Maybe it builds a context for closures and resolves it in a more complex way than needed. I did not dig into the MSIL to find out.

The tests above ran on .NET 4.5. Here are the results with 3.5, demonstrating that the compiler got better at inlining:

... 1. run ...
Took 1.06s
Took 1.06s
Took 1.06s
Took 0.24s
Took 0.27s
... 2. run ...
Took 1.06s
Took 1.08s
Took 1.06s
Took 0.25s
Took 0.27s
... 3. run ...
Took 1.05s
Took 1.06s
Took 1.05s
Took 0.24s
Took 0.27s

and .Net 4:

... 1. run ...
Took 0.97s
Took 0.97s
Took 0.96s
Took 0.22s
Took 0.30s
... 2. run ...
Took 0.96s
Took 0.96s
Took 0.96s
Took 0.22s
Took 0.30s
... 3. run ...
Took 0.97s
Took 0.96s
Took 0.96s
Took 0.22s
Took 0.30s

Now changing GenericsTest<T> to a non-generic GenericsTest:

... 1. run ...
Took 0.28s
Took 0.24s
Took 0.24s
Took 0.24s
Took 0.27s
... 2. run ...
Took 0.24s
Took 0.24s
Took 0.24s
Took 0.24s
Took 0.27s
... 3. run ...
Took 0.25s
Took 0.25s
Took 0.25s
Took 0.24s
Took 0.27s

Well this is a surprise from the C# compiler, similar to what I encountered with sealing classes to avoid virtual function calls. Maybe Eric Lippert has a word on that?

Changing the inheritance to aggregation brings performance back. I learned to never use inheritance (OK, very, very rarely) and can highly recommend that you avoid it, at least in this case. (This is my pragmatic solution to this question, no flamewars intended.) I use interfaces all the way though, and they carry no performance penalties.

citykid
  • This is strange, these aren't the timings I get at all when I try to do exactly this... Are you using .NET 4.0 or 4.5? (I'm using 4.0). Also, I'll post some of my findings based on all the comments. – atlaste Mar 28 '13 at 11:19
  • thanks, it seems they did change something... I also added more info; my findings are that generics use simple inheritance (`callvirt`) instead of making a real `call`... that's a real game changer, and it definitely explains why it's so damn slow... Any comments on these findings? – atlaste Mar 28 '13 at 11:35
  • In brief, I remember that callvirt occurs in places where one expects call. I had such an issue with sealed classes, where methods were not called directly although the compiler should have known. Usually there is a workaround to get the last performance pieces. In your case not? – citykid Mar 28 '13 at 11:51
  • No, my design is based on the assumption that generics use `call` instead of `callvirt` (in certain cases...). I've dug through a lot of IL now and it appears that everywhere generics are used, calls are solved with a `callvirt`. IMHO this is a very serious design flaw in the .NET compiler. For my program, this is a serious setback because every single class I have is a generic class, which I will now have to rewrite to 'object' and manual casting (somehow...) to get a somewhat decent performance... As you might understand, I'm currently not amused... – atlaste Mar 28 '13 at 11:59
  • as in the edit: changing inheritance to aggregation if possible would solve it. I also checked to seal the class, and this is what I also consider bad: The compiler really should make performance sense of sealed other than constraining the developer. – citykid Mar 28 '13 at 12:14
  • I couldn't agree more. See below for my analysis, which I just added for completeness. I filed a bug report at Microsoft, let's see what happens... – atlaste Mar 28 '13 at 13:06
  • 3
    When the method called is not virtual, callvirt is documented as having the exact same semantics as call, except that callvirt does a null check on top. That's why the C# compiler generates the callvirt instead of the call; because it knows that it needs the null check on the receiver. Otherwise it would have to generate the null check followed by call, which would be larger and slower code. – Eric Lippert Mar 28 '13 at 13:51
  • 1
    As for the question of why performance is worse in the generic class, I don't know; I'm not an expert on the jitter. It would be interesting for you to compare the x86 and x64 jitters, as they are completely different code. – Eric Lippert Mar 28 '13 at 13:54
  • @EricLippert I just noticed you write something I find strange... afaik `callvirt` does a vtable lookup and a null check, but `call` does neither of these. In other words, if you `call` `Object.Equals`, it doesn't handle inheritance, while a `callvirt` does. But you're saying that both do a vtable lookup? – atlaste Mar 29 '13 at 10:33
  • 1
    As it turns out this problem was about selecting the correct Equals call (see my answer below). As for the inlining / JIT'ter, it turns out that you can actually help the compiler quite a bit: Delegates are never inlined, but func's usually are. In other words, if you pass a func as a parameter to a function, it will be faster than using a delegate stored in an object... do that a lot and you will end up with faster code (in my case the results were quite significant). See my answer below and MS connect for details. – atlaste Mar 29 '13 at 10:39
  • 2
    @StefandeBruijn: callvirt doesn't do a vtable lookup if there is no vtable! It is perfectly legal to call a static or instance method with callvirt; the jitter will turn it into a non-virtual call with a null check on top. – Eric Lippert Mar 29 '13 at 14:47
  • @EricLippert "callvirt doesn't do a vtable lookup if there is no vtable!" - I'm not questioning that. I was questioning the opposite: afaik `callvirt` *might* do a vtable lookup (if there is one), but `call` will never do a vtable lookup (even if there is one). In other words: `callvirt` does more than just the null check (your statement "callvirt is documented as having the exact same semantics as call" confuses me). – atlaste Mar 29 '13 at 15:41
  • @EricLippert **AH** you're referring to "When the method called is not virtual", I missed that while reading it the first 4 times... and suddenly it all makes sense again. :-) – atlaste Mar 29 '13 at 16:24
  • @StefandeBruijn: Excellent. Incidentally, in some scenarios where the C# compiler can trivially prove that the receiver is not null then call *is* used instead of callvirt when calling a non-virtual method. See my comment to http://stackoverflow.com/questions/845657 – Eric Lippert Mar 29 '13 at 16:57
  • @EricLippert Ah, that also explains why calls to `this` are `call`'s and calls to other objects are `callvirt` ([..] incidental exceptions excluded). I've noticed quite a few cases now where your C# code influences the IL code, this is a new one... my observation so far is that it makes sense to create your code as brief as possible. F.ex. probably `var type = (a = new Object()).GetType();` will evaluate to a `dup` and a `call` and is therefore not the same as `a = new Object(); var type = a.GetType()`... I just hope the JIT compiles it all away, although it's pretty much a black box to me... – atlaste Mar 29 '13 at 18:46
  • @StefandeBruijn: The new IL optimizer in Roslyn does a pretty good job with this sort of thing. However, of course the IL is not what runs. Let the jitter do its job and don't worry too much about what the IL says. – Eric Lippert Mar 29 '13 at 20:08
  • @EricLippert Well now, that's exactly the problem, isn't it?: I want to be able to predict the results I'm getting. When I design an application, I can more or less predict the code I will have to write - and when I write a piece of code, I can predict what IL it will give. However, without opening the 'JIT' black box, it's impossible to predict what the performance will be of the whole - in fact I've found that some constructions are about 5x slower than others. My problems usually involve processing lots and lots of data, so predicting results really matters to me... – atlaste Mar 29 '13 at 22:46

I'm going to explain what I think is going on here and with all generics. I needed some space to write, so I'm posting this as an answer. Thank you all for commenting and helping figuring this out, I'll make sure to award points here and there.

To get started...

Compiling generics

As we all know, generics are 'template' types where the compiler fills in the type information at run-time. It can make assumptions based on the constraints, but it doesn't change the IL code... (but more about that later).

A method from my question:

public class Foo<T>
{
    public bool Handle(T foo) 
    {
        return foo.Equals(default(T));
    }
}

The constraints here are that T is an Object, which means the call to Equals is going to Object.Equals. Since T is implementing Object.Equals, this will look like:

L_0016: callvirt instance bool [mscorlib]System.Object::Equals(object)

We can improve on this by making it explicit that T implements Equals by adding the constraint T : IEquatable<T> . This changes the call to:

L_0011: callvirt instance bool [mscorlib]System.IEquatable`1<!T>::Equals(!0)

However, since T hasn't been filled in yet, apparently the IL doesn't support calling T::Equals(!0) directly even though it is surely there. The compiler can apparently only assume the constraint has been fulfilled, hence it needs to issue a call to IEquatable`1, which defines the method.

Apparently hints like sealed don't make a difference, even though they should have.

Conclusion: Because T::Equals(!0) is not supported, a vtable lookup is required to make it work. Once it has become a callvirt, it's damn difficult for the JIT compiler to figure out that it should have just used a call.

What should happen: Basically Microsoft should support T::Equals(!0) when this method clearly exists. That changes the call to a normal call in IL, making it much faster.

But it gets worse

So what about calling Foo::Handle?

What surprised me is that the call to Foo<T>::Handle is also a callvirt and not a call. The same behavior can be found for f.ex. List<T>::Add and so on. My observation was that only calls that use this will become a normal call; everything else will compile as a callvirt.

Conclusion: The behavior is as-if you get a class structure like Foo<int>:Foo<T>:[the rest], which doesn't really make sense. Apparently all calls to a generic class from outside that class will compile a vtable lookup.

What should happen: Microsoft should change the callvirt to a call if the method is non-virtual. There's really no reason at all for the callvirt.

Conclusion

If you use generics from another type, be prepared to get a callvirt instead of a call, even if this isn't necessary. The resulting performance is basically what you can expect from such a call...

IMHO this is a real shame. Type safety should help developers and at the same time make your code faster because the compiler can make assumptions about what's going on. My lesson learned from all this is: don't use generics, unless you don't care about the extra vtable lookups (until Microsoft fixes this).

Future work

First off, I'm going to post this on Microsoft Connect. I think this is a serious bug in .NET that drains performance without any good reason. ( https://connect.microsoft.com/VisualStudio/feedback/details/782346/using-generics-will-always-compile-to-callvirt-even-if-this-is-not-necessary )


Results from Microsoft Connect

Yes, we have results, with my express thanks to Mike Danes!

The method call to foo.Equals(default(T)) will compile to Object.Equals(boxed[new !0]) because the only equals that all T's have in common is Object.Equals. This will cause a boxing operation and a vtable lookup.

If we want the thing to use the correct Equals, we have to give the compiler a hint, namely that the type implements bool Equals(T). This can be done by telling the compiler that the type T implements IEquatable<T>.

In other words: change the signature of the class as follows:

public class GenericsTest<T> where T:IEquatable<T>
{
    public bool Check(T value)
    {
        return value.Equals(default(T));
    }

    protected Func<T, bool> func;
}

When you do it like this, the runtime will find the correct Equals method. Phew...
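
To check which Equals overload is actually picked with and without the constraint, a small probe type can record the call. This is a minimal sketch for illustration only: `Probe`, `Unconstrained<T>` and `Constrained<T>` are hypothetical names that do not appear in the original code, and the sketch shows overload selection, not timings.

```csharp
using System;

// Hypothetical probe type: records which Equals overload was invoked.
public struct Probe : IEquatable<Probe>
{
    public static string LastCall = "";

    public bool Equals(Probe other)
    {
        LastCall = "Equals(Probe)";
        return true;
    }

    public override bool Equals(object obj)
    {
        LastCall = "Equals(object)";
        return obj is Probe;
    }

    public override int GetHashCode() { return 0; }
}

public static class Unconstrained<T>
{
    // Without a constraint, the only Equals known for T is Object.Equals(object),
    // so the argument default(T) is boxed.
    public static bool Check(T value) { return value.Equals(default(T)); }
}

public static class Constrained<T> where T : IEquatable<T>
{
    // With the constraint, overload resolution picks IEquatable<T>.Equals(T).
    public static bool Check(T value) { return value.Equals(default(T)); }
}

public static class Program
{
    public static void Main()
    {
        Unconstrained<Probe>.Check(new Probe());
        Console.WriteLine(Probe.LastCall); // prints "Equals(object)"

        Constrained<Probe>.Check(new Probe());
        Console.WriteLine(Probe.LastCall); // prints "Equals(Probe)"
    }
}
```

For `int`, the same mechanism means the constrained version ends up in `Int32.Equals(int)` instead of `Int32.Equals(object)`.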

To solve the puzzle completely, one more element is required: .NET 4.5. The runtime of .NET 4.5 is able to inline this method, thereby making it as fast as it should be again. In .NET 4.0 (that's what I'm currently using), this functionality doesn't appear to be there. The call will still be a callvirt in IL, but the runtime will solve the puzzle regardless.

If you test this code, it should be just as fast as the fastest test cases. Can someone please confirm this?

atlaste
  • Great you opened a call, let us know about news. This is about inheritance, generics and the general fact that the c# compiler ignores performance optimizations possible on sealed classes. Would volunteer to contribute to a codeplex or github repo with demo cases. – citykid Mar 28 '13 at 13:09
  • I knew there was a "reason" and I found it again! "We thought that being able to call a method on a null instance was a bit weird. Peter Golde did some testing to see what the perf impact was of always using callvirt, and it was small enough that we decided to make the change." Sorry guys, very very bad decision, please revise it. http://blogs.msdn.com/b/ericgu/archive/2008/07/02/why-does-c-always-use-callvirt.aspx – citykid Mar 28 '13 at 13:20
  • 1
    @citykid I have some results back from Microsoft Connect. Since I'm not on .NET 4.5 yet but apparently you are - could you please confirm that this solves the complete puzzle? – atlaste Mar 28 '13 at 16:48