Generic vs not-generic performance in C#

Question

I've written two equivalent methods:

static bool F<T>(T a, T b) where T : class
{
    return a == b;
}

static bool F2(A a, A b)
{
    return a == b;
}

Time difference:
00:00:00.0380022
00:00:00.0170009

Code for testing:

var a = new A();
for (int i = 0; i < 100000000; i++)
    F<A>(a, a);
Console.WriteLine(DateTime.Now - dt);

dt = DateTime.Now;
for (int i = 0; i < 100000000; i++)
    F2(a, a);
Console.WriteLine(DateTime.Now - dt);

Does anyone know why?

In a comment below, dtb* show CIL:

IL for F2: ldarg.0, ldarg.1, ceq, ret. IL for F<T>: ldarg.0, box !!T, ldarg.1, box !!T, ceq, ret.

I think it's the answer for my question, but what magic can I use to deny boxing?

Next I use code from Psilon:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace ConsoleApplication58
{
    internal class Program
    {
        private class A
        {

        }

        private static bool F<T>(T a, T b) where T : class
        {
            return a == b;
        }

        private static bool F2(A a, A b)
        {
            return a == b;
        }

        private static void Main()
        {
            const int rounds = 100, n = 10000000;
            var a = new A();
            var fList = new List<TimeSpan>();
            var f2List = new List<TimeSpan>();
            for (int i = 0; i < rounds; i++)
            {
                // Test generic
                GCClear();
                bool res;
                var sw = new Stopwatch();
                sw.Start();
                for (int j = 0; j < n; j++)
                {
                    res = F(a, a);
                }
                sw.Stop();
                fList.Add(sw.Elapsed);

                // Test not-generic
                GCClear();
                bool res2;
                var sw2 = new Stopwatch();
                sw2.Start();
                for (int j = 0; j < n; j++)
                {
                    res2 = F2(a, a);
                }
                sw2.Stop();
                f2List.Add(sw2.Elapsed);
            }
            double f1AverageTicks = fList.Average(ts => ts.Ticks);
            Console.WriteLine("Elapsed for F = {0} \t ticks = {1}", fList.Average(ts => ts.TotalMilliseconds),
                              f1AverageTicks);
            double f2AverageTicks = f2List.Average(ts => ts.Ticks);
            Console.WriteLine("Elapsed for F2 = {0} \t ticks = {1}", f2List.Average(ts => ts.TotalMilliseconds),
                  f2AverageTicks);
            Console.WriteLine("Not-generic method is {0} times faster, or on {1}%", f1AverageTicks/f2AverageTicks,
                              (f1AverageTicks/f2AverageTicks - 1)*100);
            Console.ReadKey();
        }

        private static void GCClear()
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
        }
    }
}

Windows 7, .NET 4.5, Visual Studio 2012, release, optimized, without attaching.

x64

Elapsed for F = 23.68157         ticks = 236815.7
Elapsed for F2 = 1.701638        ticks = 17016.38
Not-generic method is 13.916925926666 times faster, or on 1291.6925926666%

x86

Elapsed for F = 6.713223         ticks = 67132.23
Elapsed for F2 = 6.729897        ticks = 67298.97
Not-generic method is 0.997522398931217 times faster, or on -0.247760106878314%

And I've got new magic: x64 is three times faster...

PS: My target platform is x64.

Have you checked the IL? The 2nd example might be inlined by the compiler. — Matthew, Jun 25 '13 at 20:58
I get 00:00:00.2080244 and 00:00:00.0071957 using a Stopwatch in Release build without debugger. — dtb, Jun 25 '13 at 20:58
i've used .net 4.5, and it inlines both methods; dtb, i think if you add [MethodImpl(MethodImplOptions.AggressiveInlining)] - you ll have the same results — Dmitry, Jun 25 '13 at 21:02
This is heavily dependent on the system: for example, [on Mono you get nearly identical running times](http://ideone.com/q3jE31). You may want to add an untimed call to `F(a, a);` before starting the first timer to see if it even things out. — Sergey Kalinichenko, Jun 25 '13 at 21:02
IL for F2: `ldarg.0, ldarg.1, ceq, ret`. IL for F: `ldarg.0, box !!T, ldarg.1, box !!T, ceq, ret`. So the answer is because **the arguments get boxed.** But why do they get boxed? — dtb, Jun 25 '13 at 21:03
@Matthew you can't see inlining on IL level. That's a JIT trick. That's assembly. — Steven, Jun 25 '13 at 21:04
Is there a performance differance between "var a = new A();" and "A a = new A();"? — frugi, Jun 25 '13 at 21:06
@frugi: no, it's syntactic sugar, the compiler doesn't care. — Dan Puzey, Jun 25 '13 at 21:08
@dtb: are you sure they're boxed even with `where T : class`? — Dan Puzey, Jun 25 '13 at 21:08
Got 27ms for the first one and 27ms for the second, using Stopwatch in Release build with a warm-up. In debug (which we shouldn't care about) I got 1177 and 1139ms. Also, the in F (a, a) is redundant according to ReSharper. — Pierre-Luc Pineault, Jun 25 '13 at 21:09
@DanPuzey: Yes. That surprises me as well. I'm using Visual Studio 2012 and ILspy. — dtb, Jun 25 '13 at 21:09
@dtb: You're looking at IL, which doesn't matter. You need to look at the native code the JIT generates. — Ben Voigt, Jun 25 '13 at 21:17
I find it surprising that you try to spot differences in IL code whereas the generated assembly code is identical. Any performance difference you will measure is by definition a measurement error. — Alois Kraus, Jun 26 '13 at 05:39
The factor 13 speed difference is another measurement error. The non generic method was inlined while the generic method was not in x64. The generated assembly code is identical. If you do not look deeper you will uncover one mystery over the other but you will not learn why your application is behaving differently. — Alois Kraus, Jun 26 '13 at 09:49
Does swapping the order makes a difference? The last commands are usually executed much faster because of jit compilation and because of caching. — Willem Van Onsem, Jun 01 '15 at 18:49

Alois Kraus · Answer 1 · 2013-06-25T22:12:29.563

I did make some changes to your code to measure perf correctly.

Use Stopwatch
Execute Release Mode
Prevent Inlining.
Use GetHashCode() to do some real work
Look at the generated Assembly code

Here is the code:

class A
{
}

[MethodImpl(MethodImplOptions.NoInlining)]
static bool F<T>(T a, T b) where T : class
{
    return a.GetHashCode() == b.GetHashCode();
}

[MethodImpl(MethodImplOptions.NoInlining)]
static bool F2(A a, A b)
{
    return a.GetHashCode() == b.GetHashCode();
}

static int Main(string[] args)
{
    const int Runs = 100 * 1000 * 1000;
    var a = new A();
    bool lret = F<A>(a, a);
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < Runs; i++)
    {
        F<A>(a, a);
    }
    sw.Stop();
    Console.WriteLine("Generic: {0:F2}s", sw.Elapsed.TotalSeconds);

    lret = F2(a, a);
    sw = Stopwatch.StartNew();
    for (int i = 0; i < Runs; i++)
    {
        F2(a, a);
    }
    sw.Stop();
    Console.WriteLine("Non Generic: {0:F2}s", sw.Elapsed.TotalSeconds);

    return lret ? 1 : 0;
}

During my tests the non generic version was slightly faster (.NET 4.5 x32 Windows 7). But there is practically no measurable difference in speed. I would say the are both equal. For completeness here is the assembly code of the generic version: I got the assembly code via the debugger in Release mode with JIT optimizations enabled.The default is to disable JIT optimizations during debugging to make setting breakpoints and variables inspection easier.

Generic

static bool F<T>(T a, T b) where T : class
{
        return a.GetHashCode() == b.GetHashCode();
}

push        ebp 
mov         ebp,esp 
push        ebx 
sub         esp,8 // reserve stack for two locals 
mov         dword ptr [ebp-8],ecx // store first arg on stack
mov         dword ptr [ebp-0Ch],edx // store second arg on stack
mov         ecx,dword ptr [ebp-8] // get first arg from stack --> stupid!
mov         eax,dword ptr [ecx]   // load MT pointer from a instance
mov         eax,dword ptr [eax+28h] // Locate method table start
call        dword ptr [eax+8] //GetHashCode // call GetHashCode function pointer which is the second method starting from the method table
mov         ebx,eax           // store result in ebx
mov         ecx,dword ptr [ebp-0Ch] // get second arg
mov         eax,dword ptr [ecx]     // call method as usual ...
mov         eax,dword ptr [eax+28h] 
call        dword ptr [eax+8] //GetHashCode
cmp         ebx,eax 
sete        al 
movzx       eax,al 
lea         esp,[ebp-4] 
pop         ebx 
pop         ebp 
ret         4

Non Generic

static bool F2(A a, A b)
{
  return a.GetHashCode() == b.GetHashCode();
}

push        ebp 
mov         ebp,esp 
push        esi 
push        ebx 
mov         esi,edx 
mov         eax,dword ptr [ecx] 
mov         eax,dword ptr [eax+28h] 
call        dword ptr [eax+8] //GetHashCode
mov         ebx,eax 
mov         ecx,esi 
mov         eax,dword ptr [ecx] 
mov         eax,dword ptr [eax+28h] 
call        dword ptr [eax+8] //GetHashCode
cmp         ebx,eax 
sete        al 
movzx       eax,al 
pop         ebx 
pop         esi 
pop         ebp 
ret

As you can see the generic version looks slightly more inefficient due to more stack memoy operations which are not perfect but in reality the difference is not measurable since all is fitting into the L1 cache of the processor which makes the memory operations less costly compared to the pure register operations of the non generic version. I would suspect that the non generic version should perform a little better in real world if you need to pay for real memory access not coming from any CPU cache.

For all practical purposes these both methods are identical. You should look at some other place for real world performance gains. I would first look at the data access patterns and used data structures. Algorithmic changes tend to bring much more perf gain than such low level stuff.

Edit1: If you want to use == then you will find

00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  cmp         ecx,edx // Check for reference equality 
00000005  sete        al 
00000008  movzx       eax,al 
0000000b  pop         ebp 
0000000c  ret         4

both methods produce exactly the same machine code. Any difference you did measure are your measurement errors.

thanks, but i ve interested in how to fast my app, i dont want to use GetHashCode, in this case its slow — Dmitry, Jun 25 '13 at 22:06
See my edit. The generated assembly code for both methods is identical. All previous posters speculating about the differences in IL code and other influencing factors did never look at the acutally executed code. There is no difference at assembly level! Any perf difference is simply a measurement error. — Alois Kraus, Jun 25 '13 at 22:14

score 7 · Answer 2 · edited May 23 '17 at 10:29

Your testing method is flawed. There are a few big problems with how you did it.

First, you did not provide a "warm-up". In .NET the first time you access something it will be slower than subsequent calls so it can load up any needed assemblies. If you are going to perform tests like this you must do each function at least once or the first test to run will suffer a large penalty. Go ahead and swap the order, you will likely see the opposite results.

Second DateTime is only accurate to 16ms, so when comparing two times you have a +/- error of 32 ms. The difference between the two results are 21 ms, well within the experimental error. You must use a more accurate timer like the Stopwatch class.

Lastly, don't do artificial tests like this. They don't show you any useful information other than bragging rights for one class or another. Instead learn to use a Code Profiler. This will show you what is slowing down your code and you can make informed decisions on how to solve the problem instead of "guessing" that not using a templated class will make your code faster.

Here is a example code that shows how it "should" be done:

using System;
using System.Diagnostics;

namespace Sandbox_Console
{
    class A
    {
    }

    internal static class Program
    {
        static bool F<T>(T a, T b) where T : class
        {
            return a == b;
        }

        static bool F2(A a, A b)
        {
            return a == b;
        }

        private static void Main()
        {
            var a = new A();
            Stopwatch st = new Stopwatch();

            Console.WriteLine("warmup");
            st.Start();
            for (int i = 0; i < 100000000; i++)
                F<A>(a, a);
            Console.WriteLine(st.Elapsed);

            st.Restart();
            for (int i = 0; i < 100000000; i++)
                F2(a, a);
            Console.WriteLine(st.Elapsed);

            Console.WriteLine("real");
            st.Restart();
            for (int i = 0; i < 100000000; i++)
                F<A>(a, a);
            Console.WriteLine(st.Elapsed);

            st.Restart();
            for (int i = 0; i < 100000000; i++)
                F2(a, a);
            Console.WriteLine(st.Elapsed);

            Console.WriteLine("Done");
            Console.ReadLine();
        }

    }
}

And here are the results:

warmup
00:00:00.0297904
00:00:00.0298949
real
00:00:00.0296838
00:00:00.0297823
Done

Swapping the order of the last two the first one is always shorter, so effectively they are the "same time" as it is within the experimental error.

Good advice, but not the answer. I verified the results using Stopwatch, warm-up, etc. See my comment above. — dtb, Jun 25 '13 at 21:08
@dtb I have to disagree with your results then, I just posted my results and I get 0.1ms difference and the first one always wins. you did something wrong that made that first one run super slow. — Scott Chamberlain, Jun 25 '13 at 21:22
warmup
00:00:00.2611852
00:00:00.0265555
real
00:00:00.2700937
00:00:00.0181213
win7, vs 12, .net 4.5, x64 — Dmitry, Jun 25 '13 at 21:24
What are you using for `A`? a class (like mine) or something like a int (which is really a struct)? Also are you running in release mode with no debugger attached? Having a debugger attached will greatly affect the results because it disables a lot of the built in optimizers dealing with garbage collection (because what if you paused the program and wanted to see what the value of a variable was in the watch window, you can't let a variable go out of scope as easily when a debugger is attached). — Scott Chamberlain, Jun 25 '13 at 21:26
@Scott Chamberlain your code, your A, release, without attach, — Dmitry, Jun 25 '13 at 21:34
Interesting.... Re-run your code as x86 and see what happens ;). I have no idea why building as x64 would slow down one of the two so much. — Scott Chamberlain, Jun 25 '13 at 21:37
Running the code as x86 changes the results to the same that you're seeing (both values roughly the same, and slower than x64). Interesting. — dtb, Jun 25 '13 at 21:40

score 6 · Answer 3 · answered Jun 25 '13 at 21:24

6

Stop worrying about timing, worry about correctness.

Those methods are not equivalent. One of them uses class A's operator== and the other uses object's operator==.

answered Jun 25 '13 at 21:24

Ben Voigt

277,958
43
419
720

F uses object's operator==, and F2 uses class A's operator==, right? – dtb Jun 25 '13 at 21:27
@dtb: Yes, the generic has to call using `System.Object`'s operators. – Ben Voigt Jun 25 '13 at 21:27
6

So unless class A overloads operator==, both are using object's operator== and thus *are* equivalent. – dtb Jun 25 '13 at 21:29
@dtb Well, if `A` or any of it's super types overloads `operator ==`. If `object` is the most derived type in the inheritance chain that overloads `==` then it will use that. – Servy Jun 25 '13 at 21:34
1

@dtb: There's nothing in this question that tells me whether `class A` has an `operator==`... and for at least some possible generic arguments, the result will be different. – Ben Voigt Jun 25 '13 at 21:40
I didn't take the question being about a need to worry about the performance, but as being curious about what goes on under the hood. It's quite surprising that two methods that both test for reference equality perform so differently; the case where A overrides == is not interesting here. I've looked at the x64 assembly code generated for both methods and cannot spot any big difference, but maybe I' doing it wrong. – dtb Jun 25 '13 at 21:47

score 3 · Answer 4 · edited Apr 22 '16 at 10:34

3

Two things:

You are benchmarking with DateTime.Now. Use Stopwatch instead.
You are running code that is not under normal circumstances. JIT is most likely affecting the first run, making your first method slower.

If you switch the order of your tests (i.e. test the non-generic method first), does your result reverse? I would suspect so. When I plugged your code into LINQPad, and then copied it so that it ran both tests twice, the execution times for the second iteration were within a few hundred ticks of each other.

So, in answer to your question: yes, someone knows why. It's because your benchmark is inaccurate!

edited Apr 22 '16 at 10:34

Peter Mortensen

30,738
21
105
131

answered Jun 25 '13 at 21:05

Dan Puzey

33,626
4
73
96

Order doesn't change nothing. Problem is with boxing operation in generic method – nirmus Jun 25 '13 at 21:18
7

@nirmus: There is no boxing here. For reference types, the `box` instruction does nothing at all. It will be removed by the JIT. – Ben Voigt Jun 25 '13 at 21:26

score 3 · Answer 5 · edited May 23 '17 at 12:34

3

I rewrote your test code:

var stopwatch = new Stopwatch();
var a = new A();

stopwatch.Reset();
stopwatch.Start();
for (int i = 0; i < 100000000; i++)
    F<A>(a, a);
stopwatch.Stop();
Console.WriteLine(stopwatch.ElapsedMilliseconds);

stopwatch.Reset();
stopwatch.Start();
for (int i = 0; i < 100000000; i++)
    F2(a, a);
stopwatch.Stop();
Console.WriteLine(stopwatch.ElapsedMilliseconds);

Swapping the order doesn't change anything.

CIL for generic method:

L_0000: nop
L_0001: ldarg.0
L_0002: box !!T
L_0007: ldarg.1
L_0008: box !!T
L_000d: ceq
L_000f: stloc.0
L_0010: br.s L_0012
L_0012: ldloc.0
L_0013: ret

And for non-generic:

L_0000: nop
L_0001: ldarg.0
L_0002: ldarg.1
L_0003: ceq
L_0005: stloc.0
L_0006: br.s L_0008
L_0008: ldloc.0
L_0009: ret

So the boxing operation is the reason of your time difference. The question is why the boxing operation is added. Check it, Stack Overflow question Boxing when using generics in C#

edited May 23 '17 at 12:34

Community

1
1

answered Jun 25 '13 at 21:17

nirmus

4,913
9
31
50

3

But how do the `box` instructions affect the code generated by the JIT (which is to say, the code that the processor actually executes)? – Ben Voigt Jun 25 '13 at 21:19
Yes, how can i stop boxing in templates? – Dmitry Jun 25 '13 at 21:36
1

@Dmitry: C# doesn't have templates. .NET generics are nothing like C++ templates (at least in relation to this question). – Ben Voigt Jun 25 '13 at 21:37

score 2 · Answer 6 · edited Apr 22 '16 at 10:44

I've done performance analysis in a professional capacity several times in my career, and have a couple of observations.

First, the test is way too short to be valid. My rule of thumb is that a performance test should run for 30 minutes or so.
Second, it's important to run the test many times, to get a range of timings.
Third, I'm surprised the compiler didn't optimize away the loops, since the function results are not used and the functions being called have no side effects.
Fourth, micro benchmarks are often misleading.

I once worked on a compiler team that had a big audacious performance goal. One build introduced an optimization that eliminated several instructions for a particular sequence. It should have improved performance, but instead the performance of one benchmark fell dramatically. We were running on hardware with a direct mapped cache. It turns out that the code for the loop and the function called in the inner loop occupied the same cache line with the new optimization in place, but did not with the prior generated code. In other words, that benchmark was really a memory benchmark, and entirely dependent on memory cache hits and misses, whereas the authors thought they had written a computational benchmark.

score 1 · Answer 7 · edited Apr 22 '16 at 10:38

It seems more fair, no?:D

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace ConsoleApplication58
{
    internal class Program
    {
        private class A
        {

        }

        private static bool F<T>(T a, T b) where T : class
        {
            return a == b;
        }

        private static bool F2(A a, A b)
        {
            return a == b;
        }

        private static void Main()
        {
            const int rounds = 100, n = 10000000;
            var a = new A();
            var fList = new List<TimeSpan>();
            var f2List = new List<TimeSpan>();
            for (int i = 0; i < rounds; i++)
            {
                //test generic
                GCClear();
                bool res;
                var sw = new Stopwatch();
                sw.Start();
                for (int j = 0; j < n; j++)
                {
                    res = F(a, a);
                }
                sw.Stop();
                fList.Add(sw.Elapsed);

                //test not-generic
                GCClear();
                bool res2;
                var sw2 = new Stopwatch();
                sw2.Start();
                for (int j = 0; j < n; j++)
                {
                    res2 = F2(a, a);
                }
                sw2.Stop();
                f2List.Add(sw2.Elapsed);
            }
            double f1AverageTicks = fList.Average(ts => ts.Ticks);
            Console.WriteLine("Elapsed for F = {0} \t ticks = {1}", fList.Average(ts => ts.TotalMilliseconds),
                              f1AverageTicks);
            double f2AverageTicks = f2List.Average(ts => ts.Ticks);
            Console.WriteLine("Elapsed for F2 = {0} \t ticks = {1}", f2List.Average(ts => ts.TotalMilliseconds),
                  f2AverageTicks);
            Console.WriteLine("Not-generic method is {0} times faster, or on {1}%", f1AverageTicks/f2AverageTicks,
                              (f1AverageTicks/f2AverageTicks - 1)*100);
            Console.ReadKey();
        }

        private static void GCClear()
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
        }
    }
}

On my laptop i7-3615qm, generic is faster than not-generic.

See http://ideone.com/Y1GIJK.

What compile options did you use? The filename suggests it's a Debug build, without optimization, which isn't interesting. — Ben Voigt, Jun 25 '13 at 21:18
For fair testing i've posted it on ideone, you can test it yourself. What's about optimization - i didn't use it because compiler can not compile a cycle with a variable that isn't used anywhere. And also you can copypaste this code in your environment and test as your heart desires — Psilon, Jun 25 '13 at 21:25
on my laptop, result: Not-generic method is 13.9610133964644 times faster, or on 1296.10133964644% — Dmitry, Jun 25 '13 at 21:32
that's strange... you didn't change N? increasing N increase perfomance of generics... как-то странно, ну не может в 13 раз отличаться... — Psilon, Jun 25 '13 at 21:44
did'nt change N, but i ve use x64. interesting on x86 result genetic/non the same, but 6 times larger — Dmitry, Jun 25 '13 at 21:51
спасибо, я добавил ваш код в пост, отличается походу из-за платформы x64/x86 — Dmitry, Jun 25 '13 at 22:08

score 1 · Answer 8 · edited Apr 22 '16 at 10:42

I think it's the answer for my question, but what magic can I use to deny boxing?

If your goal is only to compare, you can do this:

    public class A : IEquatable<A> {
        public bool Equals( A other ) { return this == other; }
    }
    static bool F<T>( IEquatable<T> a, IEquatable<T> b ) where T : IEquatable<T> {
        return a==b;
    }

This will avoid the boxing.

As for the major timing deviation, I think everyone already established there was a problem with how you set up the stopwatch. I use a different technique where if I want to remove the loop itself from the time result, I take an empty baseline and just subtract that from the difference of time. It's not perfect, but it produces a fair result and doesn't slow from starting and stopping the timer over and over.

Generic vs not-generic performance in C#

8 Answers8

Linked