24

Why is a Func<> created from an Expression<Func<>> via .Compile() considerably slower than a Func<> declared directly?

I just changed from using a Func<IInterface, object> declared directly to one created from an Expression<Func<IInterface, object>> in an app I am working on, and I noticed that performance went down.

I ran a small test, and the Func<> created from an Expression takes almost double the time of a Func<> declared directly.

On my machine the Direct Func<> takes about 7.5 seconds and the Expression<Func<>> takes about 12.6 seconds.
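As a rough sanity check on these figures, the gap can be converted into a cost per call. The sketch below assumes the 300,000,000 iterations used in the test code; the class and method names are hypothetical and exist only for this illustration.

```csharp
using System;

public static class OverheadEstimate
{
    // Extra cost per invocation, in nanoseconds, given two total run times
    // (in seconds) measured over the same number of iterations.
    public static double ExtraNanosecondsPerCall(
        double directSeconds, double compiledSeconds, long iterations)
    {
        return (compiledSeconds - directSeconds) / iterations * 1e9;
    }

    public static void Main()
    {
        // Timings quoted in the question; iteration count taken from the test loop.
        double ns = ExtraNanosecondsPerCall(7.5, 12.6, 300000000);
        Console.WriteLine("Extra cost per call: ~{0:F0} ns", ns);
    }
}
```

With the numbers quoted here, that works out to roughly 17 ns of extra cost per invocation.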

Here is the test code I used (running .NET 4.0):

// Direct
Func<int, Foo> test1 = x => new Foo(x * 2);

int counter1 = 0;

Stopwatch s1 = new Stopwatch();
s1.Start();
for (int i = 0; i < 300000000; i++)
{
 counter1 += test1(i).Value;
}
s1.Stop();
var result1 = s1.Elapsed;



// Expression . Compile()
Expression<Func<int, Foo>> expression = x => new Foo(x * 2);
Func<int, Foo> test2 = expression.Compile();

int counter2 = 0;

Stopwatch s2 = new Stopwatch();
s2.Start();
for (int i = 0; i < 300000000; i++)
{
 counter2 += test2(i).Value;
}
s2.Stop();
var result2 = s2.Elapsed;



public class Foo
{
 public Foo(int i)
 {
  Value = i;
 }
 public int Value { get; set; }
}

How can I get the performance back?

Is there anything I can do to get the Func<> created from the Expression<Func<>> to perform like one declared directly?

Gabe
MartinF
  • Interesting question; I actually get closer to 4x the performance for the direct case. – Marc Gravell Nov 18 '10 at 03:34
  • (my timing is in release, at command line, with full GC before both tests) – Marc Gravell Nov 18 '10 at 03:35
  • There doesn't seem to be a difference if the Func is a Func<int, int> – MartinF Nov 18 '10 at 03:37
  • It might be revealing to reflect and read out the IL generated for each mechanism. – cdhowie Nov 18 '10 at 03:51
  • @cdhowie I can't get dnp to build a disassembly for this one :| http://dotnetpad.net/ViewPaste/_Vx1bk-DVkqxCcSU1HE8tw# – jcolebrand Nov 18 '10 at 04:03
  • I will post the IL as a community wiki answer. – cdhowie Nov 18 '10 at 04:14
  • When I run a Debug build under a debugger, they both run for 13 seconds. I think it's safe to say that the same IL is being generated for both cases. – Gabe Nov 18 '10 at 05:06
  • The IL generated differs in only 2 bytes. The first is because of the argument (ldarg.0 vs ldarg.1), and the other is because the token is different, as they are hosted in different modules. – Michael B Nov 18 '10 at 06:04
  • Thanks for all the good answers. Choosing which one to accept is not always easy :) Gabe also wrote an example of how to improve the performance, which was also part of my question, and has the most upvotes, so I have accepted his answer. – MartinF Nov 24 '10 at 15:55
  • Something must have changed in the 8 years since this question was asked, because I am now getting the exact opposite results: compiled delegates are consistently faster than declared counterparts. – Mike-E May 02 '18 at 20:54

6 Answers

19

As others have mentioned, the overhead of calling a dynamic delegate is causing your slowdown. On my computer that overhead is about 12ns with my CPU at 3GHz. The way to get around that is to load the method from a compiled assembly, like this:

var ab = AppDomain.CurrentDomain.DefineDynamicAssembly(
             new AssemblyName("assembly"), AssemblyBuilderAccess.Run);
var mod = ab.DefineDynamicModule("module");
var tb = mod.DefineType("type", TypeAttributes.Public);
var mb = tb.DefineMethod(
             "test3", MethodAttributes.Public | MethodAttributes.Static);
expression.CompileToMethod(mb);
var t = tb.CreateType();
var test3 = (Func<int, Foo>)Delegate.CreateDelegate(
                typeof(Func<int, Foo>), t.GetMethod("test3"));

int counter3 = 0;
Stopwatch s3 = new Stopwatch();
s3.Start();
for (int i = 0; i < 300000000; i++)
{
    counter3 += test3(i).Value;
}
s3.Stop();
var result3 = s3.Elapsed;

When I add the above code, result3 is always just a fraction of a second higher than result1, for about a 1ns overhead.

So why even bother with a compiled lambda (test2) when you can have a faster delegate (test3)? Because creating the dynamic assembly is much more overhead in general, and only saves you 10-20ns on each invocation.
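The steps above can be wrapped in a reusable helper. This is only a sketch under stated assumptions: it targets .NET Framework 4.x (where AppDomain.DefineDynamicAssembly and LambdaExpression.CompileToMethod are available), the names ExpressionExtensions, CompileToStaticMethod, LambdaHost and Invoke are made up for illustration, and the expression must reference only public types and must not capture local variables, since such lambdas cannot be compiled into a static method of a separate assembly.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;
using System.Reflection.Emit;

public static class ExpressionExtensions
{
    // Hypothetical helper: compiles the lambda into a static method of a
    // freshly created dynamic assembly and returns a delegate bound to it.
    public static TDelegate CompileToStaticMethod<TDelegate>(
        this Expression<TDelegate> expression) where TDelegate : class
    {
        // One throwaway assembly per call keeps the sketch simple but is costly;
        // a real helper would probably reuse a single ModuleBuilder.
        AssemblyBuilder ab = AppDomain.CurrentDomain.DefineDynamicAssembly(
            new AssemblyName("DynamicLambdas_" + Guid.NewGuid().ToString("N")),
            AssemblyBuilderAccess.Run);
        ModuleBuilder mod = ab.DefineDynamicModule("module");
        TypeBuilder tb = mod.DefineType("LambdaHost", TypeAttributes.Public);
        MethodBuilder mb = tb.DefineMethod(
            "Invoke", MethodAttributes.Public | MethodAttributes.Static);

        // CompileToMethod emits the body and sets the method signature.
        expression.CompileToMethod(mb);

        Type t = tb.CreateType();
        return (TDelegate)(object)Delegate.CreateDelegate(
            typeof(TDelegate), t.GetMethod("Invoke"));
    }
}
```

Usage would then look like `Func<int, Foo> test3 = expression.CompileToStaticMethod();`. Because each call builds an assembly, the result should be cached and reused rather than recreated per invocation.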

Gabe
  • Really nice. I quickly wrapped this in an extension method and I got my "speed" back (increased by around 30-40%). Thanks! :) – MartinF Nov 18 '10 at 17:32
  • FYI, in .NET 4.5 I measure no difference between a compiled expression and a compile-to-method approach outlined above. – Nick Strupat Sep 08 '15 at 21:29
6

(This is not a proper answer, but is material intended to help discover the answer.)

Statistics gathered from Mono 2.6.7 - Debian Lenny - Linux 2.6.26 i686 - 2.80GHz single core:

      Func: 00:00:23.6062578
Expression: 00:00:23.9766248

So on Mono at least both mechanisms appear to generate equivalent IL.

This is the IL generated by Mono's gmcs for the anonymous method:

// method line 6
.method private static  hidebysig
       default class Foo '<Main>m__0' (int32 x)  cil managed
{
    .custom instance void class [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::'.ctor'() =  (01 00 00 00 ) // ....

    // Method begins at RVA 0x2204
    // Code size 9 (0x9)
    .maxstack 8
    IL_0000:  ldarg.0
    IL_0001:  ldc.i4.2
    IL_0002:  mul
    IL_0003:  newobj instance void class Foo::'.ctor'(int32)
    IL_0008:  ret
} // end of method Default::<Main>m__0

I will work on extracting the IL generated by the expression compiler.

cdhowie
  • My concern is that the Mono runtime is not similar enough to the .NET runtime for the comparison to be useful. – Gabe Nov 18 '10 at 04:55
  • Mono uses Reflection.Emit to compile C#, so it makes sense that the code generated by expression trees is as fast. – Michael B Nov 18 '10 at 05:06
  • @Michael: So are you saying that the compiled expression trees are slow on Mono, or that compiled assemblies are fast? – cdhowie Nov 18 '10 at 05:39
  • @Gabe: The IL for this method should be simple enough that both runtimes compile the anonymous method and the expression tree to the same IL. You can't get any more optimized than the above IL. I am not suggesting that I will be extracting the compiled IL from the expression tree on Mono, since that is not the target of this question. But the above IL should serve as a good reference for comparison. (Also, Mono is all I have available to me at the moment.) – cdhowie Nov 18 '10 at 05:40
  • I've revised my answer to show the IL of both expressions. I think Mono might actually have an advantage, as C# dynamic module token resolver does a worse job when you have a lot of types in a dynamic method as it does an iteration, rather than the mono compiler which does some form of look up. Don't quote me though. – Michael B Nov 18 '10 at 05:49
4

Ultimately, what it comes down to is that Expression<T> is not a precompiled delegate. It's only an expression tree. Calling Compile on a LambdaExpression (which is what Expression<T> actually is) generates IL code at runtime and creates something akin to a DynamicMethod for it.

If you just use a Func<T> in code, the compiler precompiles it just like any other delegate reference.

So there are 2 sources of slowness here:

  1. The initial compilation time to compile Expression<T> into a delegate. This is huge. If you're doing this for every invocation, definitely don't (but this isn't the case here, since you start your Stopwatch after you call Compile).

  2. It's basically a DynamicMethod after you call Compile. DynamicMethods (even strongly typed delegates for them) ARE in fact slower to execute than direct calls. Func<T>s resolved at compile time are direct calls. There are performance comparisons out there between dynamically emitted IL and compile-time emitted IL. Random URL: http://www.codeproject.com/KB/cs/dynamicmethoddelegates.aspx?msg=1160046

...Also, in your stopwatch test for the Expression<T>, you should start your timer when i = 1, not 0... I believe your compiled Lambda will not be JIT compiled until the first invocation, so there will be a performance hit for that first call.
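One way to keep first-call JIT cost out of the measurement is to invoke each delegate once before starting the timer. The following is a minimal sketch, not the harness used by anyone in this thread: the Bench class, its Measure method and the checksum parameter are hypothetical, and the full GC before timing follows the approach Marc Gravell mentions in the question comments.

```csharp
using System;
using System.Diagnostics;
using System.Linq.Expressions;

public class Foo
{
    public Foo(int i) { Value = i; }
    public int Value { get; set; }
}

public static class Bench
{
    // Times `iterations` calls of f, excluding the first (JIT-triggering) call.
    public static TimeSpan Measure(Func<int, Foo> f, int iterations, out int checksum)
    {
        f(0); // warm-up call so JIT compilation happens outside the timed region

        // Start from a clean heap so earlier garbage is not collected mid-run.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        int counter = 0;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            counter += f(i).Value; // wraps on overflow, as in the original test
        }
        sw.Stop();

        checksum = counter; // returned so the loop result is observably used
        return sw.Elapsed;
    }

    public static void Main()
    {
        const int iterations = 300000000;
        Func<int, Foo> direct = x => new Foo(x * 2);
        Expression<Func<int, Foo>> expression = x => new Foo(x * 2);
        Func<int, Foo> compiled = expression.Compile();

        int sum1, sum2;
        TimeSpan t1 = Measure(direct, iterations, out sum1);
        TimeSpan t2 = Measure(compiled, iterations, out sum2);
        Console.WriteLine("Direct:   {0} (checksum {1})", t1, sum1);
        Console.WriteLine("Compiled: {0} (checksum {1})", t2, sum2);
    }
}
```

Results are only meaningful from a release build run outside the debugger; as the comments on this answer note, the warm-up is unlikely to explain a gap of several seconds.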

nawfal
Jeff
  • While you're right about the stopwatch, it's irrelevant in this case because it takes only microseconds (maybe 3 on my computer) to JIT compile the lambda. – Gabe Nov 18 '10 at 05:14
  • True. It's still a well known fact that dynamically emitted methods invoke slower than precompiled ones. – Jeff Nov 18 '10 at 05:39
  • This is not always true, by the way! I've written delegates and actually rewritten them as expressions, because the compiled expressions performed nearly twice as fast. – Michael B Nov 18 '10 at 05:51
  • Michael B: JeffN825 was saying that the dynamic methods *invoke* slower (by about 12ns on my machine), not that they *execute* slower. In other words, the function call overhead is higher. – Gabe Nov 18 '10 at 06:33
  • @Jeff: The actual `Func` (not the lambda) will not be JIT compiled until its first invocation either. The CLR JIT-compiles methods the first time they are entered. – cdhowie Nov 18 '10 at 07:26
  • Yes the invoke being slower makes sense as it's invoking a 2 argument function vs a 1 argument function. Which means it has to push a little bit more on the stack. – Michael B Nov 18 '10 at 13:43
  • Michael B: I changed the lambda to `(x,y) => new Foo(x*y)` so that it would be a 2-argument function and did *not* notice the expected increase in execution time, so I suspect that your hypothesis is wrong. – Gabe Nov 18 '10 at 16:36
  • @Jeff, can you please explain how "The initial compilation time to compile Expression<T> into a delegate" slows down the execution when it's outside the loop and not included in the measurements? – Ark-kun Jan 11 '14 at 18:30
1

It is most likely because the first invocation of the code was not jitted. I decided to look at the IL and they are virtually identical.

Func<int, Foo> func = x => new Foo(x * 2);
Expression<Func<int, Foo>> exp = x => new Foo(x * 2);
var func2 = exp.Compile();
Array.ForEach(func.Method.GetMethodBody().GetILAsByteArray(), b => Console.WriteLine(b));

var mtype = func2.Method.GetType();
var fiOwner = mtype.GetField("m_owner", BindingFlags.Instance | BindingFlags.NonPublic);
var dynMethod = fiOwner.GetValue(func2.Method) as DynamicMethod;
var ilgen = dynMethod.GetILGenerator();


byte[] il = ilgen.GetType().GetMethod("BakeByteArray", BindingFlags.NonPublic | BindingFlags.Instance).Invoke(ilgen, null) as byte[];
Console.WriteLine("Expression version");
Array.ForEach(il, b => Console.WriteLine(b));

This code gets us the byte arrays and prints them to the console. Here is the output on my machine:

2
24
90
115
13
0
0
6
42
Expression version
3
24
90
115
2
0
0
6
42

And here is Reflector's version of the first function:

    L_0000: ldarg.0 
    L_0001: ldc.i4.2 
    L_0002: mul 
    L_0003: newobj instance void ConsoleApplication7.Foo::.ctor(int32)
    L_0008: ret 

There are only 2 bytes different in the entire method! The first is the first opcode: ldarg.0 (load the first argument) in the first method, but ldarg.1 (load the second argument) in the second. The difference arises because a delegate generated from an expression actually has a Closure object as its target. This can also factor in.
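The target object can be inspected directly on the two delegates. The sketch below is an illustration only: DelegateInfo and Describe are made-up names, and the exact output depends on the compiler and runtime version (newer C# compilers may also emit the direct lambda as an instance method on a compiler-generated class), so no expected output is shown.

```csharp
using System;
using System.Linq.Expressions;

public static class DelegateInfo
{
    // Summarises what a delegate is bound to: the runtime type of its
    // MethodInfo, whether that method is static, and the type of its target.
    public static string Describe(Delegate d)
    {
        string target = d.Target == null ? "(null)" : d.Target.GetType().FullName;
        return string.Format("method type: {0}, static: {1}, target: {2}",
            d.Method.GetType().Name, d.Method.IsStatic, target);
    }

    public static void Main()
    {
        Func<int, int> direct = x => x * 2;
        Expression<Func<int, int>> expr = x => x * 2;
        Func<int, int> compiled = expr.Compile();

        Console.WriteLine("direct:   " + Describe(direct));
        Console.WriteLine("compiled: " + Describe(compiled));
    }
}
```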

The next opcode for both is ldc.i4.2 (24), which loads the constant 2 onto the stack. After that comes the opcode for mul (90), then the newobj opcode (115). The next 4 bytes are the metadata token for the .ctor. They differ because the two methods are actually hosted in different assemblies: the method compiled from the expression lives in an anonymous assembly. Unfortunately, I haven't quite gotten to the point of figuring out how to resolve these tokens. The final opcode is 42, which is ret. Every CLI function must end with ret, even functions that don't return anything.
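The byte-by-byte reading above can be reproduced with a tiny decoder. This is a sketch that only knows the handful of opcodes appearing in this method (it is not a general IL disassembler), and the names IlDump and Decode are hypothetical. The 4-byte operand of newobj is read as a little-endian metadata token.

```csharp
using System;
using System.Collections.Generic;

public static class IlDump
{
    // Only the single-byte opcodes that occur in the dumps above.
    private static readonly Dictionary<byte, string> Names = new Dictionary<byte, string>
    {
        { 2, "ldarg.0" }, { 3, "ldarg.1" }, { 24, "ldc.i4.2" },
        { 90, "mul" }, { 115, "newobj" }, { 42, "ret" }
    };

    public static List<string> Decode(byte[] il)
    {
        List<string> lines = new List<string>();
        for (int i = 0; i < il.Length; i++)
        {
            string name;
            if (!Names.TryGetValue(il[i], out name))
            {
                lines.Add(string.Format("?? (0x{0:X2})", il[i]));
                continue;
            }
            if (il[i] == 115 && i + 4 < il.Length)
            {
                // newobj is followed by a 4-byte little-endian metadata token.
                int token = il[i + 1] | (il[i + 2] << 8) | (il[i + 3] << 16) | (il[i + 4] << 24);
                lines.Add(string.Format("{0} 0x{1:X8}", name, token));
                i += 4; // skip the operand bytes
            }
            else
            {
                lines.Add(name);
            }
        }
        return lines;
    }

    public static void Main()
    {
        // Bytes printed for the directly declared Func<int, Foo>.
        byte[] direct = { 2, 24, 90, 115, 13, 0, 0, 6, 42 };
        foreach (string line in Decode(direct))
        {
            Console.WriteLine(line);
        }
    }
}
```

For the first dump this prints ldarg.0, ldc.i4.2, mul, newobj 0x0600000D and ret, one per line. The high byte 0x06 of the token identifies a method definition, which is consistent with the operand being the Foo constructor.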

There are a few possibilities. The closure object may somehow be slowing things down, which might be true (but is unlikely). The jitter may not have jitted the method: since you were invoking it in rapid succession, it may not have had time to JIT that path and so invoked a slower one. The C# compiler in VS may also be emitting different calling conventions and MethodAttributes, which may act as hints to the jitter to perform different optimizations.

Ultimately, I would not even remotely worry about this difference. If you really are invoking your function 300 million times in the course of your application, and the difference incurred is 5 whole seconds, you're probably going to be OK.

nawfal
Michael B
  • Are you suggesting that it takes several seconds to JIT compile a function that contains 5 instructions? – Gabe Nov 18 '10 at 05:15
  • I'm with you that I shouldn't really care about the small difference. But when you are writing a piece of software that will be judged on its performance, where test cases like this will be used to measure it against competitors, it does matter when you rely heavily on delegates and suddenly see performance decrease by 30-40% compared to the direct approach. Luckily, having the Expression gives me the possibility to optimize what is going on in the lambda and make it even faster than the direct approach. – MartinF Nov 18 '10 at 16:42
  • A bigger thing to sell to higher-ups is that the Expression approach lets you implement patterns and delegates more easily than doing it by hand. So while we can all agree that hand-tuned C# might be better, if you need specific code for each type or, worse, each instance, writing the myriad cases by hand will be a huge development effort, whereas you can write a fairly well-optimized expression tree that dynamically generates code to tweak your behavior at runtime. – Michael B Nov 19 '10 at 05:31
1

Just for the record: I can reproduce the numbers with the code above.

One thing to note is that both delegates create a new instance of Foo for every iteration. This could be more important than how the delegates are created. Not only does that lead to a lot of heap allocations, but GC may also affect the numbers here.

If I change the code to

Func<int, int> test1 = x => x * 2;

and

Expression<Func<int, int>> expression = x => x * 2;
Func<int, int> test2 = expression.Compile();

The performance numbers are virtually identical (actually result2 is a little better than result1). This supports the theory that the expensive part is heap allocations and/or collections and not how the delegate is constructed.

UPDATE

Following the comment from Gabe, I tried changing Foo to be a struct. Unfortunately this yields more or less the same numbers as the original code, so perhaps heap allocation/garbage collection is not the cause after all.

However, I also verified the numbers for delegates of the type Func<int, int> and they are quite similar and much lower than the numbers for the original code.

I'll keep digging and look forward to seeing more/updated answers.

Brian Rasmussen
  • Thanks for your reply. I also noticed this behaviour, and wrote it as a comment to my own question. Nice blog, by the way :) – MartinF Nov 18 '10 at 15:43
  • I changed `Foo` from a class to a struct and noticed a 1-second decrease in times for both options, but otherwise the relative difference didn't decrease. I suspect that you may not be measuring what you think you are. – Gabe Nov 18 '10 at 16:30
  • @Gabe: I took the code from the question and changed the declarations as shown in my answer. I also made sure each delegate was called once before measuring time. I'll give structs a try and update my answer. – Brian Rasmussen Nov 18 '10 at 18:30
  • I think your numbers for `Func<int, int>` are lower because the JIT compiler can do some optimizations (inlining, enregistering) that aren't applicable to all cases. – Gabe Nov 18 '10 at 19:18
  • @Gabe: That could very well be it. Next step is to compare the JIT compiled code between the two. – Brian Rasmussen Nov 18 '10 at 19:29
0

I was interested in the answer by Michael B., so in each case I added an extra call before the stopwatch started. In debug mode the compiled version (case 2) was nearly twice as fast (6 seconds versus 10 seconds), and in release mode both versions were on par (the difference was about 0.2 seconds).

What strikes me is that, with JIT taken out of the equation, I got the opposite results from Martin.

Edit: Initially I missed Foo, so the results above are for a Foo with a field, not a property. With the original Foo the comparison is the same, only the times are bigger: 15 seconds for the direct Func, 12 seconds for the compiled version. Again, in release mode the times are similar; now the difference is about 0.5 seconds.

However, this indicates that if your expression is more complex, there will be a real difference even in release mode.

greenoldman