
I'm asking this question after a discussion that started (in the comments) on another StackOverflow question, and I'm intrigued to know the answer. Consider the following expression:

var objects = RequestObjects.Where(r => r.RequestDate > ListOfDates.Max());

Will there be any (performance) advantage to moving the evaluation of ListOfDates.Max() out of the Where clause in this case, or will (1) the compiler or (2) the JIT optimize it away?
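
To make the alternative concrete, the hoisted version would look something like this (maxDate is just a name I'm introducing):

var maxDate = ListOfDates.Max(); // evaluated exactly once
var objects = RequestObjects.Where(r => r.RequestDate > maxDate);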

I believe C# will only do constant folding at compile time, and it could be argued that ListOfDates.Max() cannot be known at compile time unless ListOfDates itself is somehow constant.
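
To illustrate the distinction I mean:

int seconds = 2 * 60 * 60;   // a literal expression: folded by the C# compiler to 7200 in the IL
var max = ListOfDates.Max(); // a method call on a (potentially mutable) list: evaluated at run time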

Perhaps there is another compiler (or JIT) optimization that makes sure that this is only evaluated once?

ChristopheD
  • The compiler probably cannot because it cannot prove that accessing the `r.RequestDate` property will not modify `ListOfDates`. The JIT might recognize this, but might not. There is no requirement to do so. – Raymond Chen Apr 07 '16 at 00:04
  • @RaymondChen I agree but I think it is even simpler; as long as `ListOfDates` is not known to be immutable, constant folding cannot be safely done. – Lucero Apr 07 '16 at 00:09
  • I think it is important to mention that ListOfDates is a List<DateTime> (or DateTime[]) and RequestObject.RequestDate is a variable of type DateTime. – Riad Baghbanli Apr 07 '16 at 17:59

1 Answer


Well, it's a bit of a complex answer.

There are two things involved here: (1) the compiler and (2) the JIT.

The compiler

Simply put, the compiler just translates your C# code to IL code. It's a pretty trivial translation in most cases, and one of the core ideas of .NET is that each function is compiled as an autonomous block of IL code.

So, don't expect too much from the C# -> IL compiler.

The JIT

That's... a bit more complicated.

The JIT compiler basically translates your IL code to assembler. It also contains an SSA-based optimizer. However, there's a time limit: we don't want to wait too long before our code starts to run. Basically, this means the JIT compiler doesn't do all the super cool stuff that would make your code go extremely fast, simply because that would cost too much time.

We can of course just check. Make sure VS optimizes the code while you run it (options -> debugger -> uncheck 'suppress [...]' and 'just my code'), compile in x64 release mode, put a breakpoint, and see what happens when you switch to the assembler view.

But hey, what's the fun in only having theory? Let's put it to the test. :)

static bool Foo(Func<int, int, int> foo, int a, int b)
{
    return foo(a, b) > 0;  // put breakpoint on this line.
}

public static void Test()
{
    int n = 2;
    int m = 2;
    if (Foo((a, b) => a + b, n, m)) 
    {
        Console.WriteLine("yeah");
    }
}

The first thing you should notice is that the breakpoint is hit. That already tells us the method isn't inlined; if it were, you wouldn't hit the breakpoint at all.
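
(As an aside, not something this test needs: if you want to be certain the JIT never inlines a method in experiments like this, you can mark it explicitly with MethodImplAttribute.)

using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.NoInlining)] // tell the JIT it must not inline this method
static bool Foo(Func<int, int, int> foo, int a, int b)
{
    return foo(a, b) > 0;
}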

Next, if you watch the assembler output, you'll notice a 'call' instruction using an address. Here's your function. On closer inspection, you'll notice that it's calling the delegate.

Now, basically this means that the call is not inlined, and therefore is not optimized to match the local (method) context. In other words, putting the code directly in your method instead of going through a delegate is probably faster.

On the other hand, the call is pretty efficient. Basically, the function pointer is simply passed and called. There's no vtable lookup, just a simple call. This means it probably beats calling a virtual member (e.g. IL callvirt). Still, static calls (IL call) should be even faster, since their targets are already known at compile time. Again, let's test, shall we?
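
For completeness, the supporting types look something like this (a minimal sketch; assume the static Sum helper lives in the same class as Test):

interface ISummer
{
    int Sum(int a, int b);
}

class Summer : ISummer
{
    public int Sum(int a, int b) { return a + b; }
}

// static helper, assumed to live in the same class as Test
static int Sum(int a, int b) { return a + b; }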

public static void Test()
{
    ISummer summer = new Summer();
    Stopwatch sw = Stopwatch.StartNew();
    int n = 0;
    for (int i = 0; i < 1000000000; ++i)
    {
        n = summer.Sum(n, i);
    }
    Console.WriteLine("Vtable call took {0} ms, result = {1}", sw.ElapsedMilliseconds, n);

    Summer summer2 = new Summer();
    sw = Stopwatch.StartNew();
    n = 0;
    for (int i = 0; i < 1000000000; ++i)
    {
        n = summer2.Sum(n, i); // call through the concrete type, not the interface
    }
    Console.WriteLine("Non-vtable call took {0} ms, result = {1}", sw.ElapsedMilliseconds, n);

    Func<int, int, int> sumdel = (a, b) => a + b;
    sw = Stopwatch.StartNew();
    n = 0;
    for (int i = 0; i < 1000000000; ++i)
    {
        n = sumdel(n, i);
    }
    Console.WriteLine("Delegate call took {0} ms, result = {1}", sw.ElapsedMilliseconds, n);

    sw = Stopwatch.StartNew();
    n = 0;
    for (int i = 0; i < 1000000000; ++i)
    {
        n = Sum(n, i);
    }
    Console.WriteLine("Static call took {0} ms, result = {1}", sw.ElapsedMilliseconds, n);
}

Results:

Vtable call took 2714 ms, result = -1243309312
Non-vtable call took 2558 ms, result = -1243309312
Delegate call took 1904 ms, result = -1243309312
Static call took 324 ms, result = -1243309312

The interesting thing here is actually the last test result. Remember that static calls (IL call) are completely deterministic, which makes them relatively easy for the compiler to optimize. If you inspect the assembler output, you'll find that the call to Sum is actually inlined. This makes sense. In fact, if you test it, just putting the code directly in the method is just as fast as the static call.
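
For completeness, 'just putting the code in the method' would be one more block inside Test, something like:

    sw = Stopwatch.StartNew();
    n = 0;
    for (int i = 0; i < 1000000000; ++i)
    {
        n = n + i; // the body of Sum, written out inline
    }
    Console.WriteLine("Hand-inlined took {0} ms, result = {1}", sw.ElapsedMilliseconds, n);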

A small remark about Equals

If you measure the performance of hash tables, something might seem fishy about my explanation: it appears as if IEquatable<T> makes things go faster.

Well, that's actually true. :-) Hash containers use IEquatable<T> to call Equals. Now, as we all know, all objects implement Equals(object o). So the containers can call either Equals(object) or Equals(T). The performance of the call itself is the same.

However, if you also implement IEquatable<T>, the implementation usually looks like this:

public override bool Equals(object o)
{
    var obj = o as MyType;                   // null if o is not a MyType
    return obj != null && this.Equals(obj);  // forward to the strongly typed Equals(MyType)
}

Furthermore, if MyType is a struct, the runtime also needs to box and unbox the argument. If the container just called IEquatable<T>.Equals(T) directly, none of these steps would be necessary. So even though Equals(object) appears slower, the overhead has nothing to do with the call itself.
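
For a struct, implementing both overloads looks roughly like this (a minimal sketch, assuming a single int field):

struct MyType : IEquatable<MyType>
{
    public int Value;

    // strongly typed: no boxing when called through IEquatable<MyType>
    public bool Equals(MyType other)
    {
        return Value == other.Value;
    }

    // object overload: the caller boxes the argument, and the cast unboxes it again
    public override bool Equals(object o)
    {
        return o is MyType && Equals((MyType)o);
    }

    public override int GetHashCode()
    {
        return Value;
    }
}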

Your questions

Will there be any (performance) advantage to moving the evaluation of ListOfDates.Max() out of the Where clause in this case, or will (1) the compiler or (2) the JIT optimize it away?

Yes, there will be an advantage. The compiler / JIT won't optimize it away.

I believe C# will only do constant folding at compile time, and it could be argued that ListOfDates.Max() cannot be known at compile time unless ListOfDates itself is somehow constant.

Actually, if you change the static call to n = 2 + Sum(n, 2), you'll notice that the assembler output contains a 4: since Sum is inlined, 2 + Sum(n, 2) becomes 2 + n + 2, which the optimizer folds to n + 4. This proves that the JIT optimizer does do constant folding. (That's quite obvious, actually, if you know how SSA-based optimizers work: constant folding and simplification passes run a few times.)

The function pointer itself isn't optimized. It might be in the future though.

Perhaps there is another compiler (or JIT) optimization that makes sure that this is only evaluated once?

As for 'another compiler', if you're willing to add 'another language', you can use C++. In C++ these kinds of calls are sometimes optimized away.

More interestingly, Clang is based on LLVM, and there are a few C# compilers for LLVM as well. I believe Mono has an option to optimize to LLVM, and CoreCLR was working on LLILC. While I haven't tested this, LLVM can definitely do these kinds of optimizations.

atlaste
  • The proof that `ListOfDates.Max()` is constant is very complicated. It requires, for example, proving that the property call `r.RequestDate` does not change the list. Given that the current JIT is really poor at optimizing anything that is not basic, this will not happen. The current JIT loads x twice from memory in `a.x + a.x`. It is that unsophisticated. – usr Apr 09 '16 at 22:15
  • Yeah, the JIT is pretty much optimized for warm-up time (the time it takes until it can execute the method). As a result, more complex optimizations are simply not performed... and I agree that this is a very complicated case to optimize - I don't even think modern C++ compilers would see through the enumerations, inheritance and data access patterns that are used in the OP's example. – atlaste Apr 11 '16 at 08:30
  • The Hotspot JVM JIT achieves faster startup time and higher code quality through tiering. This is not a technical problem. It is a problem of lacking investment. The .NET people are just used to this dismal state. – usr Apr 11 '16 at 10:04
  • @usr I'm just interested, what do you base this on? From the benchmarks that I've found online as well as the ones that I've constructed in the past, it really depends on the application. My personal experience is that it's overall roughly the same speed, with the exception that JVM seems to beat .NET for 'real' (vectorizable) calculations. Still, new insights on this subject are always interesting. – atlaste Apr 12 '16 at 08:34
  • I did extensive testing on the code quality of RyuJIT (like 50+ tests where I analyzed the generated code). Hotspot does many things that go beyond it. "Roughly the same speed" is true with maybe a 20% differential. You don't notice that difference in practice, but it's there. I find that vexing considering how many servers are running .NET (100M?). That's quite a lot of money wasted due to a sub-par JIT. And I don't understand the JIT team's strategy at all. They struggle to give adequate startup time and good code. JIT tiering achieves both. You don't even need Hotspot-level sophistication. – usr Apr 12 '16 at 09:06
  • In my mind they should have an interpreted tier 1 (super low startup time) and then re-JIT provably hot methods using a sophisticated backend, maybe based on VC or LLVM with all optimizations turned to 11. Almost all code is cold, so the 2nd tier throughput does not matter much. – usr Apr 12 '16 at 09:08
  • Hmm. My own benchmarks gave +/- 5% difference, sometimes in favor of .NET, sometimes in favor of the JVM, with both implementations heavily optimized. These (10+) tests reflected the critical path of our real-world application (a search engine). To be fair, I also implemented the same tests in heavily optimized C++, which clearly outperformed both, in most cases by a factor of 2-4. The irony here is that a literal translation gave roughly the same speed; most performance was gained by optimizing memory access. Still, I completely agree with your assessment; I would design it exactly as you describe. – atlaste Apr 12 '16 at 09:32
  • I have awarded the bounty, well deserved, thanks for the detailed writeup! – ChristopheD Apr 15 '16 at 13:16
  • @ChristopheD You're quite welcome; I always like questions about performance and optimization :-) – atlaste Apr 19 '16 at 06:55