
Back in 2009 I posted this answer to a question about optimisations for nested try/catch/finally blocks.

Thinking about this again some years later, it seems the question could be extended to other control-flow constructs: not only try/catch/finally, but also if/else.

At each of these junctions, execution will follow one path. Code must be generated for both paths, obviously, but the order in which they're placed in memory and the number of jumps required to navigate through them will differ.

The order in which generated code is laid out in memory has implications for the miss rate of the CPU's instruction cache. Having the instruction pipeline stalled waiting for memory reads can really kill loop performance.
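To make the if/else case concrete, here's a minimal sketch (the method and names are mine, purely illustrative): the test compiles to a conditional branch, and whichever block is laid out as the fall-through costs no taken jump, while the other block needs a jump to reach and another to leave.

```csharp
static class LayoutSketch
{
    // Illustrative only: shows the two paths a single 'if' creates.
    static int ParseHeader(byte[] buffer)
    {
        if (buffer.Length < 4)
            return -1; // rare case: ideally emitted out of line, reached by a taken branch

        // common case: ideally the straight-line fall-through after the test,
        // sharing instruction cache lines with the test itself
        return buffer[0] | (buffer[1] << 8) | (buffer[2] << 16) | (buffer[3] << 24);
    }
}
```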

I don't think loops (for/foreach/while) are such a good fit, unless you expect the loop to have zero iterations more often than it has some, as the natural generation order seems pretty optimal.

Some questions:

  • In what ways do the available .NET JITs optimise for generated instruction order?
  • How much difference can this make in practice to common code? What about perfectly suited cases?
  • Is there anything the developer can do to influence this layout? What about meddling with the forbidden goto?
  • Does the specific JIT being used make much difference to layout?
  • Does the method inlining heuristic come into play here too?
  • Basically anything interesting related to this aspect of the JIT!

Some initial thoughts:

Moving catch blocks out of line is an easy job, as they're supposed to be the exceptional case by definition. Not sure this happens.
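One pattern that achieves this without any JIT help, and which I believe the BCL itself uses internally (a hedged sketch, with illustrative names): hoist the throw into a separate method and forbid inlining, so the cold instructions are emitted as a distinct method body rather than in the middle of the hot one.

```csharp
using System;
using System.Runtime.CompilerServices;

static class Guard
{
    // Hot path: a single test followed by straight-line code.
    public static int DoubleNonNegative(int value)
    {
        if (value < 0)
            ThrowNegative(); // call out to cold code
        return value * 2;
    }

    // NoInlining keeps the throw's code out of every caller.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void ThrowNegative()
    {
        throw new ArgumentOutOfRangeException("value");
    }
}
```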

For some loops I suspect you can increase performance non-trivially. However, in general I don't think it'll make that much difference.

I don't know how the JIT decides the order of generated code. In C on Linux you have likely(cond) and unlikely(cond), which you can use to tell the compiler which branch is the common path to optimise for. I'm not sure that all compilers respect these macros.

Instruction ordering is distinct from the problem of branch prediction, in which the CPU guesses (on its own, afaik) which branch will be taken in order to start the pipeline (oversimplified steps: decode, fetch operands, execute, write back) on instructions, before the execute step has determined the value of the condition variable.

I can't think of any way to influence this order in the C# language. Perhaps you can manipulate it a bit by using goto to jump to labels explicitly, as in the sketch below, but is this portable, and are there any other problems with it?
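A minimal sketch of that idea (illustrative names; whether the JIT preserves source order in the machine code is not guaranteed):

```csharp
static int Process(int x)
{
    if (x < 0)
        goto Error; // push the rare case to a label after the hot code

    // common path reads straight through with no taken branch
    return x + 1;

Error:
    // cold code gathered at the end of the method
    return -1;
}
```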

Perhaps this is what profile-guided optimisation is for. Do we have that in the .NET ecosystem, now or planned? Maybe I'll go and have a read about LLILC.
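The closest in-box runtime profiling hook I know of is System.Runtime.ProfileOptimization (available since .NET 4.5). To be clear, it records which methods get JITted and background-compiles them on later runs, so it targets startup time rather than code layout, but it shows the shape of profile-driven JITting in .NET (the paths here are illustrative):

```csharp
using System.Runtime;

static class Program
{
    static void Main()
    {
        // The profile root directory must already exist.
        ProfileOptimization.SetProfileRoot(@"C:\MyAppProfile");
        // Records methods JITted during this run; subsequent runs replay
        // the profile on a background thread (multicore JIT).
        ProfileOptimization.StartProfile("Startup.Profile");

        // ... application start-up continues ...
    }
}
```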

Drew Noakes
  • Well the good thing is that the JIT can optimize using runtime metrics. It can find out what path is better after a few loops. – Caramiriel Jun 24 '16 at 21:19
  • Some sources as reference: https://github.com/dotnet/coreclr/blob/0851b88a70d228ffb833b7795d18329f4383999a/src/jit/block.h#L271 / https://github.com/dotnet/coreclr/blob/775003a4c72f0acc37eab84628fcef541533ba4e/src/inc/corprof.idl#L3573 – Caramiriel Jun 24 '16 at 21:51
  • @Caramiriel yes, the JVM does that to great effect. Unfortunately, the .NET JITs are quite poor. No runtime profiling. – usr Jun 24 '16 at 22:02
  • It is just not the jitter's job. Moving code around requires using the mpgo.exe (managed profile guided optimization) tool and Ngen.exe to use the profile data. Accurate to +/- 2%, nobody actually uses it, somewhat evident from nobody at SO ever asking questions about it. I know it keeps the I-cache hot, that's easy to do. Whether it alters branches is a blind guess, Microsoft is not doing a heckofalot of bragging about it and it isn't covered in detail by a channel9 video. The payoff just isn't there, PGO gets you only the proverbial 15% improvement. – Hans Passant Jun 24 '16 at 22:42
  • Quick follow up. I modified my serialisation/deserialisation library to emit exceptional (unlikely) code *after* the normal straight-line code. Benchmarking shows a 14.3% improvement for deserialisation (during which a lot of validation occurs). Didn't need any PGO for that boost, just a hunch. – Drew Noakes Jul 08 '16 at 23:37
  • Further, I don't think my changes are anything that the compiler or JIT couldn't do automatically without breaking much sweat. The payoff only comes when the byte count of exception handling code sizes up against the non-exceptional code (i.e. it had little impact on serialisation for which almost no validation occurs). – Drew Noakes Jul 08 '16 at 23:39
  • @HansPassant turns out some CoreCLR maintainers think it is the jitter's job. They merged a pull request related to this [here](https://github.com/dotnet/coreclr/pull/6103) where they call what I'm referring to moving throw code to _cold blocks_. – Drew Noakes Aug 19 '16 at 21:47

1 Answer


The optimization you are referring to is called code layout optimization, which is defined as follows:

  • Those pieces of code that are executed close in time in the same thread should be close in the virtual address space so that they fit in a single or a few consecutive cache lines. This reduces cache misses.
  • Those pieces of code that are executed close in time in different threads should be close in the virtual address space so that they fit in a single or a few consecutive cache lines, as long as there is no self-modifying code. This gets lower priority than the previous one. This reduces cache misses.
  • Those pieces of code that are executed frequently (hot code) should be close in the virtual address space so that they fit in as few virtual pages as possible. This reduces page faults and working set size.
  • Those pieces of code that are rarely executed (cold code) should be close in the virtual address space so that they fit in as few virtual pages as possible. This reduces page faults and working set size.

Now to your questions.

In what ways do the available .NET JITs optimise for generated instruction order?

"Instruction order" is really a very general term. Many optimizations affect instruction order. I'll assume that you're referring to code layout.

JITters by design should take the minimum amount of time to compile code while at the same time producing high-quality code. To achieve this, they perform only the most important optimizations, the ones really worth spending time on. Code layout optimization is not one of them because, without profiling, it may not be beneficial. While a JITter could certainly perform profiling and dynamic optimization, there is a generally preferred way.

How much difference can this make in practice to common code? What about perfectly suited cases?

Code layout optimization by itself can improve overall performance typically by -1% (negative one) to 4%, which is enough to make compiler writers happy. I would like to add that it reduces energy consumption indirectly by reducing cache misses. The reduction in miss ratio of the instruction cache can be typically up to 35%.

Is there anything the developer can do to influence this layout? What about meddling with the forbidden goto?

Yes, there are numerous ways. I would like to mention the generally recommended one, which is mpgo.exe. Please do not use goto for this purpose. It's forbidden.
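For reference, the workflow looks roughly like this (the application name and paths are illustrative; check the tool's documentation for the exact switches): mpgo.exe runs a training scenario and writes profile-annotated copies of the assemblies, which Ngen.exe then precompiles using that data.

```
mpgo -scenario "C:\MyApp\MyApp.exe" -AssemblyList "C:\MyApp\MyApp.exe" -OutDir "C:\Optimized"
ngen install "C:\Optimized\MyApp.exe"
```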

Does the specific JIT being used make much difference to layout?

No.

Does the method inlining heuristic come into play here too?

Inlining can indeed improve code layout with respect to function calls. It's one of the most important optimizations and all .NET JITs perform it.
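The control you have over it is limited but real; a hedged sketch using the documented MethodImplOptions flags (method names are illustrative):

```csharp
using System.Runtime.CompilerServices;

static class InliningHints
{
    // A hint, not a command: relaxes the JIT's size heuristics, making it
    // more likely the body is copied into callers, which removes the call
    // and keeps the hot code contiguous.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int Square(int x) => x * x;

    // The opposite: guaranteed to stay out of callers, useful for cold code.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int RarePath(int x) => x - 1;
}
```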

Moving catch blocks out of line is an easy job, as they're supposed to be the exceptional case by definition. Not sure this happens.

Yes, it might be "easy", but what is the potential benefit? catch blocks are typically small in size (containing a call to a function that handles the exception). Handling this particular case of code layout does not seem promising. If you really care, use mpgo.exe.

I don't know how the JIT decides the order of generated code. In C on Linux you have likely(cond) and unlikely(cond), which you can use to tell the compiler which branch is the common path to optimise for.

Using PGO is much more preferable over using likely(cond) and unlikely(cond) for two reasons:

  1. The programmer might inadvertently make mistakes when placing likely(cond) and unlikely(cond) in the code, and this actually happens a lot. Making big mistakes while trying to manually optimize code is very typical.
  2. Adding likely(cond) and unlikely(cond) all over the code makes it less maintainable in the future. You'll have to make sure that these hints still hold every time you change the source code. In large code bases, this could be (or rather is) a nightmare.

Instruction ordering is distinct from the problem of branch prediction...

Assuming you are talking about code layout, yes they are distinct. But code layout optimization is usually guided by a profile which really includes branch statistics. Hardware branch prediction is of course totally different.

Maybe I'll go and have a read about LLILC.

While using mpgo.exe is the mainstream way of performing this optimization, you can also use LLILC, since LLVM supports profile-guided optimization as well. But I don't think you need to go that far.

Hadi Brais
  • Thanks very much for the detailed answer. And yes I was referring to what you refer to as code layout (I assume instruction ordering is at a finer level of granularity). In my case I don't think PGO will work as I'm generating serialisation/deserialisation code using Reflection.Emit. As I mentioned in the comments, I observed significantly greater perf improvements than you quote by moving catch blocks out of the hot path (14.3%) for deserialisation. However, I didn't move catch blocks but rather blocks that culminate in throws. For large methods with several throws, this might make sense. – Drew Noakes Nov 22 '16 at 22:58
  • @DrewNoakes The perf improvement I mentioned is from real world large programs and this single optimization cannot really do much. But I am sure you can design benchmarks that demonstrate greater improvements, but hopefully modeling your specific real world scenario. Indeed, mpgo.exe in its current form doesn't work with `Reflection.Emit`. I would like to add also that ProfileOptimization currently cannot handle `Reflection.Emit`. If you provide me with more details about how/when/what/who exactly you are using `Reflection.Emit`, I may suggest a reasonably easy solution. – Hadi Brais Nov 23 '16 at 04:07
  • [Here's the commit](https://github.com/drewnoakes/dasher/commit/c159d8f6030e1d3ad777a7edcadeb028b811cc94) in which I introduced the change to 'gather' all throwing blocks of code to be emitted at the end of the `DynamicMethod`'s IL stream. So, assuming deserialisation occurs successfully, the code takes a pretty linear path without too many jumps. That is, the branches to these deferred throw blocks are not taken. I should note that I made that commit after asking this question. – Drew Noakes Nov 23 '16 at 15:29
  • @DrewNoakes Thanks. I'll take a look. – Hadi Brais Nov 23 '16 at 15:36
  • @DrewNoakes Can you tell me on which microarchitecture and cache organization you got the 14.3% speedup? – Hadi Brais Nov 24 '16 at 04:58
  • On a Dell XPS 15 (9550) with Skylake i7-6700HQ @ 2.6GHz. Benchmarks involved deserialising a pre-serialised class with int/long/float/double/string/byte[10]/DateTime/TimeSpan/Guid properties from a stream, tested using [BenchmarkDotNet 0.9.7](https://github.com/dotnet/BenchmarkDotNet). – Drew Noakes Nov 24 '16 at 09:50