
Say you're developing in a JIT-compiled language. Is there any performance downside to making your functions very large, in terms of the code size of the generated assembly?

I ask because I was looking through the source code of Buffer.MemoryCopy in C# the other day, which is obviously a very performance-sensitive method. It appears they use a large switch statement to specialize the function for all byte counts <= 16, resulting in some pretty gigantic generated assembly.
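To make the pattern concrete, here is a hedged C sketch of that kind of small-count specialization (hypothetical code, not the actual Buffer.MemoryCopy source): each case compiles to a fixed-size move, and the dense switch itself typically lowers to a jump table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of specializing a copy routine for small byte
 * counts. With a compile-time-constant length, each memcpy inlines to a
 * single fixed-width load/store pair; the switch becomes a jump table. */
static void small_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
    switch (len) {
    case 0:  return;
    case 1:  dst[0] = src[0];       return;
    case 2:  memcpy(dst, src, 2);   return;  /* one 16-bit move */
    case 4:  memcpy(dst, src, 4);   return;  /* one 32-bit move */
    case 8:  memcpy(dst, src, 8);   return;  /* one 64-bit move */
    case 16: memcpy(dst, src, 16);  return;  /* one 128-bit move */
    default: memmove(dst, src, len); return; /* general fallback */
    }
}
```

The payoff is that the specialized cases avoid the loop and branch overhead of the general path; the cost is exactly what the question asks about: a much larger body of generated code.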

Are there any cons, performance-wise, to this approach? For example, I noticed the glibc and FreeBSD implementations of memmove do not do this, even though C is AOT-compiled and so doesn't pay the cost of JIT compilation (which is one downside for C#: the JIT waits until the first call to compile a method, so the first invocation of a really long method will take longer).

What are the up/downsides to having a gigantic switch statement and increasing code size (other than the precompilation cost I just mentioned) for JIT-ed languages? Thanks. (I'm a bit new to assembly so please go easy on me :) )

James Ko
  • Large switches in .NET are generally compiled to jump tables (definitely the case with int switches). Sure, the compilation will take longer, but the method will execute faster (than if you used multiple if statements, for example). I can't comment on the specifics of those functions, but comparing it to C is not worthwhile; optimised C will smash .NET out of the water performance-wise with, I'd venture to say, any equivalent function. – starlight54 Jul 20 '16 at 02:10
  • @starlight54 Thank you for your input. I'm aware the code compiles to a jump table (I can see it at the `jmp rax`). I'm just asking in general, for any language, what the point of trying to keep the code size small is (especially with JIT-compiled ones since the generated code is never actually persisted to disk). – James Ko Jul 20 '16 at 02:21
  • I've personally never heard of any scenario where the actual code character count is a problem to be fixed. I think the main reasons for having less code are not wasting your time writing unnecessary characters, 'less code, less bugs' (although I'm not too sure on that one :D), and readability (but that can fall either way). – starlight54 Jul 20 '16 at 02:41
  • Code size matters to cache efficiency. But only executing code and a lot of that code doesn't. In practice the *len* argument is not a random number. – Hans Passant Jul 20 '16 at 09:44
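To illustrate the jump-table lowering mentioned in the comments, here is a hand-built equivalent in C using a function-pointer table (the handler names are made up; a compiler's version indexes a table of code addresses and performs a single indirect `jmp`, like the `jmp rax` observed in the question):

```c
#include <assert.h>

/* A dense switch lowers to roughly this shape: one bounds check
 * (covering the default case), one table lookup, one indirect jump. */
typedef int (*handler)(int);

static int h0(int x) { return x; }
static int h1(int x) { return x + 10; }
static int h2(int x) { return x + 20; }

static int dispatch(int tag, int x)
{
    static const handler table[] = { h0, h1, h2 };
    if ((unsigned)tag >= 3)      /* plays the role of `default:` */
        return -1;
    return table[tag](x);        /* single indirect call/jump */
}
```

The key property is that dispatch cost is constant regardless of how many cases there are, whereas an if-else chain grows linearly in comparisons.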

1 Answer


Assuming x86.

Fetching1 and decoding2 instructions is not free.

Similarly to the data cache, the CPU has a code cache; it is usually smaller, though, ranging from 8 KiB to 32 KiB.
Shorter code fits better in the I-cache, requiring fewer fetches from memory.

Fetching, however, is only half of the story.
The x86 is historically problematic when it comes to decoding, due to its (very) variable-length instructions. There have been, and still are, various patterns to follow and limitations to work around in order to achieve fast decoding.

Since the Core2 architecture, the CPU has other instruction caches that sit after the decoders3.
These caches hold already-decoded instructions, bypassing the limitations and latency of the previous stages.

To give a mental picture, I sketched the Haswell decoding unit4:

[Figure: sketch of the Haswell decoding unit]

Each arrow is a step in the data path that usually takes one clock cycle.
The dark shaded areas are where an instruction can be found.

The closer a cache is to the Out of Order core5 (meaning toward the bottom of the sketch), the faster an instruction in that cache can reach the core.
However, the closer the cache, the smaller it becomes, so reducing code size improves performance, especially for critical loops6.
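As a sketch of that last point, one common way to keep a hot loop compact is to move the rare path into a separate, non-inlined function, so the loop body stays small enough to live entirely in the decoded-instruction cache. The attributes and function names below are illustrative, and `noinline`/`cold` are GCC/Clang-specific:

```c
#include <assert.h>
#include <stddef.h>

/* Rare path, kept out of line so it doesn't bloat the hot loop. */
__attribute__((noinline, cold))
static long handle_overflow(long acc)
{
    return acc / 2;  /* arbitrary recovery action for the sketch */
}

static long sum_capped(const int *v, size_t n, long cap)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++) {      /* hot loop: a handful of */
        acc += v[i];                      /* instructions, cache-friendly */
        if (acc > cap)                    /* rare branch jumps out to */
            acc = handle_overflow(acc);   /* the cold, out-of-line code */
    }
    return acc;
}
```

Compilers apply the same idea on their own (hot/cold splitting), but the principle is the point here: the fewer bytes the critical loop occupies, the closer to the core it can be cached.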


I drew these conclusions based on the analysis of Agner Fog.


1 The act of reading instructions from memory.
2 The operation of converting an instruction into micro-operations (µops).
3 Pre-decoders for Core2, but still.
4 Peter, you are welcome to point out the mistakes :).
5 The part of the CPU that actually executes the instructions.
6 Loops that are meant to be executed often.

Margaret Bloom
  • And just to add to this from another perspective: the JIT is a compiler, so it takes some time; even for short and simple code, compiling is not trivial. Whether the JIT has to deal with 30 or 130 lines of C# is not that relevant: it will be slower with 130 lines, but compared to no compilation it will be a huge hit in both cases, the difference being unimportant then. With a JIT you may really get into trouble with some generated source; I have seen questions on SO where C++ generated code almost 1 MiB big was making the C++ compiler fail. A JIT would choke for sure too. – Ped7g Jul 20 '16 at 10:54
  • @Ped7g I actually retract the part about the JIT compilation overhead. I read more about the project, and it appears they use a tool called [crossgen](https://github.com/dotnet/coreclr/blob/master/Documentation/building/crossgen.md) to essentially pre-JIT their code to native images, since the BCL is so large. All of this is done before the app is actually run, so there's no extra cost (JIT-wise) for large methods. – James Ko Jul 22 '16 at 05:18