I have encountered a performance issue in .NET Core 2.1 that I am trying to understand. The code for this can be found here:

https://github.com/mike-eee/StructureActivation

Here is the relevant benchmark code, using BenchmarkDotNet:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    static void Main()
    {
        BenchmarkRunner.Run<Program>();
    }

    [Benchmark(Baseline = true)]
    public uint? Activated() => new Structure(100).SomeValue;

    [Benchmark]
    public uint? ActivatedAssignment()
    {
        var selection = new Structure(100);
        return selection.SomeValue;
    }
}

public readonly struct Structure
{
    public Structure(uint? someValue) => SomeValue = someValue;

    public uint? SomeValue { get; }
}

At the outset, I expected Activated to be faster because it does not store a local variable. I have always understood that a local incurs a performance penalty, since space for it must be located and reserved within the current stack frame.

However, when running the tests, I get the following results:

// * Summary *

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.285 (1803/April2018Update/Redstone4)
Intel Core i7-4820K CPU 3.70GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.1.402
  [Host]     : .NET Core 2.1.4 (CoreCLR 4.6.26814.03, CoreFX 4.6.26814.02), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.4 (CoreCLR 4.6.26814.03, CoreFX 4.6.26814.02), 64bit RyuJIT


              Method |     Mean |     Error |    StdDev | Scaled |
-------------------- |---------:|----------:|----------:|-------:|
           Activated | 4.700 ns | 0.0128 ns | 0.0107 ns |   1.00 |
 ActivatedAssignment | 3.331 ns | 0.0278 ns | 0.0260 ns |   0.71 |

The activated structure (without storing a local variable) is roughly 30% slower.

For reference, here is the IL courtesy of ReSharper's IL Viewer:

.method /*06000002*/ public hidebysig instance valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32> 
  Activated() cil managed 
{
  .custom /*0C00000C*/ instance void [BenchmarkDotNet/*23000002*/]BenchmarkDotNet.Attributes.BenchmarkAttribute/*0100000D*/::.ctor() 
    = (01 00 01 00 54 02 08 42 61 73 65 6c 69 6e 65 01 ) // ....T..Baseline.
    // property bool 'Baseline' = bool(true)
  .maxstack 1
  .locals /*11000001*/ init (
    [0] valuetype StructureActivation.Structure/*02000003*/ V_0
  )

  // [14 31 - 14 59]
  IL_0000: ldc.i4.s     100 // 0x64
  IL_0002: newobj       instance void valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32>/*1B000001*/::.ctor(!0/*unsigned int32*/)/*0A00000F*/
  IL_0007: newobj       instance void StructureActivation.Structure/*02000003*/::.ctor(valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32>)/*06000005*/
  IL_000c: stloc.0      // V_0
  IL_000d: ldloca.s     V_0
  IL_000f: call         instance valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32> StructureActivation.Structure/*02000003*/::get_SomeValue()/*06000006*/
  IL_0014: ret          

} // end of method Program::Activated

.method /*06000003*/ public hidebysig instance valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32> 
  ActivatedAssignment() cil managed 
{
  .custom /*0C00000D*/ instance void [BenchmarkDotNet/*23000002*/]BenchmarkDotNet.Attributes.BenchmarkAttribute/*0100000D*/::.ctor() 
    = (01 00 00 00 )
  .maxstack 2
  .locals /*11000001*/ init (
    [0] valuetype StructureActivation.Structure/*02000003*/ selection
  )

  // [19 4 - 19 39]
  IL_0000: ldloca.s     selection
  IL_0002: ldc.i4.s     100 // 0x64
  IL_0004: newobj       instance void valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32>/*1B000001*/::.ctor(!0/*unsigned int32*/)/*0A00000F*/
  IL_0009: call         instance void StructureActivation.Structure/*02000003*/::.ctor(valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32>)/*06000005*/

  // [20 4 - 20 31]
  IL_000e: ldloca.s     selection
  IL_0010: call         instance valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32> StructureActivation.Structure/*02000003*/::get_SomeValue()/*06000006*/
  IL_0015: ret          

} // end of method Program::ActivatedAssignment

Upon inspection, Activated has two newobj whereas ActivatedAssignment only has one, which might be contributing to the difference between the two benchmarks.

My question is: is this expected? I am trying to understand why the benchmark with less code is actually slower than the one with more code. Any guidance/recommendations to ensure that I am following best practices would be greatly appreciated.

Ian Kemp
Mike-E
  • FWIW, "local variables" can be entirely eliminated with JIT. MSIL does not translate directly to 'performance efficiency'. – user2864740 Sep 29 '18 at 05:52
  • Ah, are you saying that this could be a JIT-related issue, @user2864740? While I had this question tagged for .NET Core, (and the results clearly show the runtime used), I have updated the issue to reflect that this is indeed occurring in .NET Core 2.1. Since .NET Core 2.1 has had a remarkable focus on performance, this only increases my suspicions (and confusion) around this issue. – Mike-E Sep 29 '18 at 06:07
  • It is pretty hard to develop an intuition for code like this. What you can't see is that it is very slow in both cases; the jitter optimizer gives up on the standard optimizations because you use a mutable struct type. uint? (aka `Nullable<uint>`) is ugly because its HasValue field needs to be assigned, and the optimizer throws up its hands because it can't reason through the possible side effects. It is quite important that you also do this with a plain uint and compare; it makes you think twice about using nullable types in perf-critical code. Allow the method to be inlined and put a for-loop around it. – Hans Passant Sep 29 '18 at 10:04
  • Funny enough @HansPassant I noticed that if I used non-nullable struct (rather than a nullable) on a method that I was calling it sped the results up considerably. It does seem that any time a nullable struct is in play there is a 10ns hit on the results in my case. It would be super handy to have tooling/analysis that pointed these types of issues out rather than spending days on days in trial/error and ultimately having to brave StackOverflow to see if any pointers can be had here. Thank you for the /information/insight. – Mike-E Sep 29 '18 at 14:01
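
The comparison suggested in the comments above (the same shape with a plain uint instead of uint?) could be sketched roughly as follows. PlainStructure and NullableVersusPlain are hypothetical names that do not appear in the original repository, and no results are claimed here; this is only a starting point for measuring the difference on your own machine.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Same struct as in the question.
public readonly struct Structure
{
    public Structure(uint? someValue) => SomeValue = someValue;

    public uint? SomeValue { get; }
}

// Hypothetical non-nullable counterpart, added only for comparison.
public readonly struct PlainStructure
{
    public PlainStructure(uint someValue) => SomeValue = someValue;

    public uint SomeValue { get; }
}

public class NullableVersusPlain
{
    static void Main() => BenchmarkRunner.Run<NullableVersusPlain>();

    // Wraps a Nullable<uint>, so HasValue and Value must both be written.
    [Benchmark(Baseline = true)]
    public uint? NullableField() => new Structure(100).SomeValue;

    // Wraps a bare uint, which gives the JIT a much simpler value to track.
    [Benchmark]
    public uint PlainField() => new PlainStructure(100).SomeValue;
}
```

If the plain version turns out dramatically faster, that would point at the nullable wrapper rather than the local variable as the dominant cost.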

1 Answer

It's a bit more clear what's happening if you look at the JITted assembly from your methods:

Program.Activated()
L0000: sub rsp, 0x18
L0004: xor eax, eax              // Initialize Structure to {0}
L0006: mov [rsp+0x10], rax       // Store to stack
L000b: mov eax, 0x64             // Load literal 100
L0010: mov edx, 0x1              // Load literal 1
L0015: xor ecx, ecx              // Initialize SomeValue to {0}
L0017: mov [rsp+0x8], rcx        // Store to stack
L001c: lea rcx, [rsp+0x8]        // Load pointer to SomeValue from stack
L0021: mov [rcx], dl             // Set SomeValue.HasValue to 1
L0023: mov [rcx+0x4], eax        // Set SomeValue.Value to 100
L0026: mov rax, [rsp+0x8]        // Load SomeValue's value from stack
L002b: mov [rsp+0x10], rax       // Store it to a different location on stack
L0030: mov rax, [rsp+0x10]       // Return it from that location
L0035: add rsp, 0x18
L0039: ret

Program.ActivatedAssignment()
L0000: push rax
L0001: xor eax, eax              // Initialize SomeValue to {0}
L0003: mov [rsp], rax            // Store to stack
L0007: mov eax, 0x64             // Load literal 100
L000c: mov edx, 0x1              // Load literal 1
L0011: lea rcx, [rsp]            // Load pointer to SomeValue from stack
L0015: mov [rcx], dl             // Set SomeValue.HasValue to 1
L0017: mov [rcx+0x4], eax        // Set SomeValue.Value to 100
L001a: mov rax, [rsp]            // Return SomeValue
L001e: add rsp, 0x8
L0022: ret

Obviously, Activated() is doing more work, and that's why it's slower. What it boils down to is a lot of stack shuffling (all the references to rsp). I've commented the instructions as best I could, but the Activated() method is a bit convoluted because of the redundant movs. ActivatedAssignment() is much more straightforward.

Ultimately, you're not actually saving stack space by omitting the local variable. The variable has to exist at some point whether you give it a name or not. The IL code you pasted shows a local variable (they call it V_0) which is the temp created by the C# compiler since you didn't create it explicitly.

Where the two differ is that the version with the temp variable only reserves a single evaluation stack slot (.maxstack 1), and it uses it for both the Nullable<T> and the Structure, hence the shuffling. In the version with the named variable, it reserves 2 slots (.maxstack 2).

Ironically, in the version with the pre-reserved local variable for selection, the JIT is able to eliminate the outer structure and deal only with its embedded Nullable<T>, making for cleaner/faster code.

I'm not sure you can deduce any best practices from this example, but I think it's easy enough to see that the C# compiler is the source of the perf difference. The JIT is smart enough to do the right thing with your struct but only if it looks a certain way coming in.
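
As a small sanity check, the sketch below confirms that the two source forms are semantically identical, so any timing difference comes from code generation rather than behavior. The class name Forms and its method names are made up for this illustration.

```csharp
using System;

// Same struct as in the question.
public readonly struct Structure
{
    public Structure(uint? someValue) => SomeValue = someValue;

    public uint? SomeValue { get; }
}

public static class Forms
{
    // Expression form: the C# compiler introduces an unnamed temporary
    // (shown as V_0 in the IL from the question).
    public static uint? Inline() => new Structure(100).SomeValue;

    // Named-local form: the constructor is called directly on the
    // address of the local variable.
    public static uint? ViaLocal()
    {
        var selection = new Structure(100);
        return selection.SomeValue;
    }

    public static void Main()
    {
        // Both forms yield the same nullable value.
        Console.WriteLine(Inline() == ViaLocal()); // prints True
    }
}
```

In a hot path where this pattern matters, the named-local form is the one that produced the tighter assembly in the listing above, at least on the runtime version measured in the question.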

saucecontrol
  • Wow, thank you for the informative and thorough answer! I'm still feeling my way around disassembly, let alone JIT, so I appreciate the comments for sure. As it seems the compiler (Roslyn) is what is responsible for this situation, my remaining thought here is: do you feel this should be brought up on Roslyn's repo? As such, I suppose when I said "best practices," I was aiming for tooling and environmental guidance as well. For instance, if you know of any Roslyn analyzers that point out these sorts of issues, a pointer to them would be greatly appreciated. Thanks again! – Mike-E Sep 29 '18 at 10:05
  • That's a good question. It certainly wouldn't hurt to let the Roslyn team know they could improve the IL in this case to help the JIT out. Worst they can say is no ;). I can't think of any tools that would have preemptively warned you here, but if you haven't tried it, I recommend sharplab.io for exploring the relationship between C#, IL, and Asm. – saucecontrol Sep 29 '18 at 17:38
  • Here's a link to the setup I used for your sample: https://sharplab.io/#v2:EYLgHgbALANALiAhgZwLYB8ACAGABJgRgG4BYAKEwGZ8AmfAgdnIG9zd36J8pcBZRAJYA7ABQBKNh1ZkOs3ADdEAJ1wAHXAF5cQgKYB3eg3GkZc9qoB0AQQDGcAYrg6AJsclnLt+45dXkyAQBzIVQdITg3U3YAX3J3fGoAV2E4AH5cLwdEJ1cxTQA+XBFdAwBlOCVEu0SlHRECbGwxMQtSgHtQgDVEABtEnRN4qlxk8PTMn2c/AODQ8PF46TN2RRVkHR6dOwE2oU1tfVxyyura+saxE2X8Blx1ze3d1o6dbr6B+NiyL/Jh2sRnLsegBPO4VKpwI7g046FhDajHCE1OqjNJ3F5vfp5DSFdpdXr9fbIDEEj5keEjFLpPGvUm4Zi4QI6OBEXBfaJAA= – saucecontrol Sep 29 '18 at 17:38
  • Excellent! I had no idea you could derive JIT from a webpage. :) That is very much what I am interested in adding to my resources. One more factor I will consider before braving StackOverflow. Thanks again. – Mike-E Oct 01 '18 at 04:39
  • FWIW, I have posted an issue to Roslyn's repo here: https://github.com/dotnet/roslyn/issues/30284 – Mike-E Oct 03 '18 at 11:38
  • "I think it's easy enough to see that the C# compiler is the source of the perf difference." Why specifically do you believe the C# compiler is to blame here? The IL emitted is legal and representative of what the source code asked for. The runtime is the one choosing to optimize one path a specific way here, not C#. – JaredPar Oct 05 '18 at 16:32
  • I only say that because the IL itself is different. Since Roslyn creates a temp local variable, it could have emitted the same code for that as it does when the local is explicit. – saucecontrol Oct 05 '18 at 20:17
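
For inspecting the JITted assembly locally rather than on sharplab.io, BenchmarkDotNet ships a disassembly diagnoser that can be attached to the benchmark class. The sketch below uses only the parameterless form of the attribute, since its optional parameters differ between BenchmarkDotNet versions; the class name DisassembledBenchmarks is hypothetical, and the output you get depends on your runtime and platform.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Same struct as in the question.
public readonly struct Structure
{
    public Structure(uint? someValue) => SomeValue = someValue;

    public uint? SomeValue { get; }
}

// Asks BenchmarkDotNet to export disassembly for each benchmark
// alongside the usual timing summary.
[DisassemblyDiagnoser]
public class DisassembledBenchmarks
{
    static void Main() => BenchmarkRunner.Run<DisassembledBenchmarks>();

    [Benchmark(Baseline = true)]
    public uint? Activated() => new Structure(100).SomeValue;

    [Benchmark]
    public uint? ActivatedAssignment()
    {
        var selection = new Structure(100);
        return selection.SomeValue;
    }
}
```

Having the assembly next to the timings makes it easier to tell whether a difference comes from the IL shape, as discussed above, or from a JIT change between runtime versions.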