
The question is simple: of the two code snippets below, why is the top one roughly 15% faster (measured over the entire "flow" of the application) than the one below it?

It is important to note that `m_shouldThrow` is always set to false in both tests and does not change during the test.

Measurements were made using BenchmarkDotNet on .NET Core 2.1.

Faster code:

private bool EnsureReadSize(int bitCount)
{
    if (m_offset + (uint)bitCount <= m_fullBitLength)
        return m_isValid = true;

    ThrowException();

    return m_isValid = false;
}

private void ThrowException()
{
    if (m_shouldThrow)
        throw new InvalidOperationException("Inner buffer is exceeded.");
}

Slower code:

private bool EnsureReadSize(int bitCount)
{
    if (m_offset + (uint)bitCount <= m_fullBitLength)
        return m_isValid = true;

    if (m_shouldThrow)
        throw new InvalidOperationException("Inner buffer is exceeded.");

    return m_isValid = false;
}

Edit: The following code also runs just as fast as the first snippet:

private bool EnsureReadSize(int bitCount)
{
    if (m_offset + (uint)bitCount <= m_fullBitLength)
        return m_isValid = true;

    if (m_shouldThrow)
        throw new InvalidOperationException();

    return m_isValid = false;
}
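For context, a minimal sketch of how such a comparison could be set up with BenchmarkDotNet is shown below. The surrounding class, field values, and benchmark names are hypothetical; only the two `EnsureReadSize` bodies mirror the snippets above.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Hypothetical harness: field initializers are invented so the happy path
// (no throw) is always taken, matching the conditions described above.
public class EnsureReadSizeBench
{
    private long m_offset;
    private long m_fullBitLength = 1_000_000;
    private bool m_isValid;
    private bool m_shouldThrow; // always false during the benchmark

    [Benchmark(Baseline = true)]
    public bool WithThrowHelper() => EnsureReadSizeFast(32);

    [Benchmark]
    public bool WithInlineThrow() => EnsureReadSizeSlow(32);

    private bool EnsureReadSizeFast(int bitCount)
    {
        if (m_offset + (uint)bitCount <= m_fullBitLength)
            return m_isValid = true;
        ThrowException();
        return m_isValid = false;
    }

    private void ThrowException()
    {
        if (m_shouldThrow)
            throw new InvalidOperationException("Inner buffer is exceeded.");
    }

    private bool EnsureReadSizeSlow(int bitCount)
    {
        if (m_offset + (uint)bitCount <= m_fullBitLength)
            return m_isValid = true;
        if (m_shouldThrow)
            throw new InvalidOperationException("Inner buffer is exceeded.");
        return m_isValid = false;
    }

    public static void Main() => BenchmarkRunner.Run<EnsureReadSizeBench>();
}
```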
Dennis19901
  • What about sticking a breakpoint in that method, running the application and opening Debug -> Windows -> Disassembly window in the Visual Studio? Then you could clearly see what the compiler actually generated. – Thinko Feb 24 '21 at 19:03
  • Debug output is going to be different than Release, but regardless; I already analyzed the Release output with the latest version of Roslyn. The only difference is a different constructor call for the exception, and an extra ldstr argument that directly loads the constant string. However, both of these occur AFTER a "ret" instruction. The only thing I can imagine is that the JIT loads this string on every call to EnsureReadSize, even if it's not used. – Dennis19901 Feb 24 '21 at 19:09
  • 2
    You can use the Disassembly window in Release mode too. Actually, it only makes sense to do it in Release mode :) – Thinko Feb 24 '21 at 19:11
  • Do you get the same results if m_shouldThrow is not constant? Just speculating, but you may be seeing some sort of compiler optimisation where it can see that the ThrowException method will never do anything and so just leaves it out. – Plasmadog Feb 24 '21 at 19:46
  • I wonder if [branch prediction](https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array) is having an effect – Charlieface Feb 24 '21 at 21:31
  • @Plasmadog m_shouldThrow is not constant. It's set to false and doesn't change during this test. If set to constant however, both pieces of code run equally fast. – Dennis19901 Feb 24 '21 at 23:40
  • @Charlieface Not expecting that to be the case. Unless it would predict the branch incorrectly for hundreds of millions of consecutive runs every time. – Dennis19901 Feb 24 '21 at 23:42
  • @Dennis19901 How many runs is I think irrelevant, branch prediction doesn't have any kind of learning built in AFAIK *edit I take that back, it does seem to have learning*. It is certainly possible that the exact ASM code generated by the JIT compiler is causing branch prediction to go one way rather than the other. Perhaps try flipping the condition `if(!m_shouldThrow) return ...; throw ...;` – Charlieface Feb 24 '21 at 23:54
  • @Charlieface Just gave that a shot. Makes no difference at all. Either removing the string, or putting the string in a different method makes the difference. – Dennis19901 Feb 25 '21 at 00:00

1 Answer


It probably comes down to code size and inlining rules. The first function is smaller and does not contain a throw, so it may be eligible for inlining into the calling method. This won't show up in a microbenchmark of just this method, since the inlining will not be performed there, but in real code it can have an impact, especially if the method is in an inner loop.

This issue contains more detailed discussion: https://github.com/dotnet/runtime/issues/4381

The assembly code generated by SharpLab:

C.EnsureReadSize1(Int32)
    L0000: push edi
    L0001: push esi
    L0002: mov esi, ecx
    L0004: mov eax, [esi+4]
    L0007: mov ecx, eax
    L0009: sar ecx, 0x1f
    L000c: add edx, eax
    L000e: adc ecx, 0
    L0011: mov eax, [esi+8]
    L0014: mov edi, eax
    L0016: sar edi, 0x1f
    L0019: cmp eax, edx
    L001b: sbb edi, ecx
    L001d: jl short L002b
    L001f: mov byte ptr [esi+0xc], 1
    L0023: mov eax, 1
    L0028: pop esi
    L0029: pop edi
    L002a: ret
    L002b: mov ecx, esi
    L002d: call dword ptr [0x1a44c70c]
    L0033: mov byte ptr [esi+0xc], 0
    L0037: xor eax, eax
    L0039: pop esi
    L003a: pop edi
    L003b: ret

C.ThrowException()
    L0000: push esi
    L0001: cmp byte ptr [ecx+0xd], 0
    L0005: jne short L0009
    L0007: pop esi
    L0008: ret
    L0009: mov ecx, 0x6005e28
    L000e: call 0x05b630d0
    L0013: mov esi, eax
    L0015: mov ecx, 1
    L001a: mov edx, 0x1a44c040
    L001f: call 0x72184e00
    L0024: mov edx, eax
    L0026: mov ecx, esi
    L0028: call System.InvalidOperationException..ctor(System.String)
    L002d: mov ecx, esi
    L002f: call 0x72178100
    L0034: int3

C.EnsureReadSize2(Int32)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: push edi
    L0004: push esi
    L0005: mov eax, [ecx+4]
    L0008: mov esi, eax
    L000a: sar esi, 0x1f
    L000d: add edx, eax
    L000f: adc esi, 0
    L0012: mov eax, [ecx+8]
    L0015: mov edi, eax
    L0017: sar edi, 0x1f
    L001a: cmp eax, edx
    L001c: sbb edi, esi
    L001e: jl short L002d
    L0020: mov byte ptr [ecx+0xc], 1
    L0024: mov eax, 1
    L0029: pop esi
    L002a: pop edi
    L002b: pop ebp
    L002c: ret
    L002d: cmp byte ptr [ecx+0xd], 0
    L0031: jne short L003d
    L0033: mov byte ptr [ecx+0xc], 0
    L0037: xor eax, eax
    L0039: pop esi
    L003a: pop edi
    L003b: pop ebp
    L003c: ret
    L003d: mov ecx, 0x6005e28
    L0042: call 0x05b630d0
    L0047: mov esi, eax
    L0049: mov ecx, 1
    L004e: mov edx, 0x1a44c040
    L0053: call 0x72184e00
    L0058: mov edx, eax
    L005a: mov ecx, esi
    L005c: call System.InvalidOperationException..ctor(System.String)
    L0061: mov ecx, esi
    L0063: call 0x72178100
    L0068: int3

Edit

Use a longer run to get more consistent results; the difference between the two results was about the same as the overhead cost of running the benchmark. I changed the number of inner iterations to 4096.

Use a throw helper which is not inlined, as this helps the containing method be inlined.
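A minimal sketch of that throw-helper pattern (class and method names are illustrative, not from the original code): keeping the throw and its string constant out of the hot method makes the hot method small enough for the JIT to consider inlining it at call sites.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical throw helper: the cold path is isolated here, and NoInlining
// guarantees its body never bloats the caller.
internal static class ThrowHelper
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void ThrowBufferExceeded()
        => throw new InvalidOperationException("Inner buffer is exceeded.");
}
```

The hot method then contains only the cheap check and a call to `ThrowHelper.ThrowBufferExceeded()` on the failure path.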

There is no need to use an inlining hint for the JIT if the method is small and does not contain a throw. In fact, the inlining hint does nothing for the version with the string constructor, but it does force inlining when the parameterless constructor is used.

It's perhaps a telling coincidence that the throw helpers in the runtime all seem to take only a constant enum as a parameter, which is then transformed into a string inside the throw helper.

Use the BenchmarkDotNet disassembly diagnoser to get an idea of when inlining actually happens.
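Attaching the diagnoser is a one-line attribute on the benchmark class; the report then includes the JIT-compiled assembly so you can see which calls were inlined. The benchmark body below is a placeholder, not code from the question.

```csharp
using BenchmarkDotNet.Attributes;

// DisassemblyDiagnoser prints the JIT'ed assembly for each benchmark;
// maxDepth controls how many levels of called methods are also disassembled.
[DisassemblyDiagnoser(maxDepth: 2)]
public class InliningBench
{
    [Benchmark]
    public int Sum()
    {
        int total = 0;
        for (int i = 0; i < 100; i++)
            total += i;
        return total;
    }
}
```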

Another blog entry with more information on how methods are considered for inlining: https://egorbo.com/how-inlining-works.html

Slugart
  • If I simply move the contents of "ThrowException" inside of "EnsureReadSize" and remove the string, it again runs equally fast as it does with the separate method. – Dennis19901 Feb 24 '21 at 23:43
  • After reading through the GitHub issue page, that's not what's happening here. When explicitly NOT inlining either the EnsureReadSize or the Throw method, the performance drops even more. *"This won't show up on a microbenchmark of just this method"* The benchmark isn't run over just this code. I even noted that this is regarding the entire flow of the software. It just happens that this is a very critical path of the code. – Dennis19901 Feb 25 '21 at 00:15
  • I mentioned the fact the perf difference would only be measurable in the context of the calling methods because without a reproducible benchmark any answer is going to be speculation. Is there any way you could provide one? – Slugart Feb 25 '21 at 05:50
  • That being said it is very strange/interesting that the string in the constructor of the exception would have an impact. – Slugart Feb 25 '21 at 05:51
  • Most certainly! https://github.com/DennisCorvers/BitSerializer/blob/master/Bench/SerializeBench.cs Line 69 is the Benchmark in question. It is currently implemented via the "slow" version as of commit 33. Just removing the string from the current implementation should give the performance difference. – Dennis19901 Feb 25 '21 at 08:22
  • I've updated my answer and have created a branch off my fork which I used to investigate the issue. https://github.com/MendelMonteiro/BitSerializer/tree/investigate-inlining. It's quite puzzling and I do not know why the JIT treats the two exceptions differently. – Slugart Feb 25 '21 at 19:55
  • @Dennis19901 I've tried to isolate the behaviour to a simpler test suite and the weirdness is even more apparent. Check out the code size and the .NET 5 results: https://gist.github.com/MendelMonteiro/d545ccb0afc360ad173b5c3d8e9ed664 – Slugart Feb 27 '21 at 13:20
  • Why are the non-inlined runs so much slower? For my own Benchmarks, AggressiveInlining or NoInlining makes no difference. And are you sure that code size parameter is correct? That seems extremely unlikely. – Dennis19901 Feb 27 '21 at 14:19
  • You can see the impact of the aggressive inlining hint on the assembly code generated. It seems like the string constructor prevents inlining when no hint is provided. I agree that the code size result looks weird, I'm not sure why it blows out for the default constructor. – Slugart Feb 27 '21 at 15:04