Performance issue with for loop on the initial run on .NET 7

Question

I'm working on a performance sensitive application and considering moving from .NET 6 to .NET 7.

During comparing these two versions I've found that .NET 7 is slower executing a for loop on the initial run.

Testing is done with two separate console applications with identical code, one on .NET 6 and the other on .NET 7, running in release mode, any CPU.

Test code:

using System.Diagnostics;

int size = 1000000;
Stopwatch sw = new();

//create array
float[] arr = new float[size];
for (int i = 0; i < size; i++)
    arr[i] = i;

Console.WriteLine(AppDomain.CurrentDomain.SetupInformation.TargetFrameworkName);

Console.WriteLine($"\nForLoop1");
ForLoop1();
ForLoop1();
ForLoop1();
ForLoop1();
ForLoop1();

Console.WriteLine($"\nForLoopArray");
ForLoopArray();
ForLoopArray();
ForLoopArray();
ForLoopArray();
ForLoopArray();

Console.WriteLine($"\nForLoop2");
ForLoop2();
ForLoop2();
ForLoop2();
ForLoop2();
ForLoop2();

void ForLoop1()
{
    sw.Restart();

    int sum = 0;
    for (int i = 0; i < size; i++)
        sum++;

    sw.Stop();
    Console.WriteLine($"{sw.ElapsedTicks} ticks ({sum})");
}

void ForLoopArray()
{
    sw.Restart();

    float sum = 0f;
    for (int i = 0; i < size; i++)
        sum += arr[i];

    sw.Stop();
    Console.WriteLine($"{sw.ElapsedTicks} ticks ({sum})");
}

void ForLoop2()
{
    sw.Restart();

    int sum = 0;
    for (int i = 0; i < size; i++)
        sum++;

    sw.Stop();
    Console.WriteLine($"{sw.ElapsedTicks} ticks ({sum})");
}

The console output for the .NET 6 version:

.NETCoreApp,Version=v6.0

ForLoop1
2989 ticks (1000000)
2846 ticks (1000000)
2851 ticks (1000000)
3180 ticks (1000000)
2841 ticks (1000000)

ForLoopArray
8270 ticks (4.9994036E+11)
8443 ticks (4.9994036E+11)
8354 ticks (4.9994036E+11)
8952 ticks (4.9994036E+11)
8458 ticks (4.9994036E+11)

ForLoop2
2842 ticks (1000000)
2844 ticks (1000000)
3117 ticks (1000000)
2835 ticks (1000000)
2992 ticks (1000000)

And the .NET 7 version:

.NETCoreApp,Version=v7.0

ForLoop1
19658 ticks (1000000)
2921 ticks (1000000)
2967 ticks (1000000)
3190 ticks (1000000)
3722 ticks (1000000)

ForLoopArray
20041 ticks (4.9994036E+11)
8342 ticks (4.9994036E+11)
9212 ticks (4.9994036E+11)
8501 ticks (4.9994036E+11)
9726 ticks (4.9994036E+11)

ForLoop2
14016 ticks (1000000)
3008 ticks (1000000)
2885 ticks (1000000)
2882 ticks (1000000)
2888 ticks (1000000)

As you can see, the .NET 6 timings are very similar, whereas the .NET 7 timings show an initial high value (19658, 20041 and 14016).

Fiddling with the environment variables DOTNET_ReadyToRun and DOTNET_TieredPGO just makes things worse.

Why is this and how can it be rectified?

will recommend using benchmarkdotnet even for simple benchmark like that — Drag and Drop, Dec 17 '22 at 10:47
Windows is a multiprocessing environment and just because the first loop is larger than other means that another process in windows is running. It doesn't mean your code nor the compiler is causing the differences. It looks like Microsoft may of fixed bugs in Net 6. Net is managed and there is a large overhead that is added to code when you call a method that adds exception handlers to prevent the computer from going blue screen. — jdweng, Dec 17 '22 at 10:57
So you are concerned about the initial warming up that has a duration of ~2 milliseconds, which is an one-time cost per process? That's what you want to optimize? — Theodor Zoulias, Dec 17 '22 at 10:58
@DragandDrop BenchmarkDotNet adds a layer of complexity and creates a synthetic result. It removes JIT issues and creates an optimal average. I've tested it with BenchmarkDotNet and it did not report the huge delay with the initial run. What I'm more interested in is the difference between .NET 6 and 7, and how the application actually behaves in real life. — jrw, Dec 17 '22 at 11:13
@jdweng It cannot be another process interfering because I've run the test more than ten times, and the difference in behavior between .NET 6 and 7 is very consistent. — jrw, Dec 17 '22 at 11:15
@TheodorZoulias It's not a one time cost, the delay happens three times in the test code. Normally, a method is not called multiple times in succession, so you don't get the benefit of faster execution. In other words, every time a method is called, .NET 7 will be 2-3 times slower than .NET 6. — jrw, Dec 17 '22 at 11:22
The initial delay is the overhead the managed code is adding to call the method. Once the child method is running you do not see the delay until you call another method. There must of been a bug in Net 6 where the managed code was not added. — jdweng, Dec 17 '22 at 11:25
You are right that it's not an one-time cost overall, but it seems more like an one-time cost per method. So if your application has 1,000 methods, and you invoke all of them at least once throughout the lifetime of the process, your benchmark indicates that you'll pay an additional cost of 1,000 x 2 milliseconds, or 2 seconds overall. More tests are needed in order to validate this hypothesis. — Theodor Zoulias, Dec 17 '22 at 11:41
@TheodorZoulias Your hypothesis is correct, I just tested it now. Only the initial method call is slow, all subsequent calls are fast. The subsequent calls don't need to be called in rapid succession, the program can do something else, and when the method is called again, it will be fast. I would still like to know if this is a relatively minor, standalone bug or if it's indicative of more serious performance issues hidden with .NET 7. — jrw, Dec 17 '22 at 12:11

Guru Stron · Accepted Answer · 2023-02-01T21:54:11.477

My guess would be that this can be connected to the new On-Stack Replacement feature introduced in .NET 7. Enabling DOTNET_JitDisasmSummary "on my machine" (Windows Powershell - $env:DOTNET_JitDisasmSummary=1) results in the following output:

ForLoop1
   9: JIT compiled Program:<<Main>$>g__ForLoop1|0_0(byref) [Tier0, IL size=118, code size=291]
  10: JIT compiled Program:<<Main>$>g__ForLoop1|0_0(byref) [Tier1-OSR @0x19, IL size=118, code size=571]
13420 ticks (1000000)
2431 ticks (1000000)
...

ForLoopArray
  11: JIT compiled Program:<<Main>$>g__ForLoopArray|0_1(byref) [Tier0, IL size=129, code size=339]
  12: JIT compiled Program:<<Main>$>g__ForLoopArray|0_1(byref) [Tier1-OSR @0x24, IL size=129, code size=609]
  13: JIT compiled System.SpanHelpers:SequenceCompareTo(byref,int,byref,int) [Tier1, IL size=632, code size=329]
19380 ticks (4.9994036E+11)
10694 ticks (4.9994036E+11)
...

ForLoop2
  14: JIT compiled Program:<<Main>$>g__ForLoop2|0_2(byref) [Tier0, IL size=118, code size=291]
  15: JIT compiled Program:<<Main>$>g__ForLoop2|0_2(byref) [Tier1-OSR @0x19, IL size=118, code size=549]
11720 ticks (1000000)
2549 ticks (1000000)
...

Setting DOTNET_TC_QuickJitForLoops to 0 (env:DOTNET_TC_QuickJitForLoops=0) "reverts" this behaviour (not sure why, because the docs state that default is false, maybe something was changed in .NET 7):

ForLoop1
   8: JIT compiled Program:<<Main>$>g__ForLoop1|0_0(byref) [Tier-0 switched to FullOpts, IL size=118, code size=577]
2590 ticks (1000000)
2535 ticks (1000000)
...

ForLoopArray
   9: JIT compiled Program:<<Main>$>g__ForLoopArray|0_1(byref) [Tier-0 switched to FullOpts, IL size=129, code size=618]
  10: JIT compiled System.SpanHelpers:SequenceCompareTo(byref,int,byref,int) [Tier1, IL size=632, code size=329]
10759 ticks (4.9994036E+11)
10816 ticks (4.9994036E+11)
...

ForLoop2
  11: JIT compiled Program:<<Main>$>g__ForLoop2|0_2(byref) [Tier-0 switched to FullOpts, IL size=118, code size=555]
2446 ticks (1000000)
2509 ticks (1000000)
...

Possibly related discussion on github

P.S.

If your code is performance-sensitive especially startup performance-sensitive possibly it is worth considering to look into Native AOT.

@jrw did you ever try this out on you actual application? If you did and it also shows significant slowdowns please consider opening an issue in the dotnet./runtime repo on github. — Andy Ayers, Jun 08 '23 at 02:14

Performance issue with for loop on the initial run on .NET 7

1 Answers1

Linked