33

I've already profiled, and am now looking to squeeze every last bit of performance out of my hot-spot.

I know about [MethodImplOptions.AggressiveInlining] and the ProfileOptimization class. Are there any others?
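For context, here's roughly how I'm using those two (the folder path, profile name and helper method below are just placeholders):

using System.Runtime;
using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        // Multicore/background JIT: record and replay a jitting profile between runs.
        ProfileOptimization.SetProfileRoot(@"C:\MyApp\JitProfiles");  // example folder
        ProfileOptimization.StartProfile("Startup.profile");          // example profile name

        // ... hot code ...
    }

    // Hint to the JIT that this method is worth inlining despite its size.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static int HotHelper(int x) => x * x;
}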


[Edit] I just discovered [TargetedPatchingOptOut] as well. Never mind, apparently that one is not needed.

BlueRaja - Danny Pflughoeft

2 Answers

36

Yes, there are more tricks :-)

I've actually done quite a bit of research on optimizing C# code. So far, these are the most significant results:

  1. Funcs and Actions that are passed directly are often inlined by the JITter. Note that you shouldn't store them in a variable first, because they are then called as delegates. See also this post for more details.
  2. Be careful with overloads. Calling Equals without using IEquatable<T> is usually a bad plan - so if you use e.g. a hash table, be sure to implement the right overloads and interfaces, because it'll save you a ton of performance (see the first sketch after this list).
  3. Generics called from other classes are never inlined. The reason for this is the "magic" outlined here.
  4. If you use a data structure, make sure to try an array instead :-) Really, these things are fast as hell compared to... well, just about anything else I suppose. I've optimized quite a few things by writing my own hash tables and by using arrays instead of Lists.
  5. In a lot of cases, table lookups are faster than computing things or using constructs like vtable lookups, switches, multiple if statements and even calculations. This is also a good trick if you have branches; failed branch prediction can often become a big pain. See also this post - it's a trick I use quite a lot in C# and it works great in many cases (see the lookup-table sketch after this list). Oh, and lookup tables are arrays, of course.
  6. Experiment with making (small) classes structs. Because of the nature of value types, some optimizations are different for structs than for classes. For example, method calls are simpler, because the compiler knows exactly which method is going to get called. Also, arrays of structs are usually faster than arrays of classes, because they need one less memory indirection per array access.
  7. Don't use multi-dimensional arrays. While I prefer Foo[], even Foo[][] is normally faster than Foo[,].
  8. If you're copying data, prefer Buffer.BlockCopy over Array.Copy any day of the week (but see the edit and benchmark below). Also be cautious around strings: string operations can be a performance drain.
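
To illustrate point 2 (and point 6), here's a rough sketch - the struct and its fields are made up for illustration - of a small value type implementing IEquatable<T>, so that hash containers don't fall back to the boxing Object.Equals path:

// Made-up example type: implementing IEquatable<T> lets Dictionary<Point2, ...> and
// HashSet<Point2> call the strongly-typed Equals instead of boxing to object.
public struct Point2 : IEquatable<Point2>
{
    public int X, Y;

    public bool Equals(Point2 other)           // no boxing, no reflection
    {
        return X == other.X && Y == other.Y;
    }

    public override bool Equals(object obj)
    {
        return obj is Point2 && Equals((Point2)obj);
    }

    public override int GetHashCode()
    {
        return (X * 397) ^ Y;
    }
}

And a minimal sketch of the lookup-table trick from point 5 (the mapping itself is just an example):

// Replace a switch or if-chain with a single indexed load from an array.
static readonly int[] DaysInMonth = { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };

static int GetDaysInMonth(int month)   // month in 1..12, non-leap year
{
    return DaysInMonth[month - 1];     // one array access instead of branching
}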

There also used to be a guide called "Optimization for the Intel Pentium processor" with a large number of tricks (like shifting or multiplying instead of dividing). While the compiler makes a fine effort nowadays, this still sometimes helps a bit.
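
As a small illustration of that kind of strength reduction (the helper below is invented for the example): for values known to be non-negative, a division by a power of two can be written as a shift.

// offset / 64 and offset >> 6 give the same result as long as offset >= 0;
// for negative values '/' rounds toward zero while '>>' rounds down, so the
// shortcut only applies when you know the sign.
static int BucketOf(int offset)   // hypothetical helper, assumes offset >= 0
{
    return offset >> 6;           // equivalent to offset / 64 here
}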

Of course these are just micro-optimizations; the biggest performance gains usually come from changing the algorithm and/or data structure. Be sure to check which options are available to you, and don't restrict yourself too much to what the .NET Framework provides. I also have a natural tendency to distrust the .NET implementation until I've checked the decompiled code myself; there's a ton of stuff that could have been implemented much faster (though most of the time there are good reasons it wasn't).

HTH


[Edit] Alex pointed out to me that Array.Copy is actually faster according to some people. And since I really don't know what has changed over the years, I decided that the only proper course of action is to create a fresh benchmark and put it to the test.

If you're just interested in the results: scroll down. In most of these tests Buffer.BlockCopy clearly outperforms Array.Copy, although the gap narrows for larger block sizes. Tested on an Intel Skylake with 16 GB memory (>10 GB free) on .NET 4.5.2.

Code:

using System;
using System.Diagnostics;

// Each test copies roughly 1 GB in total, in blocks of K bytes, so every run moves the same amount of data.
static void TestNonOverlapped1(int K)
{
    long total = 1000000000;
    long iter = total / K;
    byte[] tmp = new byte[K];
    byte[] tmp2 = new byte[K];
    for (long i = 0; i < iter; ++i)
    {
        Array.Copy(tmp, tmp2, K);
    }
}

static void TestNonOverlapped2(int K)
{
    long total = 1000000000;
    long iter = total / K;
    byte[] tmp = new byte[K];
    byte[] tmp2 = new byte[K];
    for (long i = 0; i < iter; ++i)
    {
        Buffer.BlockCopy(tmp, 0, tmp2, 0, K);
    }
}

// Overlapped case: copy within the same array, with the destination shifted by 16 bytes.
static void TestOverlapped1(int K)
{
    long total = 1000000000;
    long iter = total / K;
    byte[] tmp = new byte[K + 16];
    for (long i = 0; i < iter; ++i)
    {
        Array.Copy(tmp, 0, tmp, 16, K);
    }
}

static void TestOverlapped2(int K)
{
    long total = 1000000000;
    long iter = total / K;
    byte[] tmp = new byte[K + 16];
    for (long i = 0; i < iter; ++i)
    {
        Buffer.BlockCopy(tmp, 0, tmp, 16, K);
    }
}

static void Main(string[] args)
{
    for (int i = 0; i < 10; ++i)
    {
        int N = 16 << i;

        Console.WriteLine("Block size: {0} bytes", N);

        Stopwatch sw = Stopwatch.StartNew();

        {
            sw.Restart();
            TestNonOverlapped1(N);

            Console.WriteLine("Non-overlapped Array.Copy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
            GC.Collect(GC.MaxGeneration);
            GC.WaitForFullGCComplete();
        }

        {
            sw.Restart();
            TestNonOverlapped2(N);

            Console.WriteLine("Non-overlapped Buffer.BlockCopy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
            GC.Collect(GC.MaxGeneration);
            GC.WaitForFullGCComplete();
        }

        {
            sw.Restart();
            TestOverlapped1(N);

            Console.WriteLine("Overlapped Array.Copy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
            GC.Collect(GC.MaxGeneration);
            GC.WaitForFullGCComplete();
        }

        {
            sw.Restart();
            TestOverlapped2(N);

            Console.WriteLine("Overlapped Buffer.BlockCopy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
            GC.Collect(GC.MaxGeneration);
            GC.WaitForFullGCComplete();
        }

        Console.WriteLine("-------------------------");
    }

    Console.ReadLine();
}

Results on x86 JIT:

Block size: 16 bytes
Non-overlapped Array.Copy: 4267.52 ms
Non-overlapped Buffer.BlockCopy: 2887.05 ms
Overlapped Array.Copy: 3305.01 ms
Overlapped Buffer.BlockCopy: 2670.18 ms
-------------------------
Block size: 32 bytes
Non-overlapped Array.Copy: 1327.55 ms
Non-overlapped Buffer.BlockCopy: 763.89 ms
Overlapped Array.Copy: 2334.91 ms
Overlapped Buffer.BlockCopy: 2158.49 ms
-------------------------
Block size: 64 bytes
Non-overlapped Array.Copy: 705.76 ms
Non-overlapped Buffer.BlockCopy: 390.63 ms
Overlapped Array.Copy: 1303.00 ms
Overlapped Buffer.BlockCopy: 1103.89 ms
-------------------------
Block size: 128 bytes
Non-overlapped Array.Copy: 361.18 ms
Non-overlapped Buffer.BlockCopy: 219.77 ms
Overlapped Array.Copy: 620.21 ms
Overlapped Buffer.BlockCopy: 577.20 ms
-------------------------
Block size: 256 bytes
Non-overlapped Array.Copy: 192.92 ms
Non-overlapped Buffer.BlockCopy: 108.71 ms
Overlapped Array.Copy: 347.63 ms
Overlapped Buffer.BlockCopy: 353.40 ms
-------------------------
Block size: 512 bytes
Non-overlapped Array.Copy: 104.69 ms
Non-overlapped Buffer.BlockCopy: 65.65 ms
Overlapped Array.Copy: 211.77 ms
Overlapped Buffer.BlockCopy: 202.94 ms
-------------------------
Block size: 1024 bytes
Non-overlapped Array.Copy: 52.93 ms
Non-overlapped Buffer.BlockCopy: 38.84 ms
Overlapped Array.Copy: 144.39 ms
Overlapped Buffer.BlockCopy: 154.09 ms
-------------------------
Block size: 2048 bytes
Non-overlapped Array.Copy: 45.64 ms
Non-overlapped Buffer.BlockCopy: 30.11 ms
Overlapped Array.Copy: 118.33 ms
Overlapped Buffer.BlockCopy: 109.16 ms
-------------------------
Block size: 4096 bytes
Non-overlapped Array.Copy: 30.93 ms
Non-overlapped Buffer.BlockCopy: 30.72 ms
Overlapped Array.Copy: 119.73 ms
Overlapped Buffer.BlockCopy: 104.66 ms
-------------------------
Block size: 8192 bytes
Non-overlapped Array.Copy: 30.37 ms
Non-overlapped Buffer.BlockCopy: 26.63 ms
Overlapped Array.Copy: 90.46 ms
Overlapped Buffer.BlockCopy: 87.40 ms
-------------------------

Results on x64 JIT:

Block size: 16 bytes
Non-overlapped Array.Copy: 1252.71 ms
Non-overlapped Buffer.BlockCopy: 694.34 ms
Overlapped Array.Copy: 701.27 ms
Overlapped Buffer.BlockCopy: 573.34 ms
-------------------------
Block size: 32 bytes
Non-overlapped Array.Copy: 995.47 ms
Non-overlapped Buffer.BlockCopy: 654.70 ms
Overlapped Array.Copy: 398.48 ms
Overlapped Buffer.BlockCopy: 336.86 ms
-------------------------
Block size: 64 bytes
Non-overlapped Array.Copy: 498.86 ms
Non-overlapped Buffer.BlockCopy: 329.15 ms
Overlapped Array.Copy: 218.43 ms
Overlapped Buffer.BlockCopy: 179.95 ms
-------------------------
Block size: 128 bytes
Non-overlapped Array.Copy: 263.00 ms
Non-overlapped Buffer.BlockCopy: 196.71 ms
Overlapped Array.Copy: 137.21 ms
Overlapped Buffer.BlockCopy: 107.02 ms
-------------------------
Block size: 256 bytes
Non-overlapped Array.Copy: 144.31 ms
Non-overlapped Buffer.BlockCopy: 101.23 ms
Overlapped Array.Copy: 85.49 ms
Overlapped Buffer.BlockCopy: 69.30 ms
-------------------------
Block size: 512 bytes
Non-overlapped Array.Copy: 76.76 ms
Non-overlapped Buffer.BlockCopy: 55.31 ms
Overlapped Array.Copy: 61.99 ms
Overlapped Buffer.BlockCopy: 54.06 ms
-------------------------
Block size: 1024 bytes
Non-overlapped Array.Copy: 44.01 ms
Non-overlapped Buffer.BlockCopy: 33.30 ms
Overlapped Array.Copy: 53.13 ms
Overlapped Buffer.BlockCopy: 51.36 ms
-------------------------
Block size: 2048 bytes
Non-overlapped Array.Copy: 27.05 ms
Non-overlapped Buffer.BlockCopy: 25.57 ms
Overlapped Array.Copy: 46.86 ms
Overlapped Buffer.BlockCopy: 47.83 ms
-------------------------
Block size: 4096 bytes
Non-overlapped Array.Copy: 29.11 ms
Non-overlapped Buffer.BlockCopy: 25.12 ms
Overlapped Array.Copy: 45.05 ms
Overlapped Buffer.BlockCopy: 47.84 ms
-------------------------
Block size: 8192 bytes
Non-overlapped Array.Copy: 24.95 ms
Non-overlapped Buffer.BlockCopy: 21.52 ms
Overlapped Array.Copy: 43.81 ms
Overlapped Buffer.BlockCopy: 43.22 ms
-------------------------
atlaste
  • You are not right in paragraph 8 at least, because they are identical. So I prefer `Array.Copy` due to type safety. You can find multiple easy-to-google benchmarks on this site yourself, if you don't believe me. – Alex Zhukovskiy May 30 '16 at 15:22
  • @AlexZhukovskiy About #8: I haven't tested these recently, this was the case in 2013. The outcome changed a bit over the years and there were differences between the x86/x64 JIT's. Nowadays I believe they are comparable in performance, which means I would nowadays prefer `Array.Copy`. – atlaste May 30 '16 at 15:39
  • AFAIR, those benchmarks were also posted in '13 or thereabouts. And I'm definitely sure that its code didn't change at all starting from .NET 4.0, and probably from 2.0. – Alex Zhukovskiy May 30 '16 at 15:50
  • @AlexZhukovskiy I really couldn't tell you about that: iirc they're both internal calls, and since I don't work at Microsoft I really cannot tell what changed. I only remember running the benchmarks. If you really want to know, I would simply run the benchmarks; I'd be happy to update the post if they tell a different story. – atlaste May 30 '16 at 16:01
  • [Here](http://stackoverflow.com/a/7069583/2559709) you can find some benchmarks, dated Aug '11. These methods are internal ones, already tuned enough, and I do not expect different results for different target frameworks. If your opinion differs, please provide some benchmarks. The linked benchmark wasn't good enough (no statistical data, manual `Stopwatch` timing code and so on), but I expect similar results from a proper benchmark. – Alex Zhukovskiy May 30 '16 at 17:15
  • @AlexZhukovskiy As I said, I wasn't sure, so I simply put it to the test. I've added both the test and the appropriate code to the answer. – atlaste May 30 '16 at 18:13
  • Thank you for the results. I already upvoted your answer so I cannot do it twice, but you should know that I did :) I'll link my results in a few days if you like; right now I just don't have access to a proper machine. – Alex Zhukovskiy May 31 '16 at 13:22
  • @atlaste take a look at the MSIL instruction cpblk: https://learn.microsoft.com/en-us/dotnet/api/system.reflection.emit.opcodes.cpblk?view=netframework-4.7.2 – TakeMeAsAGuest Mar 05 '19 at 12:11
32

You've exhausted the options added in .NET 4.5 to affect the jitted code directly. The next step is to look at the generated machine code to spot any obvious inefficiencies. Do so with the debugger, but first prevent it from disabling the optimizer: Tools + Options, Debugging, General, untick the "Suppress JIT optimization on module load" option. Then set a breakpoint on the hot code and use Debug + Disassembly to look at it.

There are not that many things to look for; the jitter's optimizer in general does an excellent job. One thing to watch for is a failed attempt at eliminating an array bounds check; the fixed keyword is an unsafe workaround for that (see the sketch below). A corner case is a failed attempt at inlining a method that leaves the jitter unable to use cpu registers effectively, an issue with the x86 jitter that can be addressed with MethodImplOptions.NoInlining. The optimizer is not terribly efficient at hoisting invariant code out of a loop, but that's something you'd almost always consider first anyway when staring at the C# code looking for ways to optimize it.
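
As a hedged illustration of that fixed workaround (the method and loop are invented for the example, and it only helps if the disassembly shows the bounds check really wasn't eliminated):

// Requires the project to allow unsafe code (/unsafe).
static unsafe long SumBytes(byte[] data)
{
    long sum = 0;
    fixed (byte* p = data)                 // pin the array; pointer indexing has no bounds check
    {
        for (int i = 0; i < data.Length; i++)
            sum += p[i];
    }
    return sum;
}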

The most important thing to know is when you are done and just can't hope to make it any faster. You can only really get there by comparing apples and oranges: rewriting the hot code in native code using C++/CLI. Make sure that code is compiled with #pragma unmanaged in effect so it gets the full optimizer love. There's a cost associated with switching from managed to native code execution, so do make sure the execution time of the native code is substantial enough to amortize it. This is otherwise not necessarily easy to do and you certainly won't have a guarantee of success. Still, knowing you are done can save you a lot of time stumbling into dead ends.

Hans Passant
  • Thanks Hans - not exactly what I was looking for, but very helpful nonetheless :) If I find the JIT-er is using registers inefficiently, is there anything I can really do about that? – BlueRaja - Danny Pflughoeft May 01 '13 at 12:22
  • @BlueRaja-DannyPflughoeft I'd try to make variables local instead of class variables if you haven't done so already. For the latter the compiler cannot know for sure that it won't be accessed by some other thread, while the former can safely be put in a register. Also the order in which you declare local variables seems to matter - so these are two things to consider. – atlaste Jun 05 '13 at 08:56
  • @atlaste Absolutely spot on. I develop real-time trading applications and have just done an optimisation sweep whereby I lifted all member variables that were accessed more than once in a method into local variables. This allowed the JIT compiler to enregister them and led to a ~10% performance boost – 0b101010 May 21 '15 at 15:12
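
A rough sketch (not the commenters' actual code; the class and fields are made up for illustration) of the field-to-local hoist described in the last two comments:

class Accumulator
{
    private double _factor;                      // instance field: the JIT must be conservative about reloading it
    private readonly double[] _data = new double[1024];

    public double Sum()
    {
        double factor = _factor;                 // hoist the field into a local once
        double[] data = _data;                   // same for the array reference
        double sum = 0;
        for (int i = 0; i < data.Length; i++)
            sum += data[i] * factor;             // the loop body now reads only locals
        return sum;
    }
}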