
As a learning experience I recently tried implementing Quicksort with 3-way partitioning in C#.

Apart from needing to add an extra range check on the left/right variables before the recursive call, it appears to work quite well.
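In outline it looks something like the following, which is a simplified sketch of the approach rather than my exact code (the class and method names are just for illustration):

```csharp
using System.Collections.Generic;

static class ThreeWayQuickSort
{
    public static void Sort(List<int> items)
    {
        Sort(items, 0, items.Count - 1);
    }

    private static void Sort(List<int> items, int left, int right)
    {
        if (left >= right)
            return;

        int pivot = items[left];
        int lt = left;        // items[left..lt-1]  < pivot
        int gt = right;       // items[gt+1..right] > pivot
        int i = left + 1;     // items[lt..i-1]    == pivot

        while (i <= gt)
        {
            if (items[i] < pivot)
                Swap(items, lt++, i++);
            else if (items[i] > pivot)
                Swap(items, i, gt--);
            else
                i++;
        }

        // The extra range checks mentioned above guard the recursive calls.
        if (left < lt - 1)
            Sort(items, left, lt - 1);
        if (gt + 1 < right)
            Sort(items, gt + 1, right);
    }

    private static void Swap(List<int> items, int a, int b)
    {
        int tmp = items[a];
        items[a] = items[b];
        items[b] = tmp;
    }
}
```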

I knew beforehand that the framework provides a built-in Quicksort implementation in List<>.Sort (via Array.Sort). So I tried some basic profiling to compare performance. Results: The built-in List<>.Sort method, operating on the same lists, performs about 10 times faster than my own manual implementation.
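The profiling was nothing sophisticated, essentially a Stopwatch comparison along these lines (the list size and seed here are just for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class SortBenchmark
{
    static void Main()
    {
        var random = new Random(12345);
        var original = new List<int>(1000000);
        for (int i = 0; i < 1000000; i++)
            original.Add(random.Next());

        // Built-in sort (Array.Sort under the covers).
        var builtIn = new List<int>(original);
        var sw = Stopwatch.StartNew();
        builtIn.Sort();
        sw.Stop();
        Console.WriteLine("List<>.Sort:     {0} ms", sw.ElapsedMilliseconds);

        // The manual 3-way quicksort from the sketch above.
        var manual = new List<int>(original);
        sw = Stopwatch.StartNew();
        ThreeWayQuickSort.Sort(manual);
        sw.Stop();
        Console.WriteLine("3-way quicksort: {0} ms", sw.ElapsedMilliseconds);
    }
}
```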

Using Reflector, I found that the actual sorting in List<>.Sort is implemented in external code rather than IL (in a function named tryszsort()).

Looking at my own Quicksort implementation, I would expect that replacing the recursive calls with iteration might give some improvement. Disabling array bounds checking (if possible) could also give some benefit. Maybe this would get some way closer to the built-in implementation, but I'm not confident.
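For example, the recursion could be replaced with an explicit stack of pending ranges, along the lines of this sketch (I haven't measured this variant):

```csharp
using System.Collections.Generic;

static class IterativeQuickSort
{
    private struct Range
    {
        public int Left, Right;
        public Range(int left, int right) { Left = left; Right = right; }
    }

    // Same 3-way partitioning, but pending sub-ranges are kept on an
    // explicit stack instead of using recursion.
    public static void Sort(List<int> items)
    {
        var pending = new Stack<Range>();
        pending.Push(new Range(0, items.Count - 1));

        while (pending.Count > 0)
        {
            Range r = pending.Pop();
            int left = r.Left, right = r.Right;
            if (left >= right)
                continue;

            int pivot = items[left];
            int lt = left, gt = right, i = left + 1;

            while (i <= gt)
            {
                if (items[i] < pivot)
                    Swap(items, lt++, i++);
                else if (items[i] > pivot)
                    Swap(items, i, gt--);
                else
                    i++;
            }

            pending.Push(new Range(left, lt - 1));
            pending.Push(new Range(gt + 1, right));
        }
    }

    private static void Swap(List<int> items, int a, int b)
    {
        int tmp = items[a];
        items[a] = items[b];
        items[b] = tmp;
    }
}
```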

So my question: Is it realistic to expect that an optimised algorithm written in .NET IL and jitted to native code can compete with the performance of an externally implemented algorithm?

Once again, I realise Quicksort is provided as part of the framework, this was just a learning experience for me. However there are also many algorithms (CRC32 comes to mind) that are not provided, but still could be of much value to many applications. Here's a related question regarding implementing CRC32 in .NET and performance issues.

So if you need to implement such an algorithm in .NET, what are the major performance considerations to understand, so that your algorithm can at least approach the performance of external code?

[Update]

I have improved execution speed to within about 10% of the built-in Array.Sort by changing the algorithm to operate on a simple array of int instead of a List<int>. In Reflector, I can see this avoids a callvirt instruction on every get or set on the list. I thought this might improve things, but I'm surprised by how much.
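The change was essentially switching the partitioning code from List<int> indexing to direct array indexing, along these lines (sketch only):

```csharp
// Same algorithm, but operating on int[] rather than List<int>, so each
// element access is a direct array load/store instead of a callvirt to
// List<int>'s indexer.
static void Sort(int[] items, int left, int right)
{
    if (left >= right)
        return;

    int pivot = items[left];
    int lt = left, gt = right, i = left + 1;

    while (i <= gt)
    {
        if (items[i] < pivot)
        {
            int tmp = items[lt]; items[lt] = items[i]; items[i] = tmp;
            lt++; i++;
        }
        else if (items[i] > pivot)
        {
            int tmp = items[i]; items[i] = items[gt]; items[gt] = tmp;
            gt--;
        }
        else
        {
            i++;
        }
    }

    if (left < lt - 1) Sort(items, left, lt - 1);
    if (gt + 1 < right) Sort(items, gt + 1, right);
}
```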

Ash
  • I've always been under the impression that most quick-sort implementations will switch to merge sort after descending a predetermined number of levels of recursion. Would you mind posting your code to see areas for improvement? – Juliet Jan 27 '10 at 00:32

3 Answers


By using non-recursive code and, especially, by using "unsafe" blocks and pointer arithmetic (if applicable), you could actually see a 5x or 10x performance gain with an algorithm written in C#. As always with performance (and even more so when dealing with a managed environment), you never quite know until you try it and benchmark it.
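As a very rough illustration of what I mean by pointer arithmetic (a sketch only; it needs the /unsafe compiler switch, the method name is just for the example, and you'd want to benchmark it before trusting it):

```csharp
// Illustrative only: scan the array through pointers inside a fixed block,
// so element access doesn't go through the bounds-checked array indexer.
static unsafe int CountLessThan(int[] data, int pivot)
{
    if (data.Length == 0)
        return 0;

    int count = 0;
    fixed (int* start = data)
    {
        int* p = start;
        int* end = start + data.Length;
        while (p < end)
        {
            if (*p < pivot)
                count++;
            p++;
        }
    }
    return count;
}
```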

Now, generally speaking, you should mostly write stuff in C#, then optimize it, optimize it some more, and if it's still not good enough, identify the exact critical piece of code and port it to C++ (while being careful about limiting the number of managed/native call boundaries).
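If you do end up porting a critical piece to C++, the managed side would typically be a single coarse-grained P/Invoke declaration, something like this (the DLL and function names here are hypothetical):

```csharp
using System.Runtime.InteropServices;

static class NativeSort
{
    // Hypothetical native routine exported from your own C++ DLL. The point
    // is to cross the managed/native boundary once per sort, not once per
    // element or per comparison.
    [DllImport("MySortLib.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void SortInt32(int[] data, int length);
}
```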

Ludovic Chabant
  • Interesting points on unsafe mode. I tried to avoid this by default, but in some cases pointer arithmetic might be the only option. I agree completely about the optimization approach; I also try not to get caught up in micro-optimizations, as improving the overall design can give much bigger improvements. – Ash Nov 13 '09 at 10:58
  • +1, but could you elaborate on how an unchecked block would increase performance here? Because unchecked is the default, and I don't think it has any effect on array bounds checks. – JulianR Nov 13 '09 at 19:34
  • Note that porting to C++ won't always give a perf benefit, as there's a penalty for managed-to-unmanaged calls that go through P/Invoke, because of calling convention mismatch and the fact that both cdecl and stdcall pass arguments on the stack. Most of the `extern` methods in the BCL use a "fast" calling convention which relies on the native code implementing the method knowing implementation details of the CLR, so they don't pay that price. – Pavel Minaev Nov 13 '09 at 20:20
  • Good points Pavel. That's why I said to be careful about native/managed boundaries, but I didn't explain it. Julian, I don't think "unchecked" is the default, as you get array out of bounds exceptions if you use incorrect indexes. – Ludovic Chabant Nov 14 '09 at 00:05
  • Arrays are always bounds checked, unless the compiler can deduce it's not necessary or you use pointer arithmetic. Unchecked blocks have no effect here AFAIK. What I mean by "unchecked by default" is that overflows of primitive types by default do not generate exceptions unless you wrap the code in a `checked` block or use the compiler option. And when using the compiler option, `unchecked` allows the block to behave like the default. Unless I'm mistaken, `unchecked` has no role in performance optimizations. – JulianR Nov 14 '09 at 00:29
  • Ah yes, I got the "unchecked" thing wrong again (yes, it's not the first time I think it does something else :) ). Thanks! – Ludovic Chabant Nov 14 '09 at 15:52

Just out of curiosity (despite my 9 years of experience with .NET, I still constantly make this mistake): did you compile your code in Release mode with optimizations on? Debug code performs significantly worse than optimized release code.

Assuming you DID compile in Release mode, there shouldn't be a huge difference in performance if you implement the algorithm similarly (i.e. iterative vs. iterative or recursive vs. recursive). If you wish to see the .NET implementation and figure it out, you can download the SSCLI, the Shared Source Common Language Infrastructure. This is Microsoft's publicly available, ECMA-standard-compliant CLI implementation. It is not 100% of the .NET Framework we all know and love, but it is a significant portion of it. It can provide a lot of information that Reflector can't, including internal implementations. All types of code are available, including C#, C++, and even some assembler in a couple of cases.
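One quick sanity check, just a sketch, is to ask the current assembly whether the JIT optimizer has been disabled, which catches an accidental Debug-build benchmark:

```csharp
using System;
using System.Diagnostics;
using System.Reflection;

static class OptimizationCheck
{
    static void Main()
    {
        // Debug builds emit a DebuggableAttribute that disables JIT
        // optimizations; Release builds normally leave them enabled.
        object[] attrs = Assembly.GetExecutingAssembly()
            .GetCustomAttributes(typeof(DebuggableAttribute), false);

        bool jitOptimizerDisabled = attrs.Length > 0 &&
            ((DebuggableAttribute)attrs[0]).IsJITOptimizerDisabled;

        Console.WriteLine("JIT optimizer disabled: " + jitOptimizerDisabled);
    }
}
```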

jrista
  • Yes, I compiled in Release mode, however I didn't explicitly check for any optimizations; I assumed setting Release was enough. I'll have to look at this, thanks for the tip. Thanks also for the link to the SSCLI; it will be interesting to see if/how Sort is implemented. – Ash Nov 13 '09 at 10:51
  • *"Debug code performs significantly worse than optimized release code."* - Actually, this is not true for .Net; the performance for both is usually comparable. What really matters is whether or not you have the debugger attached - the JIT (which does most of the optimizations) does not optimize by default when the debugger is attached (ie. when running in visual studio). You can change this in VS under `Tools->Options->Debugging->Supress JIT Optimization` – BlueRaja - Danny Pflughoeft May 01 '13 at 12:17
  • @BlueRaja: That is not actually true. JIT Optimization is an assembly metadata flag, and it disables optimizations entirely, debugging or otherwise. I have performed some fairly extensive testing, particularly with WCF and ASP.NET, with Debug and Release modes, no debugger attached. Debug mode, JIT Optimizations disabled, is significantly slower. For example, a simple WCF calculator service (add, sub, div, mul) operated at ~180msg/s when Debug mode was enabled...~30,000msg/s when Release mode was enabled. NO DEBUGGER was attached. Similar differences with some aspects of ASP.NET. – jrista May 01 '13 at 14:20
  • Yes, sorry, I spoke too tongue-in-cheek. What I meant to say was, what really matters is **if the JIT Optimizer is enabled.** This is disabled by default when the debugger is attached. It's partially-enabled in Debug build, but can be fully re-enabled with some [.ini flags](http://goo.gl/GAuMY). However, the reason WCF is *so* much slower in debug build has nothing to do with any of this - WCF adds diagnostic code/messages when the `DEBUG` preprocessor-symbol is found. It also uses the VS WCF service host (rather than IIS) which is meant strictly for development. – BlueRaja - Danny Pflughoeft May 01 '13 at 17:30
  • Hmm, our tests were hosted in IIS 7.0 at the time... I don't think what the cause of slowness during debug really matters though. I know a lot of Microsoft code involves a lot of tracing and other diagnostic junk when Debug "mode" is enabled (and all the compiler flags and other settings that come with it.) Since the vast majority of .NET developer code is built on the .NET Framework, all of Microsoft's diagnostic junk is weaved into your own code whenever you call it. There are a lot of reasons to make sure that you build Release code for final prod...otherwise the perf hit could be BIG! – jrista May 01 '13 at 19:06

Make sure you're comparing apples and apples.

When sorting, it's possible for the Compare function to be dominant, and that might differ between the implementations.

Assuming the Compare function in both cases is fast enough not to be an issue, then the time can be dominated by stuff like array bounds checking, which can easily make a big difference.
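For example, sorting through a comparison delegate pays an indirect call per compare, while the plain int[] overload compares values directly; a small sketch of the difference (illustrative only):

```csharp
using System;

static class CompareCost
{
    static void Main()
    {
        int[] a = { 5, 3, 8, 1 };
        int[] b = { 5, 3, 8, 1 };

        // Sorting with the default comparison for int.
        Array.Sort(a);

        // Comparison via a delegate: every compare is an indirect call,
        // which can easily dominate when the elements themselves are cheap.
        Array.Sort(b, delegate(int x, int y) { return x.CompareTo(y); });

        Console.WriteLine(a[0] == b[0]); // both sorted the same way
    }
}
```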

Mike Dunlavey
  • Good tip. In this case I'm currently just using basic Int32 comparisons. I did find that sorting a simple int array is much faster than a List<int>. It appears that there is a "callvirt" for every get and set on the List. Removing this by using an array halved (at least) the execution time. – Ash Nov 14 '09 at 04:14