Writing performance critical C# code in C++

Question

I'm currently working on some performance critical code, and I have a particular situation where I'd love to write the whole application in C#, but performance reasons mean C++ ends up being FAR faster.

I did some benchmarking on two different implementations of some code (One in C#, another in C++) and the timings showed that the C++ version was 8 times faster, both versions in release mode and with all optimizations enabled. (Actually, the C# had the advantage of being compiled as 64-bit. I forgot to enable this in the C++ timing)

So I figure, I can write the majority of the code base in C# (Which C# makes very easy to write), and then write native versions of things where the performance is critical. The particular code piece I tested in C# and C++ was one of the critical areas where > 95% of processing time was spent.

What's the recommended wisdom on writing native code here though? I've never written a C# application that calls native C++, so I have no idea what to do. I want to do this in a way that minimizes the cost of having to do the native calls as much as possible.

Thanks!

Edit: Below is most of the code that I'm actually trying to work on. It's for an n-body simulation. 95-99% of the CPU time will be spent in Body.Pairwise().

class Body
{
    public double Mass;
    public Vector Position;
    public Vector Velocity;
    public Vector Acceleration;

    // snip

    public void Pairwise(Body b)
    {
        Vector dr = b.Position - this.Position;
        double r2 = dr.LengthSq();
        double r3i = 1 / (r2 * Math.Sqrt(r2));

        Vector da = r3i * dr;
        this.Acceleration += (b.Mass * da);
        b.Acceleration -= (this.Mass * da);
    }

    public void Predict(double dt)
    {
        Velocity += (0.5 * dt) * Acceleration;
        Position += dt * Velocity;
    }

    public void Correct(double dt)
    {
        Velocity += (0.5 * dt) * Acceleration;
        Acceleration.Clear();
    }
}

I also have a class that just drives the simulation with the following methods:

    public static void Pairwise(Body[] b, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                b[i].Pairwise(b[j]);
    }

    public static void Predict(Body[] b, int n, double dt)
    {
        for (int i = 0; i < n; i++)
            b[i].Predict(dt);
    }

    public static void Correct(Body[] b, int n, double dt)
    {
        for (int i = 0; i < n; i++)
            b[i].Correct(dt);
    }

The main loop looks just like:

for (int s = 0; s < steps; s++)
{
    Predict(bodies, n, dt);
    Pairwise(bodies, n);
    Correct(bodies, n, dt);
}

The above is just the bare minimum of a larger application I'm actually working on. There are some more things going on, but the most performance critical things occur in these three functions. I know the pairwise function is slow (it's n^2), and I do have other methods that are faster (Barnes-Hut for one, which is n log n), but that's beyond the scope of what I'm asking for in this question.

The C++ code is nearly identical:

struct Body
{
public:
    double Mass;
    Vector Position;
    Vector Velocity;
    Vector Acceleration;

    void Pairwise(Body &b)
    {
        Vector dr = b.Position - this->Position;
        double r2 = dr.LengthSq();
        double r3i = 1 / (r2 * sqrt(r2));

        Vector da = r3i * dr;
        this->Acceleration += (b.Mass * da);
        b.Acceleration -= (this->Mass * da);
    }

    void Predict(double dt)
    {
        Velocity += (0.5 * dt) * Acceleration;
        Position += dt * Velocity;
    }

    void Correct(double dt)
    {
        Velocity += (0.5 * dt) * Acceleration;
        Acceleration.Clear();
    }
};

void Pairwise(Body *b, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            b[i].Pairwise(b[j]);
}

void Predict(Body *b, int n, double dt)
{
    for (int i = 0; i < n; i++)
        b[i].Predict(dt);
}

void Correct(Body *b, int n, double dt)
{
    for (int i = 0; i < n; i++)
        b[i].Correct(dt);
}

Main loop:

for (int s = 0; s < steps; s++)
{
    Predict(bodies, n, dt);
    Pairwise(bodies, n);
    Correct(bodies, n, dt);
}

There is also a Vector class that works just like a regular mathematical vector, which I'm not including for brevity.

C# shouldn't be slower. you can achieve faster code using unchecked blocks to avoid overflow checking and other stuff. — Yochai Timmer, Apr 09 '11 at 21:09
http://stackoverflow.com/questions/5326269/is-c-really-slower-than-say-c — Yochai Timmer, Apr 09 '11 at 21:10
@Yochai: I already tried that, by wrapping all my arithmetic in unsafe blocks. It boils down to a few functions which do lots of floating point math, but I'm doing -a lot- of computation each second. Also, I'd like to know how to call C++ from C#, as I already do have some existing code that's written in C++. Some of the newer code I'd like to write isn't computation related and is easier to write in C# though. — Mike Bailey, Apr 09 '11 at 21:11
@Yochai: I'm not so sure that unchecked code is faster than checked code. I've witnessed situations in which it's consistently marginally slower, though I admit I can't explain why. — user541686, Apr 09 '11 at 21:13
@Mike Bantegui, this is an old question, but you should check out C++ AMP for your N Body simulation. It can give an even better performance boost by letting you run it on GPU hardware. Should be simple since you are already calling unmanaged code. — Kratz, Nov 21 '14 at 14:27

Hans Passant · Accepted Answer · 2011-04-10T06:20:46.050

8

You'll need to interface to the native code. You could put it in a DLL and P/Invoke it, which is okay when you don't transition very often and the interface is thin. The most flexible and speediest solution is to write a ref class wrapper in the C++/CLI language. Have a look at this magazine article for an introduction.
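As a rough sketch of what a thin, rarely crossed interface could look like for the n-body loop in the question: the native side exports a single function that runs many steps per call, so the managed-to-native transition is paid once per batch rather than once per pair. The name RunSteps, the NBODY_API macro, and the flat array layout (three doubles per body) are assumptions made for illustration; none of them come from the question.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical export macro; only the Windows branch produces a DLL export.
#if defined(_WIN32)
#define NBODY_API extern "C" __declspec(dllexport)
#else
#define NBODY_API extern "C"
#endif

// Hypothetical entry point: runs `steps` predict/pairwise/correct cycles per
// call, mirroring the main loop in the question. pos, vel and acc each hold
// 3 * n doubles (x, y, z per body); mass holds n doubles.
NBODY_API void RunSteps(const double *mass, double *pos, double *vel,
                        double *acc, int n, int steps, double dt)
{
    assert(n >= 0 && steps >= 0);
    for (int s = 0; s < steps; s++)
    {
        // Predict: half-kick the velocity, then move the position.
        for (int k = 0; k < 3 * n; k++)
        {
            vel[k] += 0.5 * dt * acc[k];
            pos[k] += dt * vel[k];
        }

        // Pairwise: accumulate accelerations for every unordered pair.
        for (int i = 0; i < n; i++)
        {
            for (int j = i + 1; j < n; j++)
            {
                const double dx = pos[3 * j] - pos[3 * i];
                const double dy = pos[3 * j + 1] - pos[3 * i + 1];
                const double dz = pos[3 * j + 2] - pos[3 * i + 2];
                const double r2 = dx * dx + dy * dy + dz * dz;
                const double r3i = 1.0 / (r2 * std::sqrt(r2));

                acc[3 * i] += mass[j] * r3i * dx;
                acc[3 * i + 1] += mass[j] * r3i * dy;
                acc[3 * i + 2] += mass[j] * r3i * dz;
                acc[3 * j] -= mass[i] * r3i * dx;
                acc[3 * j + 1] -= mass[i] * r3i * dy;
                acc[3 * j + 2] -= mass[i] * r3i * dz;
            }
        }

        // Correct: second half-kick, then clear the accumulators.
        for (int k = 0; k < 3 * n; k++)
        {
            vel[k] += 0.5 * dt * acc[k];
            acc[k] = 0.0;
        }
    }
}
```

On the C# side this would be declared with DllImport and called with plain double[] arrays. Arrays of blittable types such as double are normally pinned rather than copied during marshaling, but that is worth confirming with a profiler for your own setup.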

Last but not least, you really ought to profile the C# code. A factor of 8 is quite excessive. Don't get started on this until you at least have half an idea why it is that slow. You don't want to repro the cause in the C++ code, that would ruin a week of work.

And beware of the wrong instincts. 64-bit code is not actually faster; it is usually a bit slower than x86 code. It's got a bunch of extra registers, which is very nice, but all the pointers are double the size and you don't get double the CPU cache.

edited Apr 10 '11 at 06:20

answered Apr 09 '11 at 21:13

Hans Passant


I redid the timings, making sure the input data was identical, and it's still 5x slower. Are there any things I should watch out for that could easily sap performance in C#? – Mike Bailey Apr 09 '11 at 21:35
Almost doubling the performance by simply redoing the timings would be a Red Flag in my book. There's no generic play-book for optimizing C# code, only good profilers that show you where the cycles go. Obvious mistakes are profiling the Debug build or with the debugger attached. – Hans Passant Apr 09 '11 at 21:46
I think the "twice as fast" came from the fact that I ran through Visual Studio. I can consistently get timings of 500 ms (C++) and 2000 ms (C#) for a tiny computation. I did profile and did confirm the one method where all of the cycles were spent, which was identical in both platforms and had the same percentage of CPU time (99%). – Mike Bailey Apr 09 '11 at 21:56
@Mike Bantegui: Hans is right. You gotta find out what's *really* going on. Of course, when I hear "where the cycles go", and "where all of the cycles were spent" I'm immediately suspicious, because all too often the biggest time wasters are innocent-looking or even invisible function calls, which disguise themselves by causing the time to be spent elsewhere. I would encourage you to get down and dirty - step it at the instruction level, or do [this](http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024). – Mike Dunlavey Apr 10 '11 at 03:50

score 2 · Answer 2 · answered Apr 09 '11 at 21:34

You have two choices: P/Invoking and C++/CLI.

P/Invoking

By using P/Invoke, or Platform Invoke, it is possible for .NET (and therefore C#) to call into unmanaged code (your C++ code). It can be a bit overwhelming, but it is definitely possible to have your C# code call into performance critical C++ code.

Some MSDN links to get you started:

Basically, you will create a C++ DLL that has defined all the unmanaged functions you want to call from C#. Then, in C# you will use the DllImportAttribute to import that function into C#.

For instance, you have a C++ project that creates a Monkey.dll with the following function:

extern "C" __declspec(dllexport) void FastMonkey();

You will then have a definition in C# as follows:

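using System.Runtime.InteropServices;

class NativeMethods
{
    // The enum member is spelled Cdecl; it must match the native function's calling convention.
    [DllImport("Monkey.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void FastMonkey();
}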

You can then call the C++ function in C# by calling NativeMethods.FastMonkey.
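For completeness, here is a minimal sketch of what the native side of Monkey.dll might contain. The MONKEY_API macro and the FastMonkeyCallCount helper are hypothetical, added only so the example has something observable. The non-Windows branch assumes GCC or Clang.

```cpp
// Hypothetical export macro. extern "C" turns off C++ name mangling so the
// exported symbol is literally "FastMonkey", which is what DllImport looks up.
#if defined(_WIN32)
#define MONKEY_API extern "C" __declspec(dllexport)
#else
#define MONKEY_API extern "C" __attribute__((visibility("default")))
#endif

// Stand-in for real work: count how often the function was called.
static int g_fastMonkeyCalls = 0;

// No calling convention is specified, so the compiler default applies
// (cdecl for 32-bit MSVC builds).
MONKEY_API void FastMonkey()
{
    ++g_fastMonkeyCalls;
}

// Hypothetical helper so a caller can observe the side effect.
MONKEY_API int FastMonkeyCallCount()
{
    return g_fastMonkeyCalls;
}
```

Without extern "C" the exported name would be decorated by the C++ compiler, and the DllImport declaration would fail to find the entry point unless the decorated name were given explicitly.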

A few common gotchas and notes:

Spend time learning Interop Marshaling. Understanding this will greatly help in creating proper P/Invoke definitions.
The default P/Invoke calling convention is StdCall, but C++ defaults to cdecl (CallingConvention.Cdecl).
The default character set is ANSI, so if you want to marshal Unicode strings, you will have to update your DllImport definition (see MSDN - DllImport.CharSet documentation).
http://www.pinvoke.net/ is a useful resource for learning how to P/Invoke standard Windows function calls. You can also use it for a clue on how to marshal something if you know of a Windows function call that is similar.
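One way to keep marshaling simple is to pass only data whose memory layout is the same on both sides of the boundary. The sketch below shows compile-time checks that could be placed on the native side. Vec3 and BodyData are hypothetical names, and the sketch assumes the C# side declares matching structs with sequential layout and the same field order.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical plain-data types meant to cross the P/Invoke boundary.
struct Vec3
{
    double X, Y, Z;
};

struct BodyData
{
    double Mass;
    Vec3 Position;
    Vec3 Velocity;
    Vec3 Acceleration;
};

// No virtual functions or mixed access control, so the field order in memory
// follows the declaration order.
static_assert(std::is_standard_layout<BodyData>::value,
              "BodyData must be standard layout");

// Safe to copy byte for byte, which is what a marshaler effectively does.
static_assert(std::is_trivially_copyable<BodyData>::value,
              "BodyData must be trivially copyable");

// Ten doubles and nothing else: no hidden padding between fields.
static_assert(sizeof(BodyData) == 10 * sizeof(double),
              "unexpected padding in BodyData");
static_assert(offsetof(BodyData, Position) == sizeof(double),
              "Position must directly follow Mass");
```

If one of these assertions fails after a change, the build breaks instead of the managed side silently reading fields at the wrong offsets.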

C++/CLI

C++/CLI is a series of extensions to C++ created by Microsoft to create .NET assemblies with C++. C++/CLI also allows you to mix unmanaged and managed code together into a "mixed" assembly. You can create a C++/CLI assembly that contains both your performance critical code and any .NET class wrapper around it you want.

For more information on C++/CLI, I recommend starting with MSDN - Language Features for Targeting the CLR and MSDN - Native and .NET Interoperability.

I recommend you start with the P/Invoking route. I have found having a clear separation between unmanaged and managed code helps to simplify things.

score 1 · Answer 3 · answered Jun 09 '11 at 13:39


In C#, is Vector a class or struct? I suspect it's a class, and Arthur Stankevich hit the nail on the head with his observation that you may be allocating many of these. Try making Vector a struct, or reusing the same Vector objects.

answered Jun 09 '11 at 13:39

Keith Robertson


Yes, I'm wondering the same thing. I've seen many times that a Vector type is implemented the Java way, which causes a lot of allocations, while in C++ the Vector's operations can mostly be inlined. To achieve performance comparable to C++, the Vector type should be implemented as a struct, and operations should prefer ref parameters over overloaded operators so that they can be better inlined by the jitter. – Dudu Nov 15 '11 at 09:28

Yochai Timmer · Answer 4 · 2011-04-09T21:19:47.080

0

The easiest way to do it is to create C++ ActiveX DLLs.

Then you can reference them in the C# project, and Visual Studio will create interops that wrap the ActiveX COM object.

You can use the interop code like C# code, with no additional wrapping code.

More about ActiveX / C#:

Create and Use a C++ ActiveX component within a .NET environment

edited Apr 09 '11 at 21:19

answered Apr 09 '11 at 21:13

Yochai Timmer


Is this the most efficient way to do this? Or would P/Invoke be more efficient? I don't mind one way being harder, as long as it means I can extract as much performance as possible. – Mike Bailey Apr 09 '11 at 21:33
It's easy to do. I don't think there's a difference in communication overhead. – Yochai Timmer Apr 10 '11 at 04:04

score 0 · Answer 5 · answered Apr 09 '11 at 21:39


"I did some benchmarking on two different implementations of some code (One in C#, another in C++) and the timings showed that the C++ version was 8 times faster"

I did some numerical calculations in C#, C++, Java and a bit of F#, and the biggest difference between C# and C++ was a factor of 3.5.

Profile your C# version and find the bottleneck (maybe there are some I/O-related problems or unnecessary allocations).

answered Apr 09 '11 at 21:39

Lukasz Madon


There is no IO bottleneck (no actual IO beyond loading the test data), I re-checked the code and I can confirm that there's one method (on both C# and C++) that does nothing but raw computation, and it takes up 99% of the time. – Mike Bailey Apr 09 '11 at 21:47
Sure, I can post the particular parts that are known to be expensive. It's nothing more than a gravity simulation. Give me a few minutes and I'll update my main post. – Mike Bailey Apr 09 '11 at 23:07

score 0 · Answer 6 · answered Apr 09 '11 at 22:45

P/Invoke is definitely easier than COM Interop for the simple case. However, if you do bigger chunks of a class model in C++, you might really want to consider C++/CLI or COM Interop.

ATL lets you whip up a class in no time, and once the object is instantiated, the invocation overhead is basically as small as with P/Invoke (unless you use dynamic dispatch, IDispatch, but that should be obvious).

Of course, C++/CLI is the very best option there, but that's not going to work everywhere. P/Invoke can be made to work everywhere. COM interop is supported on Mono up to a degree.

score 0 · Answer 7 · answered May 26 '11 at 01:27

Looks like you are doing a lot of implicit Vector class allocations in your code:

Vector dr = b.Position - this.Position;
...
Vector da = r3i * dr;
this.Acceleration += (b.Mass * da);
b.Acceleration -= (this.Mass * da);

Try reusing already allocated memory.
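To illustrate the idea, here is a sketch of the pairwise kernel written so that only one displacement vector is built and the accelerations are updated in place. It is written in C++ to match the native version in the question. The AddScaled method is a hypothetical addition to the Vector type, and the same shape would carry over to C# with Vector as a struct. The update is mathematically the same as the original, though rounding may differ slightly.

```cpp
#include <cassert>
#include <cmath>

struct Vector
{
    double X, Y, Z;

    // Hypothetical helper: accumulate s * v into this vector in place,
    // instead of building a scaled temporary and then adding it.
    void AddScaled(double s, const Vector &v)
    {
        X += s * v.X;
        Y += s * v.Y;
        Z += s * v.Z;
    }
};

struct Body
{
    double Mass;
    Vector Position;
    Vector Velocity;
    Vector Acceleration;
};

// Same physics as Body.Pairwise in the question, with one displacement
// vector and two in-place updates.
void Pairwise(Body &a, Body &b)
{
    // A body must not interact with itself: r2 would be zero.
    assert(&a != &b);

    const Vector dr = {b.Position.X - a.Position.X,
                       b.Position.Y - a.Position.Y,
                       b.Position.Z - a.Position.Z};
    const double r2 = dr.X * dr.X + dr.Y * dr.Y + dr.Z * dr.Z;
    const double r3i = 1.0 / (r2 * std::sqrt(r2));

    a.Acceleration.AddScaled(b.Mass * r3i, dr);
    b.Acceleration.AddScaled(-a.Mass * r3i, dr);
}
```

In C++ the temporaries in the original were already stack values, so the gain there is small. The point of the sketch is the shape of the code, which avoids a heap allocation per operation if Vector is a class in C#.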
