How to do Speedy Complex Arithmetic in C#

Question

I'm working on a C# fractal generator that requires a lot of arithmetic with complex numbers, and I'm looking for ways to speed up the math. Below is simplified code that tests the speed of a Mandelbrot calculation using one of three data storage methods, shown in TestNumericsComplex, TestCustomComplex, and TestPairedDoubles. Please understand that the Mandelbrot is just an example; I intend for future developers to be able to create plug-in fractal formulas.

Basically, I see that using System.Numerics.Complex is an OK idea, while using a pair of doubles or a custom Complex struct are passable ideas. I can perform the arithmetic on the GPU, but wouldn't that limit or break portability? I've tried varying the order of the inner loops (i, x, y) to no avail. What else can I do to speed up the inner loops? Am I running into page fault issues? Would a fixed-point number system gain me any speed over floating-point values?

I'm already aware of Parallel.For in C# 4.0; it is omitted from my code samples for clarity. I'm also aware that C# is not usually a good language for high-performance; I'm using C# to take advantage of Reflection for plugins and WPF for windowing.

using System;
using System.Diagnostics;

namespace SpeedTest {
class Program {
    private const int ITER = 512;
    private const int XL = 1280, YL = 1024;

    static void Main(string[] args) {
        var timer = new Stopwatch();
        timer.Start();
        //TODO use one of these three lines
        //TestCustomComplex();
        //TestNumericsComplex();
        //TestPairedDoubles();
        timer.Stop();
        Console.WriteLine(timer.ElapsedMilliseconds);
        Console.ReadKey();
    }

    /// <summary>
    /// ~14000 ms on my machine
    /// </summary>
    static void TestNumericsComplex() {
        var vals = new System.Numerics.Complex[XL,YL];
        var loc = new System.Numerics.Complex[XL,YL];

        for (int x = 0; x < XL; x++) for (int y = 0; y < YL; y++) {
            loc[x, y] = new System.Numerics.Complex((x - XL/2)/256.0, (y - YL/2)/256.0);
            vals[x, y] = new System.Numerics.Complex(0, 0);
        }

        for (int i = 0; i < ITER; i++) {
            for (int x = 0; x < XL; x++)
            for (int y = 0; y < YL; y++) {
                if(vals[x,y].Real>4) continue;
                vals[x, y] = vals[x, y] * vals[x, y] + loc[x, y];
            }
        }
    }


    /// <summary>
    /// ~17000 ms on my machine
    /// </summary>
    static void TestPairedDoubles() {
        var vals = new double[XL, YL, 2];
        var loc = new double[XL, YL, 2];

        for (int x = 0; x < XL; x++) for (int y = 0; y < YL; y++) {
                loc[x, y, 0] = (x - XL / 2) / 256.0;
                loc[x, y, 1] = (y - YL / 2) / 256.0;
                vals[x, y, 0] = 0;
                vals[x, y, 1] = 0;
            }

        for (int i = 0; i < ITER; i++) {
            for (int x = 0; x < XL; x++)
                for (int y = 0; y < YL; y++) {
                    if (vals[x, y, 0] > 4) continue;
                    var a = vals[x, y, 0] * vals[x, y, 0] - vals[x, y, 1] * vals[x, y, 1];
                    var b = vals[x, y, 0] * vals[x, y, 1] * 2;
                    vals[x, y, 0] = a + loc[x, y, 0];
                    vals[x, y, 1] = b + loc[x, y, 1];
                }
        }
    }


    /// <summary>
    /// ~16900 ms on my machine
    /// </summary>
    static void TestCustomComplex() {
        var vals = new Complex[XL, YL];
        var loc = new Complex[XL, YL];

        for (int x = 0; x < XL; x++) for (int y = 0; y < YL; y++) {
            loc[x, y] = new Complex((x - XL / 2) / 256.0, (y - YL / 2) / 256.0);
            vals[x, y] = new Complex(0, 0);
        }

        for (int i = 0; i < ITER; i++) {
            for (int x = 0; x < XL; x++)
            for (int y = 0; y < YL; y++) {
                if (vals[x, y].Real > 4) continue;
                vals[x, y] = vals[x, y] * vals[x, y] + loc[x, y];
            }
        }
    }

}

public struct Complex {
    public double Real, Imaginary;
    public Complex(double a, double b) {
        Real = a;
        Imaginary = b;
    }
    public static Complex operator + (Complex a, Complex b) {
        return new Complex(a.Real + b.Real, a.Imaginary + b.Imaginary);
    }
    public static Complex operator * (Complex a, Complex b) {
        return new Complex(a.Real*b.Real - a.Imaginary*b.Imaginary, a.Real*b.Imaginary + a.Imaginary*b.Real);
    }
}

}

EDIT

GPU seems to be the only feasible solution; I disregard interoperability with C/C++ because I don't feel the speedup would be significant enough to justify forcing interoperability on future plugins.

After looking into the available GPU options (which I've actually been examining for some time now), I've finally found what I believe is an excellent compromise. I've chosen OpenCL with the hope that most devices will support the standard by the time my program is released. OpenCLTemplate uses Cloo to provide an easy-to-understand interface between .NET (for application logic) and "OpenCL C99" (for parallel code). Plugins can include OpenCL kernels for hardware acceleration alongside the standard implementation with System.Numerics.Complex for ease of integration.

I expect the number of available tutorials on writing OpenCL C99 code to grow rapidly as the standard is adopted by processor vendors. This keeps me from needing to enforce GPU coding on plugin developers while providing them with a well-formulated language should they choose to take advantage of the option. It also means that IronPython scripts will have equal access to GPU acceleration despite being unknown until compile-time, since the code will translate directly through OpenCL.

For anyone in the future interested in integrating GPU acceleration with a .NET project, I highly recommend OpenCLTemplate. There is an admitted overhead of learning OpenCL C99. However, it is only slightly harder than learning an alternative API and will likely have better support from examples and general communities.

"I'm also aware that C# is not usually a good language for high-performance" - that is NOT correct. — Mitch Wheat, Feb 16 '11 at 00:02
You are not going to be able to speed up a single complex add/multiply. Rather, you would need to take advantage of larger calculations, possibly pipelining a series of calcs to a GPU — Mitch Wheat, Feb 16 '11 at 00:05
@Mitch I think what he meant was that you can mostly make C++ code faster than C# with some fancy cleverness that isn't available in C#. — Chris, Feb 16 '11 at 00:08
"that is NOT correct" http://www.wisegeek.com/what-is-the-difference-between-fact-and-opinion.htm When making statements like that, please back up with links or facts. — RichardTheKiwi, Feb 16 '11 at 00:12
@cyberkiwi: presumably you meant that to also apply to the poster. Also, this is a voluntary site; how about YOU go find the links? — Mitch Wheat, Feb 16 '11 at 00:18
http://stackoverflow.com/questions/138361/how-much-faster-is-c-than-c — Mitch Wheat, Feb 16 '11 at 00:19
@Mitch - A statement that merely puts up an argument without facts is.. opinion. It doesn't help anyone reading it, so it adds no value. And yes it applies to both question and comment. But that's what answers are for - to resolve the question; that's why it is a question. — RichardTheKiwi, Feb 16 '11 at 00:20
@cyberkiwi: I suspect that applies to many of your comments as well. — Mitch Wheat, Feb 16 '11 at 00:21
@Mitch - happy for you to point them out and I'll clean them up — RichardTheKiwi, Feb 16 '11 at 00:22

thecoshman · Accepted Answer · 2011-02-17T21:23:10.867

I think your best bet is to look at offloading these calculations to a graphics card. OpenCL can use graphics cards for this sort of thing, as can OpenGL shaders.

To really take advantage of this, you want to be calculating in parallel. Let's say you want to take the square root (simple, I know, but the principle is the same) of 1 million numbers. On a CPU you can only do one at a time, or you can work out how many cores you have (it's reasonable to expect, say, 8 cores) and have each core perform the calculation on a subset of the data.
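As a rough illustration of the CPU side of that idea, here is a minimal sketch using Parallel.For, which the question already mentions. The class and method names (ParallelSqrtSketch, SqrtAll) are made up for this example, and how much faster it runs depends entirely on the machine.

```csharp
using System;
using System.Threading.Tasks;

static class ParallelSqrtSketch
{
    // Hypothetical helper: takes the square root of every element,
    // letting the runtime split the index range across the available cores.
    public static double[] SqrtAll(double[] input)
    {
        var output = new double[input.Length];
        // Each index is written by exactly one worker, so no locking is needed.
        Parallel.For(0, input.Length, i =>
        {
            output[i] = Math.Sqrt(input[i]);
        });
        return output;
    }

    static void Main()
    {
        var data = new double[1000000];
        for (int i = 0; i < data.Length; i++) data[i] = i;

        double[] roots = SqrtAll(data);
        Console.WriteLine(roots[144]); // square root of 144
    }
}
```

The work per element here is tiny, so in practice the scheduling overhead can eat much of the gain; the pattern pays off more when each element needs many operations, as in a fractal iteration.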

If you offload your calculation to a graphics card, for example, you would 'feed' in your data as, say, a bunch of 1/4 million 3D points in space (that's four floats per vertex) and then have a vertex shader calculate the square root of each of the x, y, z and w of each vertex. A graphics card has a hell of a lot more cores; even if it only had 100, it could still work on a lot more numbers at once than a CPU.

I can flesh this out with some more info if you want, though I am no expert on the use of shaders; I need to get up to scratch with them anyway.

EDIT

Looking at this relatively cheap card, an NVIDIA GT 220, you can see it has 48 'CUDA' cores. These are what you are using when you use things like OpenCL and shaders.

EDIT 2

Ok, so it seems you're fairly interested in using GPU acceleration. I can't help you with using OpenCL, as I've never looked into it, but I assume it will work much the same as OpenGL/DirectX applications that make use of shaders, but without the actual graphics application. I'm going to talk about the DirectX way of things, as that is what I know (just about), but from my understanding it is more or less the same for OpenGL.

First, you need to create a window. As you want cross-platform, GLUT is probably the best way to go; it's not the best library in the world, but it gives you a window nice and fast. As you are not actually going to show any rendering, you could just make it a tiny window, just big enough to set the title to something like "HARDWARE ACCELERATING".

You get your graphics card set up and ready to render stuff by following tutorials from here. This will get you to the stage where you can create 3D models and 'animate' them on screen.

Next you want to create a vertex buffer that you populate with input data. A vertex would normally be three (or four) floats. If your values are all independent, that's cool. But if you need to group them together, say if you are in fact working with 2D vectors, then you need to make sure you 'pack' the data correctly. Say you want to do maths with 2D vectors and OpenGL is working with 3D vectors: then vector.x and vector.y are your actual input vector and vector.z would just be spare data.
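To make the packing step concrete, here is a small CPU-side sketch of building such a buffer. It assumes a four-float (x, y, z, w) vertex layout with the real part in x and the imaginary part in y; the names VertexPackingSketch, FloatsPerVertex and Pack are invented for this example, and uploading the array to the card is left out because that depends on the graphics API you pick.

```csharp
using System;

static class VertexPackingSketch
{
    // Assumed layout for this sketch: four floats per vertex (x, y, z, w).
    public const int FloatsPerVertex = 4;

    // Packs (real, imaginary) pairs into x and y of each vertex.
    // z and w stay at 0 and act as the 'spare' data.
    public static float[] Pack(double[] real, double[] imaginary)
    {
        if (real.Length != imaginary.Length)
            throw new ArgumentException("real and imaginary must have the same length");

        var buffer = new float[real.Length * FloatsPerVertex];
        for (int i = 0; i < real.Length; i++)
        {
            buffer[i * FloatsPerVertex + 0] = (float)real[i];      // x
            buffer[i * FloatsPerVertex + 1] = (float)imaginary[i]; // y
        }
        return buffer;
    }

    static void Main()
    {
        float[] packed = Pack(new[] { 1.5, -2.0 }, new[] { 0.25, 3.0 });
        Console.WriteLine(string.Join(", ", packed));
    }
}
```

Note the cast from double to float: shaders of this kind generally work in single precision, which matters for deep fractal zooms.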

You see, the vertex shader can only work with one vertex at a time; it can't see more than one vertex as input. You could look into using a geometry shader, which can look at bigger sets of data.

So right, you set up a vertex buffer and pop that over to the graphics card. You also need to write a 'vertex shader'; this is a text file in a sort of C-like language that lets you perform some maths. It is not a full C implementation, mind, but it looks enough like C for you to know what you're doing. The exact ins and outs of OpenGL shaders are beyond me, but I am sure a simple tutorial is easy enough to find.

One thing that you are on your own with is finding out exactly how you can get the output of the vertex shader to go to a second buffer, which is effectively your output. A vertex shader does not change the vertex data in the buffer you set up; that is constant (as far as the shader is concerned), but you can get the shader to output to a second buffer.

Your calculation would look something like this:

createvertexbuffer()
loadShader("path to shader code", vertexshader) // something like this I think
// begin 'rendering'
setShader(myvertexshader)
setvertexbuffer(myvertexbuffer)
drawpoints() // will now 'draw' your points
readoutputbuffer()

I hope this helps. Like I said, I am still learning this, and even then I am learning the DirectX way of things.

Again, using CUDA limits me to NVIDIA, while OpenCL limits me by the newness of the technology. That is, how can I expect the end user to have a valid OpenCL driver installed on their system? How can we use the GPU without running into portability issues? — benjamin.popp, Feb 16 '11 at 00:54
A real answer that doesn't recommend using C/C++? I'm proud of you. — Cody Gray - on strike, Feb 16 '11 at 01:38
Don't be mixing up OpenGL and OpenCL. Both are cross-platform, and if you are using them for hardware-accelerated calculations, both will be limited by the hardware in your machine; if your GPU does not have any programmable shaders then you can't use hardware programmable shaders — thecoshman, Feb 16 '11 at 13:29
So if GPU offloading is the best answer for high volume math (which makes sense), what would be the easiest way to integrate it into the program? GPU.Net requires an extra build step and a license, so it would be unavailable for third party plugin development. Brahma seems to be a good option, but I can't find any good tutorials or up-to-date code samples. It looks like I'd be stuck binding to an OpenCL dll and requiring plugins to include OpenCL kernels. Does anyone know any other good, free integration from OpenCL to .Net? — benjamin.popp, Feb 17 '11 at 17:21

score 0 · Answer 2 · answered Jun 19 '12 at 15:03

Making your custom struct mutable gained me 30%. It reduces the number of calls and the memory usage.

//instead of writing  (in TestCustomComplex())
vals[x, y] = vals[x, y] * vals[x, y] + loc[x, y];

//use
vals[x,y].MutableMultiAdd(loc[x,y]);

//defined in the struct as
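public void MutableMultiAdd(Complex other)
{
    // In-place this = this * this + other.
    // Imaginary is computed from the old Real before Real is overwritten.
    var tempReal = (Real * Real - Imaginary * Imaginary) + other.Real;
    Imaginary = (Real * Imaginary + Imaginary * Real) + other.Imaginary;
    Real = tempReal;
}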
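
For anyone who wants to try the idea in isolation, here is a self-contained sketch of the same technique. The names MutableComplex and SquareAdd are stand-ins chosen for this example so that it does not clash with the Complex struct from the question; I have not re-measured the 30% figure with it.

```csharp
using System;

public struct MutableComplex
{
    public double Real, Imaginary;

    public MutableComplex(double real, double imaginary)
    {
        Real = real;
        Imaginary = imaginary;
    }

    // In-place z = z*z + c; no temporary struct is created or copied back.
    public void SquareAdd(MutableComplex c)
    {
        double newReal = Real * Real - Imaginary * Imaginary + c.Real;
        Imaginary = 2 * Real * Imaginary + c.Imaginary; // still uses the old Real
        Real = newReal;
    }
}

static class MutableStructSketch
{
    static void Main()
    {
        var vals = new MutableComplex[1];
        vals[0] = new MutableComplex(1, 2);
        var c = new MutableComplex(0.5, -1);

        // Indexing an array yields the element itself, so this mutates vals[0] directly.
        vals[0].SquareAdd(c);

        // (1 + 2i)^2 + (0.5 - i) = (-3 + 4i) + (0.5 - i) = -2.5 + 3i
        Console.WriteLine(vals[0].Real + ", " + vals[0].Imaginary);
    }
}
```

This relies on the values living in an array. With a List<T> the indexer returns a copy of the struct, so the same call would modify the copy and leave the stored value unchanged.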

For matrix multiplication you can also use 'unsafe { fixed () { } }' and access your arrays through pointers. Using this I gained 15% for TestCustomComplex().

private static void TestCustomComplex()
    {
        var vals = new Complex[XL, YL];
        var loc = new Complex[XL, YL];

        for (int x = 0; x < XL; x++)
            for (int y = 0; y < YL; y++)
            {
                loc[x, y] = new Complex((x - XL / 2) / 256.0, (y - YL / 2) / 256.0);
                vals[x, y] = new Complex(0, 0);
            }

        unsafe
        {
            fixed (Complex* p = vals, l = loc)
            {
                for (int i = 0; i < ITER; i++)
                {
                    for (int z = 0; z < XL*YL; z++)
                    {
                        if (p[z].Real > 4) continue;
                        p[z] = p[z] * p[z] + l[z];
                    }
                }
            }
        }
    }
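
If compiling with unsafe code is not an option, a similar linear walk can be sketched with plain one-dimensional arrays. To be clear, this is a different technique from the pointer version above: it swaps the two-dimensional array of structs for separate flat arrays of doubles indexed by x * height + y. The names FlatArraySketch and Iterate are made up for this example, and I have not benchmarked it against the numbers quoted here.

```csharp
using System;

static class FlatArraySketch
{
    // Iterates z = z*z + c over a width*height grid held in one-dimensional arrays
    // and returns the real parts after the given number of iterations.
    public static double[] Iterate(int width, int height, int iterations)
    {
        int n = width * height;
        var re = new double[n];
        var im = new double[n];
        var cRe = new double[n];
        var cIm = new double[n];

        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
            {
                int z = x * height + y; // same element order as a [width, height] array
                cRe[z] = (x - width / 2) / 256.0;
                cIm[z] = (y - height / 2) / 256.0;
            }

        for (int i = 0; i < iterations; i++)
            for (int z = 0; z < n; z++)
            {
                if (re[z] > 4) continue; // same simplified escape test as the question
                double a = re[z] * re[z] - im[z] * im[z] + cRe[z];
                im[z] = 2 * re[z] * im[z] + cIm[z];
                re[z] = a;
            }

        return re;
    }

    static void Main()
    {
        double[] result = Iterate(64, 64, 16);
        Console.WriteLine(result.Length); // 64 * 64 = 4096 elements
    }
}
```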

score -1 · Answer 3 · answered Feb 16 '11 at 00:07

Personally, if this is a major issue, I would create a C++ dll and then use that to do the arithmetic. You can call this plugin from C# so you can still take advantage of WPF and reflection etc.

One thing to note is that calling the plugin isn't exactly "fast", so you want to ensure you pass ALL your data in one go and not call it very often.

Chris

It's *extremely* unlikely that the performance discrepancy between C++ and C# is so large that it wouldn't be mitigated by the overhead of invoking a method from a DLL and marshaling data back and forth. This is one of those typically misguided answers that assumes C# and other .NET languages must be "toy" languages because they run managed code, and consequently the speed of a native language like C++ must blow them away by leaps and bounds. As compelling as that story seems to be to so many, it's also fallacious and consistently applying it gets nonsense suggestions like this one. – Cody Gray - on strike Feb 16 '11 at 01:38
