I'm working on a C# Fractal Generator project right now that requires lots of Arithmetic with Complex numbers, and I'm trying to think of ways to speed up the math. Below is a simplified set of code that tests the speed of a Mandelbrot calculation using one of three data storage methods, shown in TestNumericsComplex
, TestCustomComplex
, and TestPairedDoubles
. Please understand that the Mandelbrot is just an example, and I intend for future developers to be able to create plug-in fractal formulas.
Basically I see that using System.Numerics.Complex
is an ok idea, while using a pair of doubles or a custom Complex struct are passable ideas. I can perform the arithmetic using the gpu, but wouldn't that limit or break portability? I've tried varying the order of the inner loops (i, x, y) to no avail. What else can I do to help speed up the inner loops? Am I running into page fault issues? Would using a fixed-point number system gain me any speed as opposed to the floating-point values?
I'm already aware of Parallel.For
in C# 4.0; it is omitted from my code samples for clarity. I'm also aware that C# is not usually a good language for high-performance; I'm using C# to take advantage of Reflection for plugins and WPF for windowing.
using System;
using System.Diagnostics;
namespace SpeedTest {
class Program {
private const int ITER = 512;
private const int XL = 1280, YL = 1024;
static void Main(string[] args) {
var timer = new Stopwatch();
timer.Start();
//TODO use one of these two lines
//TestCustomComplex();
//TestNumericsComplex();
//TestPairedDoubles();
timer.Stop();
Console.WriteLine(timer.ElapsedMilliseconds);
Console.ReadKey();
}
/// <summary>
/// ~14000 ms on my machine
/// </summary>
static void TestNumericsComplex() {
var vals = new System.Numerics.Complex[XL,YL];
var loc = new System.Numerics.Complex[XL,YL];
for (int x = 0; x < XL; x++) for (int y = 0; y < YL; y++) {
loc[x, y] = new System.Numerics.Complex((x - XL/2)/256.0, (y - YL/2)/256.0);
vals[x, y] = new System.Numerics.Complex(0, 0);
}
for (int i = 0; i < ITER; i++) {
for (int x = 0; x < XL; x++)
for (int y = 0; y < YL; y++) {
if(vals[x,y].Real>4) continue;
vals[x, y] = vals[x, y] * vals[x, y] + loc[x, y];
}
}
}
/// <summary>
/// ~17000 on my machine
/// </summary>
static void TestPairedDoubles() {
var vals = new double[XL, YL, 2];
var loc = new double[XL, YL, 2];
for (int x = 0; x < XL; x++) for (int y = 0; y < YL; y++) {
loc[x, y, 0] = (x - XL / 2) / 256.0;
loc[x, y, 1] = (y - YL / 2) / 256.0;
vals[x, y, 0] = 0;
vals[x, y, 1] = 0;
}
for (int i = 0; i < ITER; i++) {
for (int x = 0; x < XL; x++)
for (int y = 0; y < YL; y++) {
if (vals[x, y, 0] > 4) continue;
var a = vals[x, y, 0] * vals[x, y, 0] - vals[x, y, 1] * vals[x, y, 1];
var b = vals[x, y, 0] * vals[x, y, 1] * 2;
vals[x, y, 0] = a + loc[x, y, 0];
vals[x, y, 1] = b + loc[x, y, 1];
}
}
}
/// <summary>
/// ~16900 ms on my machine
/// </summary>
static void TestCustomComplex() {
var vals = new Complex[XL, YL];
var loc = new Complex[XL, YL];
for (int x = 0; x < XL; x++) for (int y = 0; y < YL; y++) {
loc[x, y] = new Complex((x - XL / 2) / 256.0, (y - YL / 2) / 256.0);
vals[x, y] = new Complex(0, 0);
}
for (int i = 0; i < ITER; i++) {
for (int x = 0; x < XL; x++)
for (int y = 0; y < YL; y++) {
if (vals[x, y].Real > 4) continue;
vals[x, y] = vals[x, y] * vals[x, y] + loc[x, y];
}
}
}
}
public struct Complex {
public double Real, Imaginary;
public Complex(double a, double b) {
Real = a;
Imaginary = b;
}
public static Complex operator + (Complex a, Complex b) {
return new Complex(a.Real + b.Real, a.Imaginary + b.Imaginary);
}
public static Complex operator * (Complex a, Complex b) {
return new Complex(a.Real*b.Real - a.Imaginary*b.Imaginary, a.Real*b.Imaginary + a.Imaginary*b.Real);
}
}
}
EDIT
GPU seems to be the only feasible solution; I disregard interoperability with C/C++ because I don't feel the speed up would be significant enough to coerce me to forcing interoperability on future plugins.
After looking into the available GPU options (which I've actually been examining for some time now), I've finally found what I believe is an excellent compromise. I've chosen OpenCL with the hope that most devices will support the standard by the time my program is released. OpenCLTemplate uses cloo to provide an easy-to-understand interface between .Net (for application logic) and "OpenCL C99" (for parallel code). Plugins can include OpenCL kernels for hardware acceleration alongside the standard implementation with System.Numerics.Complex for ease of integration.
I expect the number of available tutorials on writing OpenCL C99 code to grow rapidly as the standard becomes adopted by processor vendors. This keeps me from needing to enforce GPU coding on plugin developers while providing them with a well formulated language should they choose to take advantage of the option. It also means that IronPython scripts will have equal access to GPU acceleration despite being unknown until compile-time, since the code will translate directly through OpenCL.
For anyone in the future interested in integrating GPU acceleration with a .Net project, I highly recommend OpenCLTemplate. There is an admitted overhead of learning OpenCL C99. However, it is only slightly harder than learning an alternative API and will likely have better support from examples and general communities.