3

I'm working with CT images, so I need to operate 3D arrays a lot, but found that C# deals with arrays extremely slow. I wrote a simple BFS test, based on https://stackoverflow.com/a/1056091:

struct Vector
{
    public int X, Y, Z;
}
int width = 512, height = 512, depth = 200;
public static int bfs(int[, ,] matrix)
{
    bool[, ,] visited = new bool[width, height, depth];
    int sum = 0;
    Queue<Vector> queue = new Queue<Vector>();
    queue.Enqueue(new Vector { X = 0, Y = 0, Z = 0 });
    while (queue.Count > 0)
    {
        Vector c = queue.Dequeue();
        int cx = c.X;
        int cy = c.Y;
        int cz = c.Z;
        sum += matrix[cx, cy, cz];
        int x, y, z;

        if ((x = cx - 1) >= 0 && !visited[x, cy, cz])
        { queue.Enqueue(new Vector { X = x, Y = cy, Z = cz }); visited[x, cy, cz] = true; }
        if ((y = cy - 1) >= 0 && !visited[cx, y, cz])
        { queue.Enqueue(new Vector { X = cx, Y = y, Z = cz }); visited[cx, y, cz] = true; }
        if ((z = cz - 1) >= 0 && !visited[cx, cy, z])
        { queue.Enqueue(new Vector { X = cx, Y = cy, Z = z }); visited[cx, cy, z] = true; }
        if ((x = cx + 1) < width && !visited[x, cy, cz])
        { queue.Enqueue(new Vector { X = x, Y = cy, Z = cz }); visited[x, cy, cz] = true; }
        if ((y = cy + 1) < height && !visited[cx, y, cz])
        { queue.Enqueue(new Vector { X = cx, Y = y, Z = cz }); visited[cx, y, cz] = true; }
        if ((z = cz + 1) < depth && !visited[cx, cy, z])
        { queue.Enqueue(new Vector { X = cx, Y = cy, Z = z }); visited[cx, cy, z] = true; }
    }
    return sum;
}
public static void TestTime<T, TR>(Func<T, TR> action, T obj, int iterations)
{
    Stopwatch stopwatch = Stopwatch.StartNew();
    for (int i = 0; i < iterations; ++i)
        action(obj);
    Console.WriteLine(action.Method.Name + " took " + stopwatch.Elapsed);
}
static void Main(string[] args)
{
    Random random = new Random();
    int[, ,] m1 = new int[width, height, depth];
    for (int x = 0; x < width; ++x)
        for (int y = 0; y < height; ++y)
            for (int z = 0; z < depth; ++z)
                m1[x, y, z] = random.Next();
    const int iterations = 1;
    TestTime<int[, ,], int>(bfs, m1, iterations);
    TestTime<int[, ,], int>(bfs, m1, iterations);
}

This takes ~23 secs (one iteration). Compiled with VS2012, "Release"d, optimized. Tried to write analogous C++ code (with stack allocated arrays), it takes ~0,08sec. DFS (with Stack instead of Queue) runs slower. 1D or jagged arrays give almost same results. I should note that continuous access (as in Why are multi-dimensional arrays in .NET slower than normal arrays?) runs fine, 100 iterations of [512x512x200] array in ~26 secs.

Why non-continuous array access is so slow in C#? Is there any way to speed it up?

Community
  • 1
  • 1
MykolasB
  • 47
  • 6
  • 5
    Run in a profiler and see if you spot any 'hot' areas. – leppie Jan 29 '15 at 11:20
  • 3
    http://www.yankeerino.com/cpp_vs_csharp_arrays.bhs – H W Jan 29 '15 at 11:29
  • 2
    you seem to measure allocation and the queue and arrays together. What happens if you write benchmarks for array access alone? – DrKoch Jan 29 '15 at 11:31
  • 2
    Do you really need 3 `Queue`s? Try using a single `Queue>`. – Mathias Becher Jan 29 '15 at 11:39
  • You can see results of a profiling session [here](http://oi59.tinypic.com/azb86a.jpg) – oleksii Jan 29 '15 at 11:53
  • @DrKoch How to write benchmarks for array access alone? I can exclude allocation, but how can I exclude queue operations on BFS? I've said linear access is OK. – MykolasB Jan 29 '15 at 11:58
  • I meant a (slightly) different algorithm which simply reads and writes array cells - just to compare it with the C++ version. – DrKoch Jan 29 '15 at 12:06
  • @MykolasB So it's the queues. – Mathias Becher Jan 29 '15 at 12:40
  • @DrKoch [pastebin.com/5EhQ32bd](http://pastebin.com/5EhQ32bd) is the C++ code I've tested – MykolasB Jan 29 '15 at 12:50
  • 8
    A difference of 250x should make you very suspicious. Your benchmark is simply invalid. – usr Jan 29 '15 at 13:43
  • 4
    Your C++ and C# code is sufficiently different. In C++ you have one queue with tuples. Why not do the same in C#? – Berin Loritsch Feb 02 '15 at 17:50
  • @BerinLoritsch I wrote BFS with Tuples firstly, but ILDASM showed there is a call for getting items from Tuple. This version runs faster – MykolasB Feb 03 '15 at 18:03
  • Your multidimensional array takes up 200 MB (512 * 512 * 200 * 4(bytes in an int)). Do you have any idea what's happening with the garbage collector? How much RAM are you running this with? Are you warming up the runtime before doing your benchmark? All of these surrounding issues might be artificially raising the time it takes to perform. – Berin Loritsch Feb 03 '15 at 18:14
  • As other people have pointed out, your question isn't valid because your premise is faulty. The code compares apples and oranges. – George Stocker Feb 03 '15 at 18:18
  • @BerinLoritsch How can I debug garbage collector? 310MB of RAM is used (270MB when running first iteration). I measuring 2 iterations for warming up, but timings are alomst the same – MykolasB Feb 03 '15 at 18:45
  • @GeorgeStocker What so you mean by apples and oranges? I'm comparing continuous arrays access and and array access with BFS, and facing inadequate computation time because of different access ways – MykolasB Feb 03 '15 at 18:51
  • @MykolasB You're not just doing array access in the C# code; you're enqueueing and doing array access; so either your title is wrong, or your code is; also; Berin brings up several points that are worth noting. – George Stocker Feb 03 '15 at 18:54
  • @GeorgeStocker How can I write code to access arrays non-cotinuously? Randomly? Generation of random numbers would have larger time impact than queueing points. BTW, I wrote answer for Berin – MykolasB Feb 03 '15 at 19:08
  • 2
    2 times is not enough to warmup: http://blogs.perpetuumsoft.com/dotnet/learn-how-to-create-correct-c-benchmarks/ sometimes there's just in time recompiling and that takes a few iterations. – Berin Loritsch Feb 03 '15 at 19:49
  • Also make sure you GC.Collect() and GC.WaitForFinalizers() after warming up. The best way to test memory use is to profile it. Look for unnecessary allocations. – Berin Loritsch Feb 03 '15 at 19:55
  • @BerinLoritsch Why `GC.Collect()` and `.WaitForPendingFinalizers()` can affect computation speed? I haven't spot any difference? On the other hand, it seems there is a speed up of few second after good warming up. But I want it to be at least 100x. I wouldn't have written SO question if the difference between C++ was 10x. I guess there should be compiler or hardcoded settings. – MykolasB Feb 04 '15 at 19:05
  • The garbage collector could still be finalizing dead objects in the background after calling GC.Collect(), so that just eliminates the residual effects. The warmup period lets you really see the difference that changes to an algorithm makes. I'm wondering if you can handle the visitor flagging better. – Berin Loritsch Feb 04 '15 at 19:49
  • If anyone will read this, there was a bug in my C++ code when checking `visited` flags (it should be `visited[x][y][z]` instead of `visited[x, y, z]`). Now C++ version runs in ~6 secs. – MykolasB Mar 26 '15 at 14:38

1 Answers1

0

I changed to using list+struct to see if it helped. It halved the time. I changed the multi-dimension array to to use a single dimension and hand-multiplied the indexer out. If anything, that made it worse.

struct Q
{
    public int X, Y, Z;
}
public static int bfs(int[, ,] matrix)
{
    int d1 = matrix.GetLength(0), d2 = matrix.GetLength(1), d3 = matrix.GetLength(2);
    var visited = new bool[d1, d2, d3];
    int sum = 0;
    var q = new List<Q>();
    q.Add(new Q());
    while (q.Count > 0)
    {
        var last = q[q.Count - 1];
        q.RemoveAt(q.Count - 1);
        int cx = last.X;
        int cy = last.Y;
        int cz = last.Z;
        sum += matrix[cx, cy, cz];
        int x, y, z;

        if ((x = cx - 1) >= 0 && !visited[x, cy, cz])
        { q.Add(new Q{X = x, Y = cy, Z = cz}); visited[x, cy, cz] = true; 
...
Nikolay Kostov
  • 16,433
  • 23
  • 85
  • 123
Brannon
  • 5,324
  • 4
  • 35
  • 83
  • I admit using struct should be better, but I want at least 100x speedup :) – MykolasB Feb 03 '15 at 18:08
  • BTW, your List is imitating Stack, so its DFS. DFS is even worse (confirming it's a problem of array access). Changing Stack to List doesn't have time impact – MykolasB Feb 03 '15 at 19:25
  • I've tried my code with struct, got worse results. But I'll update my question because of complains – MykolasB Feb 03 '15 at 20:11