
In the application I'm currently developing, I need to sum pretty big arrays of vectors efficiently. Here's my code:

public List<double[, ,]> normalMaps;

public double[, ,] Mix(double[] weights, double gain)
{
    int w, h;
    w = normalMaps[0].GetLength(0);
    h = normalMaps[0].GetLength(1);

    double[, ,] ret = new double[w, h, 3];
    int normcount = normalMaps.Count;

    //for (int y = 0; y < h; y++)
    Parallel.For(0, h, y =>            
    {
        for (int x = 0; x < w; x++)
        {
            for (int z = 0; z < normcount; z++) 
            {
                ret[x, y, 0] += normalMaps[z][x, y, 0] * weights[z];
                ret[x, y, 1] += normalMaps[z][x, y, 1] * weights[z];
                ret[x, y, 2] += normalMaps[z][x, y, 2] * weights[z];
            }
            ret[x, y, 0] *= gain;
            ret[x, y, 1] *= gain;
            ret[x, y, 2] *= gain;

            ret[x, y, 0] = Math.Max(-1, Math.Min(1, ret[x, y, 0]));
            ret[x, y, 1] = Math.Max(-1, Math.Min(1, ret[x, y, 1]));
            ret[x, y, 2] = Math.Max(-1, Math.Min(1, ret[x, y, 2]));

            double retnorm = Math.Sqrt(ret[x, y, 0] * ret[x, y, 0] + ret[x, y, 1] * ret[x, y, 1] + ret[x, y, 2] * ret[x, y, 2]);
            ret[x, y, 0] /= retnorm;
            ret[x, y, 1] /= retnorm;
            ret[x, y, 2] /= retnorm;

        }
    });

    return ret;
}

Now, when I try to sum seven 1024*1024 arrays of 3-component vectors, the operation takes 320 ms on my laptop. Making the code multithreaded already gave me a huge performance boost, but I need it to be even faster. How can I optimize it further? I can see that using a plain array instead of a List<> would make the code faster, but not by much. Is there really nothing left to optimize? I was also thinking about moving this to the GPU, but that's just an idea. Can somebody help me out? Thanks in advance.

Piotr Joniec

4 Answers


You'll get a big speedup once you notice that you are iterating the dimensions in an inefficient order, which causes false sharing. You are essentially parallelizing over the width instead of the height, and you may be confusing the way the arrays are stored in memory.

False sharing is not the only problem: because of how CPU caches work, you are also iterating over the data in a cache-unfriendly order.

Array definitions should usually be myArray[HEIGHT, WIDTH] so that the declaration matches the memory layout, and when iterating, the height (the first index) should be the outermost loop.

Parallel.For(0, w, x =>            
{
    for (int y = 0; y < h; y++)
    {
       ...
    }
});

That took me from 800 ms to 150 ms, with equal dimensions, just by swapping those few things.

Erti-Chris Eelmaa
  • Just swapping x and y as you wrote made the compute time go from 280ms to 130 ms. BUT. I was using the same kind of loop in many different places of my program. So with this single answer you just optimized my ENTIRE PROGRAM. THANK. YOU. – Piotr Joniec Oct 24 '14 at 14:57
  • No worries, I learned a few things myself. Check this: http://stackoverflow.com/questions/468832/why-are-multi-dimensional-arrays-in-net-slower-than-normal-arrays; the benchmark seems to indicate that a multi-dimensional array is 2x slower than a jagged array, lol. Perhaps you can even get to 75ms.. – Erti-Chris Eelmaa Oct 24 '14 at 15:21
  • @PiotrJoniec, that's what I mean by broken. – Jodrell Oct 24 '14 at 15:32
2

As you mentioned, swapping that List<> out for an array will give a noticeable performance boost.

If you switch to arrays, you could also use pointers to iterate over the values. You'll take a small performance hit for pinning the array so it doesn't get moved by the GC, but considering the size, the pros should outweigh the cons. You see this done a fair bit within the .NET Framework's source to squeeze every drop of performance they can out of hefty iterations.
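
For illustration, here's a minimal sketch of that pointer approach, assuming the maps have been flattened into plain double[] buffers of length w * h * 3; the method and variable names are made up for the example, not taken from the question:

// Hypothetical sketch: weighted accumulation over a flattened map using pinned pointers.
// Requires compiling with /unsafe; dst and src are assumed to be double[w * h * 3].
public static unsafe void Accumulate(double[] dst, double[] src, double weight)
{
    fixed (double* pDst = dst, pSrc = src) // pin both arrays so the GC cannot relocate them
    {
        double* d = pDst;
        double* s = pSrc;
        double* end = pDst + dst.Length;
        while (d < end)
        {
            *d++ += *s++ * weight; // straight pointer walk, no bounds checks
        }
    }
}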

You may also be able to use the new SIMD support for the actual calculations, but I don't know enough about the subject to give more detail. I should mention that the SIMD features in .NET aren't fully complete yet and are still in beta.
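
For reference, a rough sketch of what the System.Numerics vector types could do for the same accumulation step, again assuming flattened double[] buffers; treat it as an untested illustration of the idea rather than the final implementation:

using System.Numerics;

// Hypothetical sketch: SIMD "dst += src * weight" over a flat array.
public static void AccumulateSimd(double[] dst, double[] src, double weight)
{
    int lanes = Vector<double>.Count;   // e.g. 4 doubles per vector with AVX
    var w = new Vector<double>(weight);
    int i = 0;
    for (; i <= dst.Length - lanes; i += lanes)
    {
        var d = new Vector<double>(dst, i);
        var s = new Vector<double>(src, i);
        (d + s * w).CopyTo(dst, i);     // one vector-wide multiply-add per step
    }
    for (; i < dst.Length; i++)         // scalar tail for the leftover elements
        dst[i] += src[i] * weight;
}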

Mike Johnson
  • I've just tried out using arrays instead of List<> - calculation time went down from 320ms to 270ms. I wasn't expecting that much. Thanks for the answer, I'll have to try to use pointers just as you said. – Piotr Joniec Oct 24 '14 at 13:34

I bet you can double the speed if you swap the X and Y loops:

public double[, ,] Mix(double[] weights, double gain)
{
    int w, h;
    w = normalMaps[0].GetLength(0);
    h = normalMaps[0].GetLength(1);

    double[, ,] ret = new double[w, h, 3];
    int normcount = normalMaps.Count;

    //for (int y = 0; y < h; y++)
    Parallel.For(0, w, x =>            
    {
        for (int y = 0; y < h; y++)
        {
            ...
        }
    });

    return ret;
}

You want the innermost loop to be over the last array index, and the outermost loop over the first array index. This gives the most cache-coherent access pattern. The compiler also doesn't have to do a multiplication for every array element lookup; it can just step an offset along. (I can explain that in more detail if it would help...)
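
To make the layout argument concrete, here is a sketch of how the question's rectangular array is stored (an illustration only, not code from the question):

// For ret = new double[w, h, 3], the backing store is one linear block and the
// linear offset of ret[x, y, c] is (x * h + y) * 3 + c (the first index varies slowest).
// So with Parallel.For over x, each thread owns a contiguous h * 3 slice, and
// looping y then c inside walks that slice sequentially:
for (int y = 0; y < h; y++)
    for (int c = 0; c < 3; c++)
        ret[x, y, c] *= gain; // sequential memory access for a fixed x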

EDIT: I have two other optimizations that gain another 15%. One is to make the same change for Z; to do that, the Z loop needs to be pulled out of the main loop. This means going over the data twice, but it is still worth it. The other is to eliminate the extra lookups caused by fetching normalMaps[z] three times per element. Please do verify that the results are the same: I think it was okay to do this as a separate step, but maybe I missed something.

// Extract the Z loop. Keep Z sequential and parallelize over X instead;
// a Parallel.For over Z would have several threads adding into the same
// ret[x, y, *] cells at once.
for (int z = 0; z < normcount; z++)
{
    // Hoisting the normalMaps[z] lookup matters; the compiler isn't smart enough to do it itself.
    double[, ,] temp = normalMaps[z];
    double weight = weights[z];

    Parallel.For(0, w, x =>
    {
        for (int y = 0; y < h; y++)
        {
            ret[x, y, 0] += temp[x, y, 0] * weight;
            ret[x, y, 1] += temp[x, y, 1] * weight;
            ret[x, y, 2] += temp[x, y, 2] * weight;
        }
    });
}

Parallel.For(0, w, x =>
{
    for (int y = 0; y < h; y++)
    {
        ret[x, y, 0] *= gain;
        ret[x, y, 1] *= gain;
        ret[x, y, 2] *= gain;

        ret[x, y, 0] = Math.Max(-1, Math.Min(1, ret[x, y, 0]));
        ret[x, y, 1] = Math.Max(-1, Math.Min(1, ret[x, y, 1]));
        ret[x, y, 2] = Math.Max(-1, Math.Min(1, ret[x, y, 2]));

        double retnorm = Math.Sqrt(ret[x, y, 0] * ret[x, y, 0] + ret[x, y, 1] * ret[x, y, 1] + ret[x, y, 2] * ret[x, y, 2]);
        ret[x, y, 0] /= retnorm;
        ret[x, y, 1] /= retnorm;
        ret[x, y, 2] /= retnorm;

    }
});
Moby Disk

Try this,

private double[,,] Mix(double[][,,] normalMaps, double[] weights, double gain)
{
    var w = normalMaps[0].GetLength(0);
    var h = normalMaps[0].GetLength(1);

    var result = new double[w, h, 3];
    var mapCount = normalMaps.Length;

    Parallel.For(0, w, x =>            
    {
        for (int y = 0; y < h; y++)
        {
            OneStack(
                x,
                y,
                mapCount,
                normalMaps,
                weights,
                gain,
                result);
        }
    });
    return result;
}

private static void OneStack(
    int x,
    int y,
    int mapCount,
    double[][,,] normalMaps,
    double[] weights,
    double gain,
    double[,,] result)
{
    var weight = weights[0];
    var z0 = normalMaps[0][x, y, 0] * weight;
    var z1 = normalMaps[0][x, y, 1] * weight;
    var z2 = normalMaps[0][x, y, 2] * weight;

    for (var i = 1; i < mapCount; i++)
    {
        weight = weights[i];

        z0 += normalMaps[i][x, y, 0] * weight;
        z1 += normalMaps[i][x, y, 1] * weight;
        z2 += normalMaps[i][x, y, 2] * weight;
    }

    z0 = Math.Max(-1, Math.Min(1, z0 * gain));
    z1 = Math.Max(-1, Math.Min(1, z1 * gain));
    z2 = Math.Max(-1, Math.Min(1, z2 * gain));

    var norm = Math.Sqrt(z0 * z0 + z1 * z1 + z2 * z2);

    result[x, y, 0] = z0 / norm;
    result[x, y, 1] = z1 / norm;
    result[x, y, 2] = z2 / norm;
}

I'm anticipating an improvement because the number of assignments and accesses involving the large multi-dimensional array is minimised. Whilst this comes at the cost of some extra locals and a method call, I anticipate the cost of using the MD array to be larger. Multi-dimensional arrays are essentially broken in .NET.
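
If the multi-dimensional arrays do turn out to be the bottleneck, one alternative to experiment with (not what the question currently uses) is to keep each map as a flat double[] with a small index helper; this also makes pointer- or SIMD-style access straightforward:

// Hypothetical flat layout: one double[w * h * 3] per map instead of double[w, h, 3].
// The helper assumes the same [x, y, c] ordering as the question's code.
static int Idx(int x, int y, int c, int h)
{
    return (x * h + y) * 3 + c;
}

// Reading one component would then look like:
// double nx = flatMap[Idx(x, y, 0, h)];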

Jodrell
  • I changed my program to create arrays in the [H,W] order, not [W,H]. So I swapped some variables in your code to use the current order. It made my code slower. And also it returns different values. The Math.Max and Math.Min part. It is supposed to clip any values above 1 and below -1, and then be multiplied by the variable gain. Multiplying it FIRST results in different answers. Example: clip( [0.1, 0, 2.0] * 2 ) = [0.2, 0, 1] ||| clip( [0.1, 0, 2.0] ) * 2 = [0.2, 0, 2] – Piotr Joniec Oct 24 '14 at 15:27
  • @PiotrJoniec I suspect I've over parallelized things in the outer function. – Jodrell Oct 24 '14 at 15:33
  • @PiotrJoniec, in your example (in the question) you multiply by `gain` first. – Jodrell Oct 24 '14 at 15:34
  • Wow, I had no idea. :\ But somehow I still get different results. – Piotr Joniec Oct 24 '14 at 15:41
  • To visualize: The result I'm supposed to get: http://i.imgur.com/76LZfAt.jpg The result I'm getting from your code: http://i.imgur.com/JREA75d.jpg – Piotr Joniec Oct 24 '14 at 15:46
  • @PiotrJoniec the normalization was definitely different, the `Math.Sqrt` bit. I changed it. – Jodrell Oct 24 '14 at 16:09
  • @PiotrJoniec, I've rolled back the loop too. – Jodrell Oct 24 '14 at 16:12
  • You know what? It actually works now. From 132 ms to 67 ms (!). Thank you sooooooooo much. I've learned a lot from you! I'll write more efficient code from now on! I guess if I now make use of jagged arrays instead of multidimensional arrays, I can squeeze even more performance from here. – Piotr Joniec Oct 24 '14 at 16:40
  • The posted code doesn't compile at the first line of OneStack because var weight = weights[i]; is called before i is defined. weight also isn't updated in the loop. I'm unclear what was intended with that part or I'd fix it. – Moby Disk Oct 24 '14 at 17:38
  • @PiotrJoniec; How often do things get clipped? If it's often, I believe you can do a pretty good optimization in the mapCount for() loop: check whether each component (z0, z1, z2) has exceeded -1 or 1, and then break out of the loop. – Erti-Chris Eelmaa Oct 24 '14 at 19:26
  • @PiotrJoniec Indeed, I suspect jagged arrays would be faster, only one way to be sure. – Jodrell Oct 30 '14 at 08:27
  • @Jodrell; just a question: I did run a benchmark(the one given in the original thread: http://stackoverflow.com/questions/468832/why-are-multi-dimensional-arrays-in-net-slower-than-normal-arrays), and it indicated in no way that multidimensional array was slower. The reason I am saying this, I believe the speed you've gained is because you've essentially minimized "cache-trashing" between different processors. (False sharing), thoughts? – Erti-Chris Eelmaa Nov 03 '14 at 12:50
  • @ChrisEelmaa you are probably right to some extent, you'll note my code incorporates your change, my initial suggestion was over "parallelized". I suspect the majority of improvements come from bringing the variables close or local to the work and minimizing the number of times the function needs to reach out to a large data structure. This fits nicely with how computers work inside. How this is actually implemented by the Compiler, Jitter, microcode etc will vary but it should be easier to optimize. – Jodrell Nov 03 '14 at 13:34