15

Executive Summary: Reed's answer below is the fastest if you want to stay in C#. If you're willing to marshal to C++ (which I am), that's a faster solution.

I have two 55 MB ushort arrays in C#. I am combining them using the following loop:

float b = (float)number / 100.0f;
for (int i = 0; i < length; i++)
{
      image.DataArray[i] = 
          (ushort)(mUIHandler.image1.DataArray[i] + 
          (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
}

This code, timed with DateTime.Now calls before and after, takes 3.5 seconds to run. How can I make it faster?

EDIT: Here is some code that, I think, shows the root of the problem. When the following code is run in a brand new WPF application, I get these timing results:

Time elapsed: 00:00:00.4749156 //arrays added directly
Time elapsed: 00:00:00.5907879 //arrays contained in another class
Time elapsed: 00:00:02.8856150 //arrays accessed via accessor methods

So walking the arrays directly is much faster than reaching them when they live inside another object or container. The timings suggest that somewhere in my real code I'm going through an accessor method rather than touching the arrays directly. Even so, the fastest I seem to be able to get is about half a second. When I run the second listing in C++ compiled with icc, I get:

Run time for pointer walk: 0.0743338

In this case, then, C++ is 7x faster (using icc; I'm not sure whether the same performance can be had from msvc, as I'm less familiar with its optimizations). Is there any way to get C# near that level of C++ performance, or should I just have C# call my C++ routine?
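
One mitigation I plan to try (a sketch only, not yet timed; the names come from Listing 1 below, and b is the same scale factor as above): hoist the arrays out of the accessor calls into locals once, before the loop, so the jitter sees plain array accesses and can recognize the i < arr.Length pattern needed to elide bounds checks.

ushort[] outArr = theArrayHolder.getOutput();
ushort[] in1 = theArrayHolder.getInput1();
ushort[] in2 = theArrayHolder.getInput2();
for (int i = 0; i < outArr.Length; i++)
{
    outArr[i] = (ushort)(in1[i] + (ushort)(b * (float)in2[i]));
}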

Listing 1, C# code:

public class ArrayHolder
{
    int length;
    public ushort[] output;
    public ushort[] input1;
    public ushort[] input2;
    public ArrayHolder(int inLength)
    {
        length = inLength;
        output = new ushort[length];
        input1 = new ushort[length];
        input2 = new ushort[length];
    }

    public ushort[] getOutput() { return output; }
    public ushort[] getInput1() { return input1; }
    public ushort[] getInput2() { return input2; }
}


/// <summary>
/// Interaction logic for MainWindow.xaml
/// </summary>
public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();


        Random random = new Random();

        int length = 55 * 1024 * 1024;
        ushort[] output = new ushort[length];
        ushort[] input1 = new ushort[length];
        ushort[] input2 = new ushort[length];

        ArrayHolder theArrayHolder = new ArrayHolder(length);

        for (int i = 0; i < length; i++)
        {
            output[i] = (ushort)random.Next(0, 16384);
            input1[i] = (ushort)random.Next(0, 16384);
            input2[i] = (ushort)random.Next(0, 16384);
            theArrayHolder.getOutput()[i] = output[i];
            theArrayHolder.getInput1()[i] = input1[i];
            theArrayHolder.getInput2()[i] = input2[i];
        }

        Stopwatch stopwatch = new Stopwatch(); 
        stopwatch.Start();
        int number = 44;
        float b = (float)number / 100.0f;
        for (int i = 0; i < length; i++)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * (float)input2[i]));
        } 
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.output[i] =
                (ushort)(theArrayHolder.input1[i] +
                (ushort)(b * (float)theArrayHolder.input2[i]));
        }
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.getOutput()[i] =
                (ushort)(theArrayHolder.getInput1()[i] +
                (ushort)(b * (float)theArrayHolder.getInput2()[i]));
        }
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
    }
}

Listing 2, C++ equivalent:

// looptiming.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include <stdlib.h>
#include <windows.h>
#include <stdio.h>
#include <iostream>


int _tmain(int argc, _TCHAR* argv[])
{

    int length = 55*1024*1024;
    unsigned short* output = new unsigned short[length];
    unsigned short* input1 = new unsigned short[length];
    unsigned short* input2 = new unsigned short[length];
    unsigned short* outPtr = output;
    unsigned short* in1Ptr = input1;
    unsigned short* in2Ptr = input2;
    int i;
    const int max = 16384;
    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr){
        *outPtr = rand()%max;
        *in1Ptr = rand()%max;
        *in2Ptr = rand()%max;
    }

    LARGE_INTEGER ticksPerSecond;
    LARGE_INTEGER tick1, tick2;   // A point in time
    LARGE_INTEGER time;   // For converting tick into real time


    QueryPerformanceCounter(&tick1);

    outPtr = output;
    in1Ptr = input1;
    in2Ptr = input2;
    int number = 44;
    float b = (float)number/100.0f;


    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr){
        *outPtr = *in1Ptr + (unsigned short)((float)*in2Ptr * b);
    }
    QueryPerformanceCounter(&tick2);
    QueryPerformanceFrequency(&ticksPerSecond);

    time.QuadPart = tick2.QuadPart - tick1.QuadPart;

    std::cout << "Run time for pointer walk: " << (double)time.QuadPart/(double)ticksPerSecond.QuadPart << std::endl;

    return 0;
}

EDIT 2: Enabling /QxHost in the second example drops the time down to 0.0662714 seconds. Modifying the first loop as @Reed suggested gets me down to

Time elapsed: 00:00:00.3835017

So it's still not fast enough for a slider. That time comes from this code:

        stopwatch.Start();
        Parallel.ForEach(Partitioner.Create(0, length),
         (range) =>
         {
             for (int i = range.Item1; i < range.Item2; i++)
             {
                 output[i] =
                     (ushort)(input1[i] +
                     (ushort)(b * (float)input2[i]));
             }
         });

        stopwatch.Stop();

EDIT 3: As per @Eric Lippert's suggestion, I've rerun the C# code in release mode with no debugger attached, printing the results to a dialog. They are:

  • Simple arrays: ~0.273s
  • Contained arrays: ~0.330s
  • Accessor arrays: ~0.345s
  • Parallel arrays: ~0.190s

(these numbers are a 5-run average)

So the parallel solution is definitely faster than the 3.5 seconds I was getting before, but still well short of the 0.074 seconds of the icc-compiled C++ pointer walk. It seems, therefore, that the fastest solution is to compile in release and then marshal from C# to an icc-compiled C++ routine, which makes using a slider possible here.
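
For concreteness, the interop I have in mind is along these lines (a sketch only: "Combine.dll" and "CombineArrays" are placeholder names, and the C++ side would need a matching extern "C" __declspec(dllexport) function):

using System.Runtime.InteropServices;

static class NativeCombine
{
    // ushort[] is blittable, so the runtime pins each array and passes a raw pointer for the duration of the call
    [DllImport("Combine.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void CombineArrays(ushort[] output, ushort[] input1, ushort[] input2, int length, float b);
}

// usage: NativeCombine.CombineArrays(output, input1, input2, length, b);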

EDIT 4: Three more suggestions from @Eric Lippert: change the for loop condition from i < length to i < array.Length (so the jitter can recognize the pattern and elide the bounds checks), use doubles rather than floats, and try unsafe code.

For those three, the timing is now:

  • array.Length: ~0.274s
  • doubles, not floats: ~0.290s
  • unsafe: ~0.376s

So far, the parallel solution is the big winner, although if I could do this addition in a shader I might see a further speedup there...

Here's the additional code:

        stopwatch.Reset();

        stopwatch.Start();

        double b2 = ((double)number) / 100.0;
        for (int i = 0; i < output.Length; ++i)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b2 * (double)input2[i]));
        }

        stopwatch.Stop();
        DoubleArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds;
        stopwatch.Reset();

        stopwatch.Start();

        for (int i = 0; i < output.Length; ++i)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * input2[i]));
        }

        stopwatch.Stop();
        LengthArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds;
        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        unsafe
        {
            fixed (ushort* outPtr = output, in1Ptr = input1, in2Ptr = input2){
                ushort* outP = outPtr;
                ushort* in1P = in1Ptr;
                ushort* in2P = in2Ptr;
                for (int i = 0; i < output.Length; ++i, ++outP, ++in1P, ++in2P)
                {
                    *outP = (ushort)(*in1P + b * (float)*in2P);
                }
            }
        }

        stopwatch.Stop();
        UnsafeArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds;
        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
mmr
  • 14,781
  • 29
  • 95
  • 145
  • 2
    As an aside - If you're measuring the time it takes to run a method it's better to use the Stopwatch class. It's much more accurate than DateTime. – RichK May 12 '11 at 19:45
  • 1
    @RichK-- noted. But at 3.5 seconds, I'm not too concerned about ms accuracy. – mmr May 12 '11 at 19:46
  • 1
    can u give us a number example? – Ahmed Magdy May 12 '11 at 19:46
  • 1
    You can only process a huge chunk of data so fast. – mellamokb May 12 '11 at 19:48
  • @Pr0fess0rX-- number is between 0 and 200, integers-- gotten from a slider, under the expectation that this whole operation can be done via a slider. – mmr May 12 '11 at 19:50
  • You might want to look into various ways of parallelising the code. This seems like a good candidate. – ICR May 12 '11 at 19:51
  • Parallel is fine, but regardless, I would hit it with a few stackshots to make sure it's not doing anything unnecessary. For example, my eyes are drawn to those float-integer conversions. – Mike Dunlavey May 12 '11 at 21:13
  • 1
    I would take a look at the assembly generated. Compile the program as "release" and attach the debugger to the process *after you know that this code has been jitted*. Then examine the code in the debugger. What I would be looking for specifically is *has the jitter optimized away the bounds checks for the array?* If the answer is "no" then figure out how you can update the code so that either (1) the jitter has enough information to make this optimization, or (2) use the "fixed" keyword to obtain a raw pointer to the array and party on the pointers, thereby skipping the bounds checks. – Eric Lippert May 12 '11 at 21:22
  • ... for example, you could let b be a number in the range 0 to 2*256. Multiply by that, and divide by 256. No floats. (A sketch of this idea appears right after these comments.) – Mike Dunlavey May 12 '11 at 21:24
  • @Eric Lippert-- I'll take a look at doing this now, and running these functions via a button click or something. – mmr May 12 '11 at 21:34
  • 1
    @mmr: Oh goodness yes, if you were running these numbers inside a debugger all bets are off. The debugger does all kinds of de-optimizations to ensure that your debugging experience is pleasant. Never *ever* do performance tests inside a debugger! That's why I said you need to attach the debugger *after* the jitter has run; the jitter knows whether the process is being debugged or not and turns off optimizations if it is. – Eric Lippert May 12 '11 at 22:12
  • @Eric Lippert-- you are correct, I was being... sloppy is the nice word for it, I guess. My last edit shows the timing results of the various C# results from outside any debugging (not even an attached one) to C++. – mmr May 12 '11 at 22:24
  • 2
    A few more things to try. First off, what if you do the math in doubles rather than floats? Processors are optimized for doubles. No reason to use floats unless you are trying to save space; they save space but often do not save time. Second, what if you write the loop as "for(int i = 0; i < array.Length; ++i)" -- that looks wasteful to a C programmer as it is recalculating array.Length every time, but *the jitter can recognize this common pattern and optimize the loop*. Third, it would be interesting to know whether you get a win from using raw pointers. – Eric Lippert May 12 '11 at 22:30
  • @Eric Lippert-- I've updated my answer with your suggestions. – mmr May 12 '11 at 22:54
  • Surprised no one has mentioned this yet, but perhaps a bit of manual loop unrolling would help? Also it looks like a multiply-add operation, so SSE2 or SSE3 could definitely come to the rescue here, but C# on Windows does not support SSE intrinsics. I hear Mono has some support in this area, so that might be worth a shot if Mono can also meet your application's needs outside this one particular function. That being said, could Mono compile a native image including the SSE intrinsics and be callable from Windows .NET impl? Of course, you could do that part with C/C++ too. :) – James Dunne May 13 '11 at 01:17
  • Addendum: http://stackoverflow.com/questions/415375/using-sse-in-c-is-it-possible/467637#467637 Just as I suspected. :) – James Dunne May 13 '11 at 01:20
  • I am surprised that raw pointers are slower. I wonder what's going on there? – Eric Lippert May 13 '11 at 03:48
  • @Eric Lippert-- if it would be of general interest (or perhaps jus yours), I can put the projects up on github. It might be interesting to run on different processors, see if there's any differences there. – mmr May 13 '11 at 03:58
  • on a 32-bit machine I find raw pointers to be faster (by a smidge) than the basic array access. Interestingly, manual loop unrolling seems to have little effect, so I wonder if it's either the body of the loop or the raw data throughput that's the problem. Reading 2 and writing 1 55-million-entry arrays is probably going to depend on memory throughput and how far ahead the cache infrastructure reads. If the parallel approach manages to split the work so that there is no accidental cache-locking traffic, then I suspect the parallelism is mostly helping to cover stalls on memory access. – ShuggyCoUk May 13 '11 at 09:40
  • Changing it to a plain addition in the loop body brings it down to ~0.11 secs (0.20 is the best I could do with raw pointers in the original), so really I think it's all about the von Neumann bottleneck and ways to mitigate it. – ShuggyCoUk May 13 '11 at 09:42
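
For reference, here is a sketch of @Mike Dunlavey's integer-only idea above (my interpretation, untimed; the multiply-by-256/shift-by-8 constants follow his comment, and the rounding will differ slightly from the float version):

int scale = (number * 256) / 100; // number is 0..200, so scale is 0..512 and approximates b * 256
for (int i = 0; i < output.Length; ++i)
{
    // integer multiply, then shift right by 8 to divide by 256: no floats in the loop
    output[i] = (ushort)(input1[i] + ((input2[i] * scale) >> 8));
}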

2 Answers

21

This should be perfectly parallelizable. However, given the small amount of work being done per element, you'll need to handle this with extra care.

The proper way to do this (in .NET 4) would be to use Parallel.ForEach in conjunction with a Partitioner:

float b = (float)number / 100.0f;
Parallel.ForEach(Partitioner.Create(0, length), 
(range) =>
{
   for (int i = range.Item1; i < range.Item2; i++)
   {
      image.DataArray[i] = 
          (ushort)(mUIHandler.image1.DataArray[i] + 
          (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
   }
});

This will efficiently partition the work across available processing cores in your system, and should provide a decent speedup if you have multiple cores.

That being said, this will, at best, only speed up this operation by the number of cores in your system. If you need to speed it up more, you'll likely need to resort to a mix of parallelization and unsafe code. At that point, it might be worth thinking about alternatives to trying to present this in real time.
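
A rough sketch of that combination (untested; it assumes the project is built with /unsafe, and, as the comments below suggest, it may not actually beat the managed parallel version):

Parallel.ForEach(Partitioner.Create(0, length),
(range) =>
{
    unsafe
    {
        // pin the arrays once per partition, then walk raw pointers over just this range
        fixed (ushort* outBase = image.DataArray,
                       in1Base = mUIHandler.image1.DataArray,
                       in2Base = mUIHandler.image2.DataArray)
        {
            ushort* outP = outBase + range.Item1;
            ushort* in1P = in1Base + range.Item1;
            ushort* in2P = in2Base + range.Item1;
            for (int i = range.Item1; i < range.Item2; i++)
            {
                *outP++ = (ushort)(*in1P++ + (ushort)(b * (float)*in2P++));
            }
        }
    }
});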

Reed Copsey
  • 554,122
  • 78
  • 1,158
  • 1,373
  • @Reed-- thanks for this one. If I were to try the unsafe route, where would I start? And if I did, would it be possible to mix unsafe code and this kind of autoparallelization? – mmr May 12 '11 at 19:56
  • @mmr: Yes. You could, potentially, do this faster by using pointer manipulation. You'd use the same technique above, but then cast the arrays to a pointer, and manipulate directly. It's a small improvement (typically) over a for loop, but given the number, it might help speed things up a bit. I'd profile it, though, and only do it if it really helps... – Reed Copsey May 12 '11 at 20:09
  • @Reed-- doing this shaves a second off of the time. I'm just kind of flabbergasted by this, since C++ handles this in milliseconds. – mmr May 12 '11 at 20:27
  • @mmr: If you have C++ code that is so much faster, it doesn't do the same thing. There isn't that big a performance difference between C# and C++. I tested doing this using unsafe mode and pointers to access the data, and there was no performance difference at all. – Guffa May 12 '11 at 20:33
  • @Guffa: That's been my experience, as well. Switching to unsafe, C# very often will outperform C++ for operations like this (provided you're not taking advantage of SSE, etc) – Reed Copsey May 12 '11 at 21:02
  • @Guffa, @Reed-- see my update. I made a version of the code in C++ (compiled without processor specific optimizations or SSE; adding SSE brings the time down further), and it's just faster than the C# code by quite a bit. – mmr May 12 '11 at 21:24
  • @mmr: If you put this in an unsafe context, and wrote your C# code using the same technique as your C++ code (ie: via pointers), it'll probably be very similar... – Reed Copsey May 12 '11 at 22:45
  • @Reed: your solution is actually faster than the unsafe context, unless I did my unsafe context code wrong. – mmr May 12 '11 at 22:53
  • @mmr: I was actually thinking of mixing unsafe with my solution when I mentioned that originally - you can parallelize this in an unsafe context, and use my solution to just work via pointers instead of looping. It may not be faster, however... – Reed Copsey May 12 '11 at 22:57
  • @Reed: 0.188 (unsafe) vs 0.148 (managed). The managed parallel solution seems to be getting faster with each run, so maybe the unsafe version will also be faster as more is done. – mmr May 12 '11 at 23:05
8

Assuming you have a lot of these elements, and you're using .NET 4, you can attempt to parallelize the operation:

Parallel.For(0, length, i =>
{
    image.DataArray[i] =
        (ushort)(mUIHandler.image1.DataArray[i] +
        (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
});

Of course, that all depends on whether parallelizing this is worth it. The loop body is computationally short, and indexed array access is already fast; any gains come from the sheer number of iterations over this much data.

Tejs
  • 40,736
  • 10
  • 68
  • 86
  • This will not be very effective in practice due to the small amount of work in the loop body. See my answer for a more effective alternative. – Reed Copsey May 12 '11 at 19:52