7

I have some image processing code that loops through two multi-dimensional byte arrays (of the same size). It takes a value from the source array, performs a calculation on it, and then stores the result in another array.

int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);

for (int x = 0; x < xSize; x++)
{                
   for (int y = 0; y < ySize; y++) 
   {                                                
      ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                    (AlphaImageData[x, y] * OneMinusAlphaValue));
   }
}

The loop currently takes ~11 ms, which I assume is mostly due to accessing the byte arrays' values, as the calculation is pretty simple (2 multiplications and 1 addition).

Is there anything I can do to speed this up? It is a time-critical part of my program and this code gets called 80-100 times per second, so any speed gain, however small, will make a difference. Also, at the moment xSize = 768 and ySize = 576, but this will increase in the future.

Update: Thanks to Guffa (see answer below), the following code saves me 4-5 ms per loop, although it is unsafe code.

int size = ResultImageData.Length;
int counter = 0;
unsafe
{
    fixed (byte* r = ResultImageData, c = CurrentImageData, a = AlphaImageData)
    {
        while (size > 0)
        {
            *(r + counter) = (byte)(*(c + counter) * AlphaValue + 
                                    *(a + counter) * OneMinusAlphaValue);
            counter++;
            size--;
        }
    }
}
Matt Warren
  • @Andrew Arnott: Although completely correct, also completely useless. ;) – Guffa Mar 31 '09 at 18:21
  • Can you update with the timing for the code in the accepted answer? It would be interesting to know how much difference it makes saving the 3 additions of counter per loop iteration. – Peter Mortensen Aug 03 '09 at 17:43
  • If you look at the "UPDATE" section of my question it's there. The code based on the accepted answer takes 6-7 ms per loop, compared to ~11 ms for the original code. Does this help or are you asking about some other version of the code? – Matt Warren Aug 04 '09 at 08:16

14 Answers

5

These are all independent calculations, so if you have a multicore CPU you should be able to gain some benefit by parallelizing the calculation. Note that you'd need to keep the threads around and just hand them work to do, since the overhead of thread creation would probably make this slower rather than faster if the threads are recreated each time.
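For illustration (a sketch, not part of the original answer): the Task Parallel Library, which arrived in .NET 4 and so postdates this question, keeps a pool of reusable worker threads for exactly this purpose, and the row loop parallelizes directly:

// Sketch: parallelize over rows so each worker gets a meaningful chunk of
// work; the TPL hands iterations to pooled threads instead of creating
// a thread per call. Requires using System.Threading.Tasks (.NET 4+).
Parallel.For(0, xSize, x =>
{
    for (int y = 0; y < ySize; y++)
    {
        ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                       (AlphaImageData[x, y] * OneMinusAlphaValue));
    }
});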

The other thing that may work is farming the work off to the graphics processor. Look at this question for some ideas, for example, using Accelerator.

tvanfosson
5

To get any real speedup for this code you would need to use pointers to access the arrays; that removes all the index calculations and bounds checking.

int size = ResultImageData.Length;
unsafe 
{
   fixed(byte* rp = ResultImageData, cp = CurrentImageData, ap = AlphaImageData) 
   {
      byte* r = rp;
      byte* c = cp;
      byte* a = ap;
      while (size > 0) 
      {
         *r = (byte)(*c * AlphaValue + *a * OneMinusAlphaValue);
         r++;
         c++;
         a++;
         size--;
      }
   }
}

Edit:
Fixed variables can't be changed, so I added code to copy the pointers to new pointers that can be changed.

Guffa
  • If AlphaValue and OneMinusAlphaValue are floating point, you might get a further increase in speed by using fixed-point maths (a sketch follows these comments). Conversions from float to integer can be surprisingly expensive. – Bids Mar 31 '09 at 19:43
  • When I try this code I get the following errors: Cannot assign to 'r' because it is a 'fixed variable' Cannot assign to 'c' because it is a 'fixed variable' Cannot assign to 'a' because it is a 'fixed variable' Am I doing something wrong? (I've already added the "/unsafe" flag to my project) – Matt Warren Apr 01 '09 at 15:16
  • It's okay I fixed it, see the updated code in my edited question. Thanks for the help your method is 4-5ms faster, which makes a big difference. – Matt Warren Apr 01 '09 at 16:48
  • I see... You can't change the fixed variables, so you have to copy them to pointers. I'll update the code. – Guffa Apr 01 '09 at 16:52
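To illustrate Bids's fixed-point suggestion (a sketch, assuming AlphaValue is a float in the range [0, 1]): scale the alpha to an integer once, and the pointer loop from the answer above stays entirely in integer arithmetic, with a shift replacing the float-to-byte conversion per pixel.

// Hypothetical 8.8 fixed-point variant: alpha becomes an integer in 0..256
int alphaFixed = (int)(AlphaValue * 256 + 0.5f);
int oneMinusAlphaFixed = 256 - alphaFixed;

while (size > 0)
{
    // both products fit comfortably in an int: max 255 * 256 = 65,280
    *r = (byte)((*c * alphaFixed + *a * oneMinusAlphaFixed) >> 8);
    r++;
    c++;
    a++;
    size--;
}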
4

An option would be to use unsafe code: fixing the arrays in memory and using pointer operations. I doubt the speed increase will be that dramatic, though.

One note: how are you timing? If you are using DateTime, be aware that it has poor resolution (on the order of 15 ms); System.Diagnostics.Stopwatch is more precise. You should add an outer loop and repeat the operation, say, ten times -- I bet the result is less than 110 ms.

for (int outer = 0; outer < 10; ++outer)
{
    for (int x = 0; x < xSize; x++)
    {                
         for (int y = 0; y < ySize; y++) 
         {                                                
              ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                             (AlphaImageData[x, y] * OneMinusAlphaValue));
         }
    }
}
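A sketch of that measurement with the higher-resolution Stopwatch (Blend here is a hypothetical stand-in for the nested loop above):

Stopwatch sw = Stopwatch.StartNew();  // System.Diagnostics
for (int outer = 0; outer < 10; ++outer)
{
    Blend();  // hypothetical: runs the x/y loop once
}
sw.Stop();
Console.WriteLine("{0:F2} ms per iteration", sw.Elapsed.TotalMilliseconds / 10);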
Paul Ruane
4

Since each cell in the matrix appears to be calculated entirely independently of the others, you may want to look into having more than one thread handle this. To avoid the cost of creating threads, you could have a thread pool.

If the matrix is of sufficient size, it could be a very nice speed gain. On the other hand, if it is too small, it may not help (it may even hurt). Worth a try, though.

An example (scheduling one row per work item, so the per-task overhead doesn't swamp the tiny per-pixel calculation) could look like this:

int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);
int pendingRows = xSize;

using (ManualResetEvent done = new ManualResetEvent(false))
{
    for (int x = 0; x < xSize; x++)
    {
        int row = x; // copy for the closure; the loop variable changes underneath it
        ThreadPool.QueueUserWorkItem(delegate
        {
            // process one whole row
            for (int y = 0; y < ySize; y++)
            {
                ResultImageData[row, y] = (byte)((CurrentImageData[row, y] * AlphaValue) +
                    (AlphaImageData[row, y] * OneMinusAlphaValue));
            }
            if (Interlocked.Decrement(ref pendingRows) == 0)
                done.Set(); // last row finished
        });
    }

    done.WaitOne(); // wait until all scheduled rows are complete
}

EDIT: Michael Meadows mentioned in a comment that PLINQ may be a suitable alternative: http://msdn.microsoft.com/en-us/magazine/cc163329.aspx

Evan Teran
  • Surely the schedule method should not block -- merely add the work item to the pool's work queue? Otherwise you would add n items, block until they are all finished, add another n, &c. Conceivably the pool queue could have a threshold at which it blocks, but this should be much larger than n. – Paul Ruane Mar 31 '09 at 17:59
  • @Paul: you are absolutely right; I'll edit to provide a more realistic usage. – Evan Teran Mar 31 '09 at 18:04
  • Using a thread pool for work which obviously fits a work stealing framework (such as PLinq or Task Parallel Lib) or OpenMP is a very poor idea. A thread pool will work considerably slower if the system is busy or there is only one processor available. – codekaizen Apr 01 '09 at 21:26
  • You could always get the number of CPUs to pick the optimal number of threads. In addition to this, OpenMP uses pthreads under the hood, which will likely have similar performance to a correctly used pool. – Evan Teran Apr 02 '09 at 02:06
3

I'd recommend running a few empty tests to figure out what your theoretical bounds are. For example, take out the calculation from inside the loop and see how much time is saved. Try replacing the double loop with a single loop that runs the same number of times and see how much time that saves. Then you can be sure you are going down the right path for optimization (the two paths I see are flattening the double loop into a single loop and working with the multiplication [maybe using a lookup table would be faster]).

Chris Shaffer
3

Just real quick, you can get an optimization by looping in reverse and comparing against 0. Most CPUs have a fast op for comparison to 0.

E.g.

int xSize = ResultImageData.GetLength(0) - 1;
int ySize = ResultImageData.GetLength(1) - 1; // minor optimization suggested by commenter

for (int x = xSize; x >= 0; --x)
{
     for (int y = ySize; y >= 0; --y)
     {                                                
          ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                         (AlphaImageData[x, y] * OneMinusAlphaValue));
     }
}

See http://dotnetperls.com/Content/Decrement-Optimization.aspx

torial
3

You are probably suffering from bounds checking. As Jon Skeet states, a jagged array instead of a multidimensional one (that is, data[][] instead of data[,]) will be faster, strange as that may seem.

The compiler will optimize

for (int i = 0; i < data.Length; i++) 

by eliminating the per-element range check. But it's a special case; it won't do the same for GetLength().

For the same reason, caching or hoisting the Length property (putting it in a variable like xSize) also used to be a bad thing, though I haven't been able to verify that with Framework 3.5.
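A sketch of what the jagged-array version might look like (resultData, currentData, and alphaData here are hypothetical byte[][] counterparts of the original buffers, assumed to have the same dimensions):

for (int x = 0; x < resultData.Length; x++)
{
    // hoisting the rows into locals lets the JIT drop the bounds check on
    // resultRow in the inner loop, since it tests directly against Length
    byte[] resultRow = resultData[x];
    byte[] currentRow = currentData[x];
    byte[] alphaRow = alphaData[x];
    for (int y = 0; y < resultRow.Length; y++)
    {
        resultRow[y] = (byte)(currentRow[y] * AlphaValue +
                              alphaRow[y] * OneMinusAlphaValue);
    }
}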

H H
2

Try swapping the x and y for loops for a more linear memory access pattern and (thus) fewer cache misses, like so:

int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);

for (int y = 0; y < ySize; y++) 
{
    for (int x = 0; x < xSize; x++)
    {
        ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
            (AlphaImageData[x, y] * OneMinusAlphaValue));
    }
}
Jasper Bekkers
1

If you are using LockBits to get at the image buffer, you should loop through y in the outer loop and x in the inner loop, as that is how it is stored in memory (by row, not column). I would say that 11 ms is pretty darn fast, though...
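For reference, a sketch of stride-based iteration over a LockBits buffer (not Ed's code; bitmap is an assumed System.Drawing.Bitmap, and an 8 bpp format is assumed for simplicity):

Rectangle rect = new Rectangle(0, 0, bitmap.Width, bitmap.Height);
BitmapData data = bitmap.LockBits(rect, ImageLockMode.ReadWrite,
                                  PixelFormat.Format8bppIndexed);
try
{
    unsafe
    {
        byte* scan0 = (byte*)data.Scan0;
        for (int y = 0; y < data.Height; y++)    // rows outer: memory order
        {
            byte* row = scan0 + y * data.Stride; // Stride may include padding
            for (int x = 0; x < data.Width; x++)
            {
                row[x] = (byte)(row[x] * AlphaValue); // placeholder per-pixel op
            }
        }
    }
}
finally
{
    bitmap.UnlockBits(data);
}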

Ed S.
  • I don't think this will *actually* work - the array is being used with y as the "minor" co-ordinate, so regardless of the source, I believe in memory it will be [0,0], [0, 1], [0, 2] etc - which is how it's being iterated. – Jon Skeet Mar 31 '09 at 17:45
1

Does the image data have to be stored in a multi-dimensional (rectangular) array? If you use jagged arrays instead, you may well find the JIT has more optimizations available (including removing the bounds checking).

Jon Skeet
  • Jon, is there an efficient way to get data from a MD array into a jagged array (see the Buffer.BlockCopy sketch after these comments)? I've found that jagged arrays are ~1 ms faster in the loop. But it takes much longer than that to loop thru the MD array and copy each value into the jagged array. – Matt Warren Apr 01 '09 at 14:40
  • Also I can't put the data into a jagged array in the first place as the 3rd party lib that gets the data into .NET only offers an options of putting the data into a MD array. – Matt Warren Apr 01 '09 at 14:44
  • Right - in that case this answer isn't helpful to you :( I'll leave it up in case anyone else is in a similar situation but has the option of a jagged array. – Jon Skeet Apr 01 '09 at 14:54
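On the conversion question in the comments above, a sketch using Buffer.BlockCopy, which treats any primitive array (including a rectangular byte[,]) as a flat run of bytes and so can copy a whole row at a time:

byte[][] jagged = new byte[xSize][];
for (int x = 0; x < xSize; x++)
{
    jagged[x] = new byte[ySize];
    // offsets and counts are in bytes, which for byte[] equals element counts
    Buffer.BlockCopy(ResultImageData, x * ySize, jagged[x], 0, ySize);
}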
1

If CurrentImageData and/or AlphaImageData don't change every time you run your code snippet, you could store the products before running the snippet you show and avoid those multiplications in your loops.

Edit: Another thing I just thought of: sometimes int operations are quicker than byte operations. Weigh this against your processor cache utilization (you'll increase the data size considerably and stand a greater risk of a cache miss).

Les
1

At 442,368 additions and 884,736 multiplications for the calculation, I would think 11 ms is actually extremely slow on a modern CPU.

While I don't know much about the specifics of .NET, I do know that high-speed calculation is not its strong suit. In the past I've built Java apps with similar problems, and I've always used C libraries to do the image/audio processing.

Coming from a hardware perspective, you want to make sure the memory accesses are sequential, that is, step through the buffer in the order it exists in memory. You also may need to reorder this such that the compiler takes advantage of available instructions such as SIMD. How to approach this will end up being dependent on your compiler, and I can't help with VS.NET.

On an embedded DSP I would break out (AlphaImageData[x, y] * OneMinusAlphaValue) and (CurrentImageData[x, y] * AlphaValue) and use SIMD instructions to calculate buffers, possibly in parallel, before performing the addition -- perhaps doing small enough chunks to keep the buffers in cache on the CPU.

I believe anything you do will require more direct access to the memory/CPU than .NET allows.

1

You may also want to take a look at the Mono runtime and its SIMD extensions. Perhaps some of your calculations can make use of SSE acceleration, as I gather that you basically do vector calculations (I don't know up to which vector size there is acceleration for multiplication, but there is for some sizes).

(Blog post announcing Mono.Simd: http://tirania.org/blog/archive/2008/Nov-03.html)

Of course, that wouldn't work on Microsoft .NET but maybe you are interested in some experimentation.

user51710
  • Thanks for the info, but I don't think I want to go as far as adding SSE acceleration at the moment, but I'll bear it in mind. – Matt Warren Apr 01 '09 at 16:54
1

Interestingly, image data is frequently pretty similar, meaning that the calculations are likely very repetitive. Have you explored using a lookup table for the calculations? Then any time 0.8 is multiplied by 128, you simply look up the precalculated value[80, 128] = 102.4 instead of computing it. You're basically trading memory space for CPU speed, but it could work for you.

Of course, if your image data has too high a resolution (and too many significant digits), this may not be practical.
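A sketch of that idea for this particular loop (assuming AlphaValue is a floating-point blend factor that stays constant between frames): because the inputs are bytes, two 256-entry tables cover every possible input, and the per-pixel work becomes two lookups and an addition.

// rebuild these tables only when AlphaValue changes
float[] alphaTable = new float[256];
float[] oneMinusAlphaTable = new float[256];
for (int v = 0; v < 256; v++)
{
    alphaTable[v] = v * AlphaValue;
    oneMinusAlphaTable[v] = v * OneMinusAlphaValue;
}

for (int x = 0; x < xSize; x++)
{
    for (int y = 0; y < ySize; y++)
    {
        ResultImageData[x, y] = (byte)(alphaTable[CurrentImageData[x, y]] +
                                       oneMinusAlphaTable[AlphaImageData[x, y]]);
    }
}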

aronchick