What is the most performant method for setting a subset of sequential items in a large array?

Question

I am working with camera streams. I bring in 1,228,800 bytes per frame, so efficiency is crucial and nanoseconds per byte add up quickly.

I've come up with some example code below to describe the problem as succinctly as possible without seeming too contrived.

There are plenty of inefficiencies in this example code such as defining variables inside of loops, or dividing the brightness value instead of just using a composite value. These aren't my concern, and are just there to make the example simpler.

What I need advice on is the most performant method in C# for setting 3 sequential values at some determined location in a very large array, such as in the case below where I'm setting BGR to 255 while skipping the 4th byte.

Edit: To clarify, the concerning issue is where I'm reindexing Output for each index that is being set. It seems like there should potentially be some method for not traversing the entire array for each value if I already have the location of the previous item.

    // Colors are stored as 4 bytes: BGRX where X is always 0
    public byte[] Input = new byte[640 * 480 * 4];
    public byte[] Output = new byte[640 * 480 * 4];

    public int Threshold = 180;

    public void ProcessImage() {
        for (int i = 0; i < Input.Length; i += 4) {
            var brightness = (Input[i] + Input[i + 1] + Input[i + 2]) / 3; // some value under 255

            if (brightness > Threshold) {
                // What is the most efficient way possible to do this?
                Output[i] = 255 - Input[i];
                Output[i + 1] = 255 - Input[i + 1];
                Output[i + 2] = 255 - Input[i + 2];
            }
            else {
                Output[i] = Input[i];
                Output[i + 1] = Input[i + 1];
                Output[i + 2] = Input[i + 2];
            }
        }
    }

Figuring out how to remove the (difficult to predict) branch in the loop (`if (brightness > Threshold)`) would probably yield the most efficiency. — spender, Jun 17 '16 at 11:56
That's not how you calculate brightness. If you can't get the camera to spit out a bitmap so you can use a built-in .NET class like ColorMatrix then use an image processing library like Emgu CV. — Hans Passant, Jun 17 '16 at 11:56
As I said in my question, there are plenty of inefficiencies in the code in order to make the example as simple as possible. Please answer the question rather than critiquing code I already identified as contrived. — Jim Yarbro, Jun 17 '16 at 11:58
The obvious way is to go `unsafe` and use [pointers](http://stackoverflow.com/a/537722/15498). Because for one, every access to `Output` is going to require a bounds check, since although *you* know that `Input` and `Output` are the same size, the code can't assume that. Also, many of the `Input` accesses will also incur bounds checks since *you* know that the length is always a multiple of 4 and therefore that accessing data at `i+1` and `i+2` is safe, the code doesn't. — Damien_The_Unbeliever, Jun 17 '16 at 12:13
Maybe you can use the GPU to speed up the process. Try to use CUDAfy.net fot it (there are others, but is the one I have used). I think it will improve the performance significantly — Ignacio, Jun 17 '16 at 12:37
You know, I think I might just do that, thanks! It's unrelated to the question but definitely related to the overall project so thanks for dropping that comment here. — Jim Yarbro, Jun 17 '16 at 12:39

score 3 · Accepted Answer · answered Jun 17 '16 at 12:31

This (untested, and unsafe) code should be faster, if all you care about is speed:

public void ProcessImage()
{
    int ilength = Input.Length;
    Debug.Assert(ilength == Output.Length);
    Debug.Assert(ilength%4 == 0);
    unsafe
    {
        GCHandle pinned1 = GCHandle.Alloc(Input, GCHandleType.Pinned);
        byte* input = (byte*)pinned1.AddrOfPinnedObject();
        GCHandle pinned2 = GCHandle.Alloc(Input, GCHandleType.Pinned);
        byte* output = (byte*)pinned2.AddrOfPinnedObject();
        for (int i = 0; i < ilength; i += 4)
        {
            var brightness = (*(input) + *(input  + 1) + *(input + 2)) / 3; 
            if (brightness > Threshold)
            {
                // What is the most efficient way possible to do this?
                (*(output)) = (byte)(255 - *(input));
                (*(output+1)) = (byte)(255 - *(input+1));
                (*(output+2)) = (byte)(255 - *(input+2));
            }
            else
            {
                (*(output)) = *(input);
                (*(output + 1)) = *(input + 1);
                (*(output + 2)) = *(input + 2);
            }
            input += 4;
            output += 4;
        }

        pinned1.Free();
        pinned2.Free();
    }
}

Note that I've incorporate the necessary assumptions at the top of the function. I'd suggest you always do this, but whether you prefer Debug.Assert or some other form of validation is up to you.

Thanks! There's a couple new concepts in there for me to study so it will be a little time before I mark this as the answer or not. — Jim Yarbro, Jun 17 '16 at 12:41
This flies. I benchmarked a few thousand passes of the original code at 1142ms, this code at 444ms, and combined with a block copy so the else clause could be skipped, at 390ms. — spender, Jun 17 '16 at 12:47
I ended up using fixed() blocks instead of the GCHandles but if I understand correctly, that's just syntactic sugar. In any case, your answer led directly to me reducing lag by 3ms per frame currently and I'm not even done implementing throughout the project. Thanks! — Jim Yarbro, Jun 17 '16 at 14:04
@JimYarbro - yes, `fixed` is better. I'm coming down with something at the moment, so I'm only firing on two cylinders. — Damien_The_Unbeliever, Jun 17 '16 at 16:34

score 1 · Answer 2 · answered Jun 17 '16 at 12:42

If you're happy to carry the 4th byte through, it would be quicker to copy Input to Output first with a block copy, then not to perform the else clause of the branch:

    Buffer.BlockCopy(Input,0,Output,0,Input.Length);
    for (int i = 0; i < Input.Length; i += 4) {
        var brightness = (Input[i] + Input[i + 1] + Input[i + 2]) / 3;      
        if (brightness > Threshold) {
            Output[i] = (byte)(255 - Input[i]);
            Output[i + 1] =  (byte)(255 - Input[i + 1]);
            Output[i + 2] =  (byte)(255 - Input[i + 2]);
        }
    }

score 0 · Answer 3 · edited May 23 '17 at 12:31

In terms of the most performant way of setting a single value to multiple array indicies in c#, I think you're looking at it. There's no non-looping way to set the same value to multiple indicies. See How can I assign a value to multiple array indices at once without looping?

If it helps, there's no need for the else statement where you set the 3 indicies to 0. default(byte) is already zero, so every index in the Ouput[] array will initialize to 0.

As a side note, defining variables inside of loops vs outside of loops has no effect on the resulting IL. See Is it better to declare a variable inside or outside a loop?

EDIT: To add on to the comment above, you can use unsafe methods. See https://stackoverflow.com/a/5375552/3290789 and http://www.gutgames.com/post/Using-Unsafe-Code-for-Faster-Image-Manipulation.aspx

Thanks, but assigning the same value to multiple indices isn't really my concern. Perhaps this example really was too contrived to begin with since it seems to throw people off so easily. I edited the question to provide clarification, thanks. — Jim Yarbro, Jun 17 '16 at 12:04

What is the most performant method for setting a subset of sequential items in a large array?

3 Answers3