How to optimize a method that transforms a byte[] to a 4d tensor (or array)?

Question

I am working on a program that removes a user's background in real-time using a webcam.

To do the background subtraction, I am using an ML model. The input of this model is a 4D tensor in the shape BCHW (batch, channel, height, width), where the RGB values range between 0 and 1. I am only using one image at the same time, so batch = 1, and it effectively makes it a 3D tensor.

So, I have to write some code to transform a byte array, the input image from the webcam, to the tensor.

Say you have an byte[] like this: RGB RGB RGB RGB RGB RGB RGB RGB RGB,

then the code should transform it to: RRRRRRRRR GGGGGGGGG BBBBBBBBB.

To do this I have come up with the following method:

public Tensor<float> ImageToTensor(byte[] pixels, int height, int width)
{
    var tensorSize = 1 * 3 * height * width;

    float[] data = new float[tensorSize];

    var channelSize = width * height;
    var redChannel = 0;
    var greenChannel = channelSize;
    var blueChannel = (channelSize * 2);

    Parallel.For(0, height, y =>
    {
        int rowStart = y * width * 3;
        int rowWidth = width * 3;

        var row = new Span<byte>(pixels, rowStart, rowWidth);

        for (int i = 0; i < row.Length; i += 3)
        {
            byte red = row[i];
            byte green = row[i + 1];
            byte blue = row[i + 2];

            var index = i / 3 + y * width;

            data[redChannel + index] = red / 255f;
            data[greenChannel + index] = green / 255f;
            data[blueChannel + index] = blue / 255f;
        }
    });

    Memory<float> memory = new Memory<float>(data);

    return new DenseTensor<float>(memory, new[] { 1, 3, height, width });
}

I have benchmarked this with an 1920 x 1080 image using Benchmarkdotnet, and it takes about 5 ms.

While this is not a lot, the ML model takes about 8 ms, and I also have a post-processing step.

The goal is to run the video stream at 60 FPS. This means the whole process can only take 16.6ms. Since the ML model takes 8ms, I only have about 8 to 9 ms left for the pre-processing and post-processing steps.

The post and pre-processing steps are roughly the same, but the post-processing step is a bit more complex and takes about double the time. So, I am currently at 5ms + 8ms + 10ms = 23ms.

That's 6 - 7ms too much.

So, my question is, how do I optimize the code above? I have already tried multiple things, but nothing really seems to improve it by a lot.

Improved code:

public Tensor<float> ImageToTensor(byte[] pixels, int height, int width)
{
    var tensorSize = 1 * 3 * height * width;

    float[] data = new float[tensorSize];

    var channelSize = width * height;
    var redChannel = 0;
    var greenChannel = channelSize;
    var blueChannel = (channelSize * 2);

    int floatSlots = Vector<float>.Count;

    Parallel.For(0, height, y =>
    {
        int rowStart = y * width * 3;
        int rowWidth = width * 3;

        var row = new Span<byte>(pixels, rowStart, rowWidth);

        int numOfVectors = row.Length / (floatSlots * 3);
        int ceiling = numOfVectors * (floatSlots * 3);

        var reds = new float[floatSlots];
        var greens = new float[floatSlots];
        var blues = new float[floatSlots];

        for (int i = 0; i < ceiling; i += (floatSlots * 3))
        {
            for (int j = 0; j < (3 * floatSlots); j += 3)
            {
                reds[j / 3] = row[j] / 255f;
                greens[j / 3] = row[j + 1] / 255f;
                blues[j / 3] = row[j + 2] / 255f;
            }

            var vecRed = new Vector<float>(reds);
            var vecGreen = new Vector<float>(greens);
            var vecBlue = new Vector<float>(blues);

            vecRed.CopyTo(data, i + redChannel);
            vecGreen.CopyTo(data, i + greenChannel);
            vecBlue.CopyTo(data, i + blueChannel);
        }

        for (int i = ceiling; i < row.Length; i += 3)
        {
            byte red = row[i];
            byte green = row[i + 1];
            byte blue = row[i + 2];

            var index = i / 3 + y * width;

            data[redChannel + index] = red / 255f;
            data[greenChannel + index] = green / 255f;
            data[blueChannel + index] = blue / 255f;
        }
    });

    Memory<float> memory = new Memory<float>(data);

    return new DenseTensor<float>(memory, new[] { 1, 3, height, width });
}

I have improved my code by using SIMD Vectors and this made the code 20% faster. It went from 5ms to 4ms.

score 1 · Accepted Answer · answered May 10 '22 at 21:30

Your code looks fairly good, so I would not expect any miracles of performance, but there are a few things you could try.

Instead of incrementing i by 3, I would suggest incrementing by one and multiplying by 3 inside the method, that would let you get rid of the divide, and multiplication is a lot faster than division.

You could also consider SIMD intrinstics, that might help quite a bit, but I have no idea how you should write it. You might also consider looking for optimized implementations, like Intel performance Primitives or OpenCV. These should hopefully already be highly optimized, but might require a .Net wrapper or pInvoke calls. Even if you do not want to use a third party solution, it might be worth testing to get an idea of the upper bound of the performance.

You might also consider using c++ instead. While most .net code benefit little from a straight c++ port in my experience, the c++ compiler has a wider range of optimization available that might help for tight loops like this.

If nothing else works you might consider just throwing hardware at the problem. I.e. more and faster cores. This might be a cheaper then spending lots of time on optimization, at least if this is a specialized implementation not intended for mass-market.

In regards to incrementing `i` by 1 instead of 3, it really doesn't matter, JIT is smart enough to eliminate the division entirely. In fact, the only real division that the resulting assembly calls is `divss` for the float division, which could be avoided by multiplying by `1f / 255f` instead of dividing by `255f` — Petrusion, May 11 '22 at 01:22
Thanks for your suggestions. I have added an improved version of my code that uses SIMD Vectors. This made it 20% faster. It now takes 4ms instead of 5ms. I will leave it at that for now unless someone else might have an idea on how to improve it even further — Luuk Wuijster, May 11 '22 at 13:33

How to optimize a method that transforms a byte[] to a 4d tensor (or array)?

Improved code:

1 Answers1