I am working on a program that removes a user's background in real-time using a webcam.
To do the background subtraction, I am using an ML model. The model's input is a 4D tensor in the shape BCHW (batch, channel, height, width), where the RGB values range between 0 and 1. I only process one image at a time, so batch = 1, which effectively makes it a 3D tensor.
So I have to write some code to transform a byte array (the input image from the webcam) into that tensor.
Say you have a byte[] like this:

RGB RGB RGB RGB RGB RGB RGB RGB RGB

Then the code should transform it to:

RRRRRRRRR GGGGGGGGG BBBBBBBBB

In other words, the value of pixel (x, y) for channel c ends up at index c * height * width + y * width + x in the planar output. To do this, I have come up with the following method:
using System;
using System.Threading.Tasks;
using Microsoft.ML.OnnxRuntime.Tensors; // Tensor<T> / DenseTensor<T> (adjust if you use a different tensor package)

public Tensor<float> ImageToTensor(byte[] pixels, int height, int width)
{
    var tensorSize = 1 * 3 * height * width;
    float[] data = new float[tensorSize];

    // Start offsets of the R, G and B planes in the CHW output.
    var channelSize = width * height;
    var redChannel = 0;
    var greenChannel = channelSize;
    var blueChannel = channelSize * 2;

    // Each image row is de-interleaved independently, one row per iteration.
    Parallel.For(0, height, y =>
    {
        int rowStart = y * width * 3;
        int rowWidth = width * 3;
        var row = new Span<byte>(pixels, rowStart, rowWidth);
        for (int i = 0; i < row.Length; i += 3)
        {
            byte red = row[i];
            byte green = row[i + 1];
            byte blue = row[i + 2];

            // Pixel index within a single colour plane.
            var index = i / 3 + y * width;
            data[redChannel + index] = red / 255f;
            data[greenChannel + index] = green / 255f;
            data[blueChannel + index] = blue / 255f;
        }
    });

    Memory<float> memory = new Memory<float>(data);
    return new DenseTensor<float>(memory, new[] { 1, 3, height, width });
}
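For context, this is roughly how the tensor is used per frame (a minimal sketch: GetCurrentFrame() and the input name "input" are placeholders for my real capture code and the model's actual input name, and session is assumed to be a Microsoft.ML.OnnxRuntime.InferenceSession):

// Sketch only: GetCurrentFrame() stands in for the real webcam capture,
// which returns a tightly packed 24-bit RGB buffer for the current frame.
byte[] frame = GetCurrentFrame();                        // length == 1920 * 1080 * 3
Tensor<float> tensor = ImageToTensor(frame, 1080, 1920);
var inputs = new[] { NamedOnnxValue.CreateFromTensor("input", tensor) };
using var results = session.Run(inputs);                 // run the background-removal model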
I have benchmarked this with a 1920 x 1080 image using BenchmarkDotNet, and it takes about 5 ms.
While this is not a lot, the ML model takes about 8 ms, and I also have a post-processing step.
The goal is to run the video stream at 60 FPS, which means one frame can take at most 16.6 ms. Since the ML model takes 8 ms, I only have about 8 to 9 ms left for the pre-processing and post-processing steps.
The pre- and post-processing steps are roughly the same, but the post-processing step is a bit more complex and takes about double the time. So I am currently at 5 ms + 8 ms + 10 ms = 23 ms, which is 6 to 7 ms too much.
So, my question is: how do I optimize the code above? I have already tried multiple things, but nothing really seems to improve it much.
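For reference, the benchmark setup looks roughly like this (a minimal sketch; the class name and the synthetic test frame are illustrative, not my exact harness):

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ImageToTensorBenchmark
{
    private byte[] _pixels;

    [GlobalSetup]
    public void Setup()
    {
        // A synthetic 1920 x 1080 packed-RGB frame stands in for a webcam capture.
        _pixels = new byte[1920 * 1080 * 3];
        new Random(42).NextBytes(_pixels);
    }

    [Benchmark]
    public Tensor<float> Convert() => ImageToTensor(_pixels, 1080, 1920); // the method above, assumed in scope
}

// Run with: BenchmarkRunner.Run<ImageToTensorBenchmark>();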
Improved code:
// Requires: using System.Numerics; for Vector<T>.
public Tensor<float> ImageToTensor(byte[] pixels, int height, int width)
{
    var tensorSize = 1 * 3 * height * width;
    float[] data = new float[tensorSize];

    var channelSize = width * height;
    var redChannel = 0;
    var greenChannel = channelSize;
    var blueChannel = channelSize * 2;

    // Number of floats a single SIMD register holds (e.g. 8 with AVX).
    int floatSlots = Vector<float>.Count;

    Parallel.For(0, height, y =>
    {
        int rowStart = y * width * 3;
        int rowWidth = width * 3;
        var row = new Span<byte>(pixels, rowStart, rowWidth);

        // Largest multiple of (floatSlots * 3) bytes that fits in the row.
        int numOfVectors = row.Length / (floatSlots * 3);
        int ceiling = numOfVectors * (floatSlots * 3);

        // Scratch buffers, reused for every vector chunk in this row.
        var reds = new float[floatSlots];
        var greens = new float[floatSlots];
        var blues = new float[floatSlots];

        for (int i = 0; i < ceiling; i += floatSlots * 3)
        {
            // De-interleave one register's worth of pixels into the scratch buffers.
            for (int j = 0; j < 3 * floatSlots; j += 3)
            {
                reds[j / 3] = row[i + j] / 255f;
                greens[j / 3] = row[i + j + 1] / 255f;
                blues[j / 3] = row[i + j + 2] / 255f;
            }
            var vecRed = new Vector<float>(reds);
            var vecGreen = new Vector<float>(greens);
            var vecBlue = new Vector<float>(blues);

            // i counts bytes (3 per pixel); convert to a pixel index in the plane.
            int index = y * width + i / 3;
            vecRed.CopyTo(data, redChannel + index);
            vecGreen.CopyTo(data, greenChannel + index);
            vecBlue.CopyTo(data, blueChannel + index);
        }

        // Scalar tail for the pixels that do not fill a whole vector.
        for (int i = ceiling; i < row.Length; i += 3)
        {
            byte red = row[i];
            byte green = row[i + 1];
            byte blue = row[i + 2];
            var index = i / 3 + y * width;
            data[redChannel + index] = red / 255f;
            data[greenChannel + index] = green / 255f;
            data[blueChannel + index] = blue / 255f;
        }
    });

    Memory<float> memory = new Memory<float>(data);
    return new DenseTensor<float>(memory, new[] { 1, 3, height, width });
}
I have improved my code by using SIMD vectors, and this made it about 20% faster: it went from 5 ms to 4 ms.
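One caveat with the Vector<T> approach (a quick check, not part of the method itself): Vector<float>.Count depends on the hardware, and Vector<T> silently falls back to scalar code when SIMD is not available, so it is worth verifying acceleration on the target machine:

using System;
using System.Numerics;

// Without hardware acceleration, Vector<T> runs as plain scalar code
// and the SIMD rewrite gains nothing.
Console.WriteLine($"Accelerated: {Vector.IsHardwareAccelerated}");
Console.WriteLine($"Vector<float>.Count: {Vector<float>.Count}"); // e.g. 8 with AVX2, 4 with SSE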