I am working on an iPhone application that does real time image processing. One of the earliest steps in its pipeline is to convert a BGRA image to greyscale. I tried several different methods and the difference in timing results is far greater than I had imagined possible. First I tried using C. I approximate the conversion to luminosity by adding B+2*G+R /4
void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
uchar *pIn = (uchar*) imBGRA.data;
uchar *pLimit = pIn + imBGRA.MemSize();
uchar *pOut = imByte.data;
for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
{
unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
pOut[1] = sumB / 4;
unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
pOut[2] = sumC / 4;
unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[3] = sumD / 4;
pOut +=4;
}
}
This code takes 55 ms to convert a 352x288 image. I then found some assembler code that does essentially the same thing
void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
uchar *pIn = (uchar*) imBGRA.data;
uchar *pLimit = pIn + imBGRA.MemSize();
unsigned int *pOut = (unsigned int*) imByte.data;
for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
{
register unsigned int nBGRA1 asm("r4");
register unsigned int nBGRA2 asm("r5");
unsigned int nZero=0;
unsigned int nSum1;
unsigned int nSum2;
unsigned int nPacked1;
asm volatile(
"ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #0] \n" // Load in two BGRA words
"usad8 %[nSum1], %[nBGRA1], %[nZero] \n" // Add R+G+B+A
"usad8 %[nSum2], %[nBGRA2], %[nZero] \n" // Add R+G+B+A
"uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8 \n" // Add G again
"uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8 \n" // Add G again
"mov %[nPacked1], %[nSum1], LSR #2 \n" // Init packed word
"mov %[nSum2], %[nSum2], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum2], LSL #8 \n" // Add to packed word
"ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #8] \n" // Load in two more BGRA words
"usad8 %[nSum1], %[nBGRA1], %[nZero] \n" // Add R+G+B+A
"usad8 %[nSum2], %[nBGRA2], %[nZero] \n" // Add R+G+B+A
"uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8 \n" // Add G again
"uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8 \n" // Add G again
"mov %[nSum1], %[nSum1], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum1], LSL #16 \n" // Add to packed word
"mov %[nSum2], %[nSum2], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum2], LSL #24 \n" // Add to packed word
///////////
////////////
: [pIn]"+r" (pIn),
[nBGRA1]"+r"(nBGRA1),
[nBGRA2]"+r"(nBGRA2),
[nZero]"+r"(nZero),
[nSum1]"+r"(nSum1),
[nSum2]"+r"(nSum2),
[nPacked1]"+r"(nPacked1)
:
: "cc" );
*pOut = nPacked1;
pOut++;
}
}
This function converts the same image in 12ms, almost 5X faster! I have not programmed in assembler before but I assumed that it would not be this much faster than C for such a simple operation. Inspired by this success I continued searching and discovered a NEON conversion example here.
void greyScaleNEON(uchar* output_data, uchar* input_data, int tot_pixels)
{
__asm__ volatile("lsr %2, %2, #3 \n"
"# build the three constants: \n"
"mov r4, #28 \n" // Blue channel multiplier
"mov r5, #151 \n" // Green channel multiplier
"mov r6, #77 \n" // Red channel multiplier
"vdup.8 d4, r4 \n"
"vdup.8 d5, r5 \n"
"vdup.8 d6, r6 \n"
"0: \n"
"# load 8 pixels: \n"
"vld4.8 {d0-d3}, [%1]! \n"
"# do the weight average: \n"
"vmull.u8 q7, d0, d4 \n"
"vmlal.u8 q7, d1, d5 \n"
"vmlal.u8 q7, d2, d6 \n"
"# shift and store: \n"
"vshrn.u16 d7, q7, #8 \n" // Divide q3 by 256 and store in the d7
"vst1.8 {d7}, [%0]! \n"
"subs %2, %2, #1 \n" // Decrement iteration count
"bne 0b \n" // Repeat unil iteration count is not zero
:
: "r"(output_data),
"r"(input_data),
"r"(tot_pixels)
: "r4", "r5", "r6"
);
}
The timing results were hard to believe. It converts the same image in 1 ms. 12X faster than assembler and an astounding 55X faster than C. I had no idea that such performance gains were possible. In light of this I have a few questions. First off, am I doing something terribly wrong in the C code? I still find it hard to believe that it is so slow. Second, if these results are at all accurate, in what kinds of situations can I expect to see these gains? You can probably imagine how excited I am at the prospect of making other parts of my pipeline run 55X faster. Should I be learning assembler/NEON and using them inside any loop that takes an appreciable amount of time?
Update 1: I have posted the assembler output from my C function in a text file at http://temp-share.com/show/f3Yg87jQn It was far too large to include directly here.
Timing is done using OpenCV functions.
double duration = static_cast<double>(cv::getTickCount());
//function call
duration = static_cast<double>(cv::getTickCount())-duration;
duration /= cv::getTickFrequency();
//duration should now be elapsed time in ms
Results
I tested several suggested improvements. First, as recommended by Viktor I reordered the inner loop to put all fetches first. The inner loop then looked like.
for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
{
//Jul 16, 2012 MR: Read and writes collected
sumA = pIn[0] + 2 * pIn[1] + pIn[2];
sumB = pIn[4] + 2 * pIn[5] + pIn[6];
sumC = pIn[8] + 2 * pIn[9] + pIn[10];
sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut +=4;
pOut[0] = sumA / 4;
pOut[1] = sumB / 4;
pOut[2] = sumC / 4;
pOut[3] = sumD / 4;
}
This change brought processing time down to 53ms an improvement of 2ms. Next as recommended by Victor I changed my function to fetch as uint. The inner loop then looked like
unsigned int* in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
uchar* out = temp.data;
for(; in_int < end; in_int+=4) // Does four pixels at a time
{
unsigned int pixelA = in_int[0];
unsigned int pixelB = in_int[1];
unsigned int pixelC = in_int[2];
unsigned int pixelD = in_int[3];
uchar* byteA = (uchar*)&pixelA;
uchar* byteB = (uchar*)&pixelB;
uchar* byteC = (uchar*)&pixelC;
uchar* byteD = (uchar*)&pixelD;
unsigned int sumA = byteA[0] + 2 * byteA[1] + byteA[2];
unsigned int sumB = byteB[0] + 2 * byteB[1] + byteB[2];
unsigned int sumC = byteC[0] + 2 * byteC[1] + byteC[2];
unsigned int sumD = byteD[0] + 2 * byteD[1] + byteD[2];
out[0] = sumA / 4;
out[1] = sumB / 4;
out[2] = sumC / 4;
out[3] = sumD / 4;
out +=4;
}
This modification had a dramatic effect, dropping processing time to 14ms, a drop of 39ms (75%). This last result is very close the the assembler performance of 11ms. The final optimization as recommended by rob was to include the __restrict keyword. I added it in front of every pointer declaration changing the following lines
__restrict unsigned int* in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
__restrict uchar* out = temp.data;
...
__restrict uchar* byteA = (uchar*)&pixelA;
__restrict uchar* byteB = (uchar*)&pixelB;
__restrict uchar* byteC = (uchar*)&pixelC;
__restrict uchar* byteD = (uchar*)&pixelD;
...
These changes had no measurable effect on processing time. Thank you for all your help, I will be paying much closer attention to memory management in the future.