18

I am working on an iPhone application that does real-time image processing. One of the earliest steps in its pipeline is to convert a BGRA image to greyscale. I tried several different methods and the difference in timing results is far greater than I had imagined possible. First I tried using C. I approximate the conversion to luminosity with (B + 2*G + R) / 4.

void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
    uchar *pIn = (uchar*) imBGRA.data;
    uchar *pLimit = pIn + imBGRA.MemSize();

    uchar *pOut = imByte.data;
    for(; pIn < pLimit; pIn+=16)   // Does four pixels at a time
    {
        unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
        pOut[0] = sumA / 4;
        unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
        pOut[1] = sumB / 4;
        unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
        pOut[2] = sumC / 4;
        unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
        pOut[3] = sumD / 4;
        pOut += 4;
    }
}

This code takes 55 ms to convert a 352x288 image. I then found some assembler code that does essentially the same thing:

void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
    uchar *pIn = (uchar*) imBGRA.data;
    uchar *pLimit = pIn + imBGRA.MemSize();

    unsigned int *pOut = (unsigned int*) imByte.data;

    for(; pIn < pLimit; pIn+=16)   // Does four pixels at a time
    {
        register unsigned int nBGRA1 asm("r4");
        register unsigned int nBGRA2 asm("r5");
        unsigned int nZero=0;
        unsigned int nSum1;
        unsigned int nSum2;
        unsigned int nPacked1;
        asm volatile(
            "ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #0]        \n"  // Load in two BGRA words
            "usad8 %[nSum1], %[nBGRA1], %[nZero]             \n"  // Add B+G+R+A
            "usad8 %[nSum2], %[nBGRA2], %[nZero]             \n"  // Add B+G+R+A
            "uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8     \n"  // Add G again
            "uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8     \n"  // Add G again
            "mov %[nPacked1], %[nSum1], LSR #2               \n"  // Div by four; init packed word
            "mov %[nSum2], %[nSum2], LSR #2                  \n"  // Div by four
            "add %[nPacked1], %[nPacked1], %[nSum2], LSL #8  \n"  // Add to packed word

            "ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #8]        \n"  // Load in two more BGRA words
            "usad8 %[nSum1], %[nBGRA1], %[nZero]             \n"  // Add B+G+R+A
            "usad8 %[nSum2], %[nBGRA2], %[nZero]             \n"  // Add B+G+R+A
            "uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8     \n"  // Add G again
            "uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8     \n"  // Add G again
            "mov %[nSum1], %[nSum1], LSR #2                  \n"  // Div by four
            "add %[nPacked1], %[nPacked1], %[nSum1], LSL #16 \n"  // Add to packed word
            "mov %[nSum2], %[nSum2], LSR #2                  \n"  // Div by four
            "add %[nPacked1], %[nPacked1], %[nSum2], LSL #24 \n"  // Add to packed word

            ///////////
            ////////////

            : [pIn]"+r" (pIn),
              [nBGRA1]"+r"(nBGRA1),
              [nBGRA2]"+r"(nBGRA2),
              [nZero]"+r"(nZero),
              [nSum1]"+r"(nSum1),
              [nSum2]"+r"(nSum2),
              [nPacked1]"+r"(nPacked1)
            :
            : "cc" );
        *pOut = nPacked1;
        pOut++;
    }
}

This function converts the same image in 12 ms, almost 5X faster! I had not programmed in assembler before, but I assumed that it would not be this much faster than C for such a simple operation. Inspired by this success I continued searching and discovered a NEON conversion example here.

void greyScaleNEON(uchar* output_data, uchar* input_data, int tot_pixels)
{
__asm__ volatile("lsr          %2, %2, #3      \n" // Iteration count = tot_pixels / 8
                 "# build the three constants: \n"
                 "mov         r4, #28          \n" // Blue channel multiplier  (~0.114 * 256)
                 "mov         r5, #151         \n" // Green channel multiplier (~0.587 * 256)
                 "mov         r6, #77          \n" // Red channel multiplier   (~0.299 * 256)
                 "vdup.8      d4, r4           \n"
                 "vdup.8      d5, r5           \n"
                 "vdup.8      d6, r6           \n"
                 "0:                           \n"
                 "# load 8 pixels:             \n"
                 "vld4.8      {d0-d3}, [%1]!   \n"
                 "# do the weighted average:   \n"
                 "vmull.u8    q7, d0, d4       \n"
                 "vmlal.u8    q7, d1, d5       \n"
                 "vmlal.u8    q7, d2, d6       \n"
                 "# shift and store:           \n"
                 "vshrn.u16   d7, q7, #8       \n" // Divide q7 by 256 and store in d7
                 "vst1.8      {d7}, [%0]!      \n"
                 "subs        %2, %2, #1       \n" // Decrement iteration count
                 "bne         0b               \n" // Repeat until the count reaches zero
                 : "+r"(output_data),              // all three operands are modified by
                 "+r"(input_data),                 // the asm, so they must be read-write
                 "+r"(tot_pixels)
                 :
                 : "r4", "r5", "r6",
                 "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d14", "d15",
                 "cc", "memory"
                 );
}

The timing results were hard to believe. It converts the same image in 1 ms: 12X faster than the assembler version and an astounding 55X faster than the C. I had no idea that such performance gains were possible. In light of this I have a few questions. First off, am I doing something terribly wrong in the C code? I still find it hard to believe that it is so slow. Second, if these results are at all accurate, in what kinds of situations can I expect to see these gains? You can probably imagine how excited I am at the prospect of making other parts of my pipeline run 55X faster. Should I be learning assembler/NEON and using them inside any loop that takes an appreciable amount of time?

Update 1: I have posted the assembler output from my C function in a text file at http://temp-share.com/show/f3Yg87jQn since it was far too large to include directly here.

Timing is done using OpenCV functions:

double duration = static_cast<double>(cv::getTickCount());
//function call
duration = static_cast<double>(cv::getTickCount()) - duration;
duration /= cv::getTickFrequency();
//getTickFrequency() returns ticks per second, so duration is now
//the elapsed time in seconds (multiply by 1000 for milliseconds)

Results

I tested several suggested improvements. First, as recommended by Viktor, I reordered the inner loop to put all fetches first. The inner loop then looked like:

for(; pIn < pLimit; pIn+=16)   // Does four pixels at a time
{
    //Jul 16, 2012 MR: Reads and writes collected
    unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
    unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
    unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
    unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
    pOut[0] = sumA / 4;
    pOut[1] = sumB / 4;
    pOut[2] = sumC / 4;
    pOut[3] = sumD / 4;
    pOut += 4;
}

This change brought processing time down to 53 ms, an improvement of 2 ms. Next, as recommended by Viktor, I changed my function to fetch whole pixels as uints. The inner loop then looked like:

unsigned int* in_int = (unsigned int*) original.data;
unsigned int* end = in_int + out_length;
uchar* out = temp.data;

for(; in_int < end; in_int+=4)   // Does four pixels at a time
{
    unsigned int pixelA = in_int[0];
    unsigned int pixelB = in_int[1];
    unsigned int pixelC = in_int[2];
    unsigned int pixelD = in_int[3];

    uchar* byteA = (uchar*)&pixelA;
    uchar* byteB = (uchar*)&pixelB;
    uchar* byteC = (uchar*)&pixelC;
    uchar* byteD = (uchar*)&pixelD;

    unsigned int sumA = byteA[0] + 2 * byteA[1] + byteA[2];
    unsigned int sumB = byteB[0] + 2 * byteB[1] + byteB[2];
    unsigned int sumC = byteC[0] + 2 * byteC[1] + byteC[2];
    unsigned int sumD = byteD[0] + 2 * byteD[1] + byteD[2];

    out[0] = sumA / 4;
    out[1] = sumB / 4;
    out[2] = sumC / 4;
    out[3] = sumD / 4;
    out += 4;
}

This modification had a dramatic effect, dropping processing time to 14 ms, a drop of 39 ms (75%). This last result is very close to the assembler performance of 11 ms. The final optimization, as recommended by rob mayoff, was to use the __restrict keyword. I added it to every pointer declaration, changing the following lines:

unsigned int* __restrict in_int = (unsigned int*) original.data;
unsigned int* end = in_int + out_length;
uchar* __restrict out = temp.data;
...
uchar* __restrict byteA = (uchar*)&pixelA;
uchar* __restrict byteB = (uchar*)&pixelB;
uchar* __restrict byteC = (uchar*)&pixelC;
uchar* __restrict byteD = (uchar*)&pixelD;
...     

These changes had no measurable effect on processing time. Thank you for all your help; I will be paying much closer attention to memory management in the future.

Hammer
  • Could you post the assembly that the compiler made? – harold Jul 16 '12 at 16:11
  • Err.. please correct me if I'm wrong, but NEON runs on separate DSP array hardware? If so, maybe that could explain why it's a tad faster? – Martin James Jul 16 '12 at 16:14
  • Same hardware, same chip. It's an instruction-set extension just like SSE or MMX. – Nils Pipenbrinck Jul 16 '12 at 16:18
  • @harold I went to get the compiler-generated assembler but it is absurdly large. I am in Xcode 4.3.2 and went to view the assembly in the assistant editor. My function in C is 23 lines long with no includes but the generated assembler is 9029 lines long. Any idea why that would be? I am letting the compiler maximize optimizations on the C code. – Hammer Jul 16 '12 at 16:34
  • Is that really all part of the same function? – harold Jul 16 '12 at 16:38
  • I'm sorry, I am completely new to assembler. I assumed that if I made a new .cpp file which only contains a function and no includes, the generated assembly would only reflect the contents of the function. Is that a good assumption? Is there a better way to isolate the relevant assembler? – Hammer Jul 16 '12 at 16:42
  • @Hammer: Can you tell us how you made these measurements? Thanks – A_nto2 Jul 16 '12 at 16:48
  • @harold, I included a link to the assembler output in the question – Hammer Jul 16 '12 at 16:52
  • @A_nto2 I am using OpenCV for much of this project. They have function calls that are useful for timing. The sequence to time a function generally goes: duration = static_cast<double>(cv::getTickCount()); function call; duration = static_cast<double>(cv::getTickCount()) - duration; duration /= cv::getTickFrequency(); // the elapsed time in seconds – Hammer Jul 16 '12 at 16:54
  • Ok I had a look at that file, the code is in there, approximately from line 72 to line 159. The rest is just .. stuff. – harold Jul 16 '12 at 17:00
  • 1
    If you really want fast processing of video, may I suggest looking to OpenGL ES 2.0 fragment shaders? The GPU is even faster at handling simple parallel operations than NEON-optimized routines. I wrote an open source framework around this: https://github.com/BradLarson/GPUImage and I have seen over 100X speedups in image processing on the iOS GPUs vs. CPU-bound C functions. I've replicated some of OpenCV's functionality, and it's my goal to bring over as much of that as I can. – Brad Larson Jul 16 '12 at 18:05
  • @BradLarson As far as I know there is no good way of getting information back out of the GPU once it goes in. Is that true? Converting to greyscale in this case is just the first step in a bigger processing pipeline. I would love to use the GPU but it was my understanding that it was only useful as the final step in a pipeline. – Hammer Jul 16 '12 at 18:55
  • @Hammer - No, you're perfectly capable of extracting data once processed. In the above-linked framework, I have a class for giving you raw bytes for your processed image in either RGBA or BGRA (slightly faster) format. All of this processing can be done offscreen, so you could replace slow portions of your pipeline with GPU-bound elements. There is some overhead with going to and from the GPU, but for raw bytes it tends to be less than the savings due to computation on images larger than 320x240 or so. – Brad Larson Jul 16 '12 at 19:14
  • Re the new "Results" section: looks like the big win might be from more efficient data loading. If you can load a 128-bit data word instead of 4x32-bit words, you may get an additional performance boost; look in the ARM NEON intrinsics manual for the datatypes: http://www.arm.com/files/pdf/neon_support_in_the_arm_compiler.pdf – comingstorm Jul 17 '12 at 01:24
  • @comingstorm how could I load a 128 bit word all at once? Just make a char[] and memcpy the data in? Is memcpy as fast as assignment? I don't think there are any 128 bit basic types I could use like I used uint above. Thanks – Hammer Jul 17 '12 at 15:30
  • Just to clarify, I mean how could I do it in C, it seems like a potential improvement. – Hammer Jul 17 '12 at 16:39
  • The white paper linked in my comment (which explains the ARM NEON intrinsics) mentions an include file `"arm_neon.h"` that defines a type `uint8x16_t`, which is probably what you want. – comingstorm Jul 17 '12 at 17:11
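
A minimal sketch of what that last comment describes (illustrative, not from the original thread; it assumes <arm_neon.h> and a uchar* named pIn pointing into the BGRA image):

#include <arm_neon.h>

// One 128-bit load brings in four BGRA pixels (16 bytes) at once.
uint8x16_t fourPixels = vld1q_u8(pIn);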

4 Answers

5

There is an explanation here concerning some of the reasons for NEON's "success": http://hilbert-space.de/?p=22

Try compiling your C code with the "-S -O3" switches to see the optimized output of the GCC compiler.

IMHO, the key to success is the optimized read/write pattern employed by both assembly versions. And NEON/MMX/other vector engines also support saturation (clamping results to 0..255 without having to use the 'unsigned ints').
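
For instance, NEON exposes saturating byte arithmetic both as instructions (vqadd.u8) and as intrinsics. A minimal sketch using the real vqadd_u8 intrinsic, with illustrative names:

#include <arm_neon.h>

// Per-byte saturating add: sums above 255 clamp to 255 instead of wrapping.
uint8x8_t add_clamped(uint8x8_t a, uint8x8_t b)
{
    return vqadd_u8(a, b);
}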

See these lines in the loop:

unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
pOut[1] = sumB / 4;
unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
pOut[2] = sumC / 4;
unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[3] = sumD / 4;
pOut +=4;

The reads and writes are really mixed. A slightly better version of the loop body would be:

// and the pIn reads can be combined into a single 4-byte fetch
sumA = pIn[0] + 2 * pIn[1] + pIn[2];
sumB = pIn[4] + 2 * pIn[5] + pIn[6];
sumC = pIn[8] + 2 * pIn[9] + pIn[10];
sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[0] = sumA / 4;
pOut[1] = sumB / 4;
pOut[2] = sumC / 4;
pOut[3] = sumD / 4;
pOut += 4;

Keep in mind that the "unsigned int sumA" line here can really mean an alloca() call (allocation on the stack), so you're wasting a lot of cycles on the temporary variable allocations (four of them per iteration).

Also, the pIn[i] indexing does only a single-byte fetch from memory. The better way to do this is to read a whole int and then extract the single bytes. To make things faster, use an "unsigned int*" to read the 4 bytes (pIn[i * 4 + 0], pIn[i * 4 + 1], pIn[i * 4 + 2], pIn[i * 4 + 3]) in one fetch.
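
A minimal sketch of that read-an-int-then-extract idea (illustrative names; assumes little-endian BGRA, so byte 0 is B, byte 1 is G, byte 2 is R):

unsigned int pixel = *in_int++;          // one 4-byte fetch covers a whole pixel
unsigned int b = pixel & 0xFF;
unsigned int g = (pixel >> 8) & 0xFF;
unsigned int r = (pixel >> 16) & 0xFF;
*out++ = (uchar)((b + 2 * g + r) / 4);   // same B+2G+R average as before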

The NEON version is clearly superior: the lines

             "# load 8 pixels:             \n"
             "vld4.8      {d0-d3}, [%1]!   \n"

and

             "#save everything in one shot   \n"
             "vst1.8      {d7}, [%0]!      \n"

save most of the time on memory access.

Viktor Latypov
  • in other words - bad code can't be fast - if you want to write fast programs in C you really need to know the underlying assembler, otherwise you're screwed ... – rezna Jul 16 '12 at 16:46
  • 2
    C is a cross-platform assembler :) . I don't think that assembly is _absolutely_ required, but knowing the way things work and where the bottlenecks might be certainly helps a lot. And here the memory access is almost always slower than arithmetic. – Viktor Latypov Jul 16 '12 at 16:49
  • 1
    Also, it should be emphasized that NEON operations work on vectors, so you often get a significant speedup just from their SIMD nature and the dedicated hardware that the ARMv6 and ARMv7 processors have for running these parallel operations. I've regularly seen reports of 3-4X speedups for NEON implementations of something that had been done before in standard hand-tuned ARM assembly. – Brad Larson Jul 16 '12 at 17:55
  • @BradLarson: Of course, my "slightly better" version is far from being optimal, but that last Multiply-Add (four lines) can obviously be done in a single (or just a few) instruction involving only registers with a decent SIMD architecture. – Viktor Latypov Jul 16 '12 at 18:01
  • @ViktorLatypov thank you, your suggestions dramatically improved the performance of my C code. I will be applying those principles to other places in my pipeline and then examining which ones could still benefit from using NEON. I obviously have a lot to learn about memory management and efficiency. – Hammer Jul 16 '12 at 19:01
  • @Hammer: Thank you for the response. Actually, I've spent the last week profiling some 3D volumetric processing code and the "memory is a bottleneck" principle is just too deep within my head right now :) There was almost a 100x performance improvement after a couple of seemingly obvious optimizations. – Viktor Latypov Jul 16 '12 at 19:06
  • @ViktorLatypov could sumA = pIn[0] + 2 * pIn[1] + pIn[2]; be faster if I use bit shifting instead of multiplication or using FMA to perform the "2 * pIn[1] + pIn[2]" part? – greenfox Dec 28 '14 at 18:32
4

If performance is critically important (as it generally is with real-time image processing), you do need to pay attention to the machine code. As you have discovered, it can be especially important to use the vector instructions (which are designed for things like real-time image processing) -- and it is hard for compilers to automatically use the vector instructions effectively.

What you should try, before committing to assembly, is using compiler intrinsics. Compiler intrinsics aren't any more portable than assembly, but they should be easier to read and write, and easier for the compiler to work with. Aside from maintainability problems, the performance problem with assembly is that it effectively turns off the optimizer (you did use the appropriate compiler flag to turn it on, right?). That is: with inline assembly, the compiler is not able to tweak register assignment and so forth, so if you don't write your entire inner loop in assembly, it may still not be as efficient as it could be.
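
For example, the question's conversion might look roughly like this with NEON intrinsics instead of inline assembly (a sketch, untested on the original hardware; it assumes <arm_neon.h> and a pixel count that is a multiple of 16):

#include <arm_neon.h>

void BGRA_To_Byte_Intrinsics(const unsigned char* in, unsigned char* out, int numPixels)
{
    const uint8x8_t two = vdup_n_u8(2);
    for (int i = 0; i < numPixels; i += 16)          // 16 pixels per iteration
    {
        uint8x16x4_t px = vld4q_u8(in + 4 * i);      // de-interleave into B, G, R, A planes
        // Widen to 16 bits while summing: sum = B + R, then sum += 2*G
        uint16x8_t lo = vaddl_u8(vget_low_u8(px.val[0]), vget_low_u8(px.val[2]));
        lo = vmlal_u8(lo, vget_low_u8(px.val[1]), two);
        uint16x8_t hi = vaddl_u8(vget_high_u8(px.val[0]), vget_high_u8(px.val[2]));
        hi = vmlal_u8(hi, vget_high_u8(px.val[1]), two);
        // Narrow back to bytes with the divide-by-4 folded into the shift
        uint8x16_t grey = vcombine_u8(vshrn_n_u16(lo, 2), vshrn_n_u16(hi, 2));
        vst1q_u8(out + i, grey);
    }
}

The compiler remains free to schedule these operations and allocate registers around them, which is exactly the flexibility inline assembly takes away.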

However, you will still be able to use your newfound assembly expertise to good effect -- as you can now inspect the assembly produced by your compiler, and figure out if it's being stupid. If so, you can tweak the C code (perhaps doing some pipelining by hand if the compiler isn't managing to), recompile it, look at the assembly output to see if the compiler is now doing what you want it to, then benchmark to see if it's actually running any faster...

If you've tried the above, and still can't provoke the compiler to do the right thing, go ahead and write your inner loop in assembly (and, again, check to see if the result is actually faster). For reasons described above, be sure to get the entire inner loop, including the loop branch.

Finally, as others have mentioned, take some time to try and figure out what "the right thing" is. Another benefit of learning your machine architecture is that it gives you a mental model of how things work -- so you will have a better chance of understanding how to put together efficient code.

comingstorm
  • Thank you, I will be using your procedure as I examine the rest of my pipeline to look for potential optimizations. – Hammer Jul 16 '12 at 19:02
3

Viktor Latypov's answer has lots of good information, but I want to point out one more thing: in your original C function, the compiler can't tell that pIn and pOut point to non-overlapping regions of memory. Now look at these lines:

pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];

The compiler has to assume that pOut[0] might be the same as pIn[4] or pIn[5] or pIn[6] (or any other pIn[x]). So it basically can't reorder any of the code in your loop.

You can tell the compiler that pIn and pOut don't overlap by declaring them __restrict:

uchar * __restrict pIn = (uchar*) imBGRA.data;
uchar * __restrict pOut = imByte.data;

This might speed up your original C version a bit.
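
The same promise can also be made at a function boundary. A sketch with a hypothetical signature (not the question's actual declaration):

// restrict-qualified parameters: the caller guarantees the buffers never alias
void BGRA_To_Byte(const uchar* __restrict pIn, uchar* __restrict pOut, size_t numBytes);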

rob mayoff
  • As explained in the update in the question I tried adding the restrict keyword but it did not produce much of an effect. Was I using it correctly? Is it a good idea to always use it if I know my pointers are non-overlapping? – Hammer Jul 16 '12 at 18:50
  • 1
    The point of adding `restrict` was to let the compiler reorder loads and stores. You already did that manually, so adding `restrict` didn't let the compiler do anything new. I only use `restrict` when the profiler tells me I need to make a function run faster. I don't just sprinkle my code with it. – rob mayoff Jul 16 '12 at 19:00
0

This is kind of a toss-up between performance and maintainability. Typically, having an app load and function quickly is very nice for the user, but there is a trade-off: now your app is fairly difficult to maintain, and the speed gains may be unwarranted. If the users of your app were complaining that it felt slow, then these optimizations are worth the effort and the loss of maintainability; but if the push came only from your own desire to speed up your app, then you should not go this far into the optimization. If you are doing these image conversions at app startup then speed is not of the essence, but if you are constantly doing them (and doing a lot of them) while the app is running then they make more sense. Only optimize the parts of the app where the user spends time and actually experiences the slowdown.

Also, looking at the assembly, it does not use division but rather only multiplications, so look into that for your C code. Another instance is that it optimizes your multiplication by 2 into two additions. This again may be another trick, as multiplication may be slower than addition on an iPhone.
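
For illustration, the strength-reduced form of the inner computation looks like this (the *2 can become an add or a shift, and the /4 a shift; any decent optimizer makes these substitutions automatically, so this is a sketch of what the compiler emits rather than a required source change):

unsigned int sum = pIn[0] + (pIn[1] << 1) + pIn[2];   // 2*G as a shift
pOut[0] = (uchar)(sum >> 2);                          // /4 as a shift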

sean
  • If the compiler isn't optimizing these trivial divisions then it is an idiot and the writers should feel bad. – harold Jul 16 '12 at 16:19
  • That is a fair point, but then again we don't have the assembly output of the OP's C code, so it is just an assumption of mine. – sean Jul 16 '12 at 16:21
  • Maybe also Debug vs. Release builds could be an issue. The compiler won't modify the asm volatile during the Debug session - but it could well affect the C code. – cli_hlt Jul 16 '12 at 16:27
  • I made sure to compile in release mode for every test. Also, the application does real-time texture tracking, and the faster it runs, the more robust the tracking is, so optimization is of utmost importance. – Hammer Jul 16 '12 at 16:47
  • 2
    Neat post; the question is: does the customer notice the 54ms delta? If you have lots of pics, then I suppose yes. On the other hand, the C code is easier to read. On the other other hand, the C code isn't actually that much easier to read. – Yusuf X Jul 16 '12 at 17:23
  • +1 just for 'On the other hand, the C code is easier to read. On the other other hand, the C code isn't actually that much easier to read' LOL! – Martin James Jul 16 '12 at 18:09