
I'm looking to optimize C++ code (mainly some for loops) using the NEON capability of computing 4 or 8 array elements at a time. Is there some kind of library or set of functions that can be used in a C++ environment?

I use Eclipse IDE in Linux Gentoo to write C++ code.

UPDATE

After reading the answers I did some tests with the software. I compiled my project with the following flags:

-O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon 

Keep in mind that this project includes extensive libraries such as open frameworks, OpenCV, and OpenNI, and everything was compiled with these flags.

To compile for the ARM board we use a Linaro toolchain cross-compiler, and GCC's version is 4.8.3.

Would you expect this to improve the performance of the project? We experienced no changes at all, which is rather weird considering all the answers I read here.

Another question: all the for loops have an apparent (fixed) number of iterations, but many of them iterate over custom data types (structs or classes). Can GCC optimize these loops even though they iterate over custom data types?

Pedro Batista
  • Thanks for the answers everyone. Although this question was (understandably) downvoted, because it isn't really a question, these answers offer great insight to those who are starting to dig into this subject :) – Pedro Batista Feb 17 '15 at 11:01
  • If you were compiling with `-O3`, then -ftree-vectorize would already be implied, so I wouldn't expect a change in performance. Additionally, if somebody has already written the code around the bottlenecks near-optimally, I wouldn't expect to see any improvement. The only way to truly tell is to look at some performance data for the code, find the hotspots, look at how they are currently implemented and what they currently compile to, and then take a decision as to whether you can do better. – James Greenhalgh Feb 17 '15 at 12:34
  • Nope, I wasn't compiling with -O3 – Pedro Batista Feb 17 '15 at 12:45
  • But it doesn't make a difference – Pedro Batista Feb 17 '15 at 14:03

7 Answers


EDIT:

From your update, you may misunderstand what the NEON processor does. It is an SIMD (Single Instruction, Multiple Data) vector processor. That means that it is very good at applying an instruction (say "multiply by 4") to several pieces of data at the same time. It also loves to do things like "add all these numbers together" or "add each element of these two lists of numbers to create a third list of numbers." So if your problem looks like those things, the NEON processor is going to be a huge help.
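To make that concrete, here is a minimal, hedged sketch using the NEON compiler intrinsics (the function and array names are made up): adding two float arrays element-wise, four lanes per instruction.

#include <arm_neon.h>

// Hypothetical sketch: out[i] = a[i] + b[i], processing 4 floats per step.
void add_arrays(const float* a, const float* b, float* out, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);      // load 4 floats from b
        vst1q_f32(out + i, vaddq_f32(va, vb));  // add lane-wise, store 4 results
    }
    for (; i < n; ++i)                          // scalar tail for any leftovers
        out[i] = a[i] + b[i];
}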

To get that benefit, you must put your data in very specific formats so that the vector processor can load multiple data simultaneously, process it in parallel, and then write it back out simultaneously. You need to organize things such that the math avoids most conditionals (because looking at the results too soon means a roundtrip to the NEON). Vector programming is a different way of thinking about your program. It's all about pipeline management.

Now, for many very common kinds of problems, the compiler automatically can work all of this out. But it's still about working with numbers, and numbers in particular formats. For example, you almost always need to get all of your numbers into a contiguous block in memory. If you're dealing with fields inside of structs and classes, the NEON can't really help you. It's not a general-purpose "do stuff in parallel" engine. It's an SIMD processor for doing parallel math.

For very high-performance systems, data format is everything. You don't take arbitrary data formats (structs, classes, etc.) and try to make them fast. You figure out the data format that will let you do the most parallel work, and you write your code around that. You make your data contiguous. You avoid memory allocation at all costs. But this isn't really something a simple StackOverflow question can address. High-performance programming is a whole skill set and a different way of thinking about things. It isn't something you get by finding the right compiler flag. As you've found, the defaults are pretty good already.
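To make the data-format point concrete, here is a hedged sketch (the type names and sizes are made up) of the usual reorganization from array-of-structs to struct-of-arrays:

// Array-of-structs: x, y and z are interleaved in memory, so loading
// four consecutive x values means gathering from strided addresses.
struct Particle { float x, y, z; };
Particle aos[1024];

// Struct-of-arrays: each field is contiguous, so four x values are a
// single vld1q_f32 away and the vectorizer has a much easier job.
struct Particles {
    float x[1024];
    float y[1024];
    float z[1024];
};
Particles soa;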

The real question you should be asking is whether you could reorganize your data so that you can use more of OpenCV. OpenCV already has lots of optimized parallel operations that will almost certainly make good use of the NEON. As much as possible, you want to keep your data in the format that OpenCV works in. That's likely where you're going to get your biggest improvements.


My experience is that it is certainly possible to hand-write NEON assembly that will beat clang and gcc (at least from a couple of years ago, though the compiler certainly continues to improve). Having excellent ARM optimization is not the same as NEON optimization. As @Mats notes, the compiler will generally do an excellent job at obvious cases, but does not always handle every case ideally, and it is certainly possible for even a lightly skilled developer to sometimes beat it, sometimes dramatically. (@wallyk is also correct that hand-tuning assembly is best saved for last; but it can still be very powerful.)

That said, given your statement "Assembly, for which I have absolutely no background, and can't possibly afford to learn at this point," then no, you should not even bother. Without first at least understanding the basics (and a few non-basics) of assembly (and specifically vectorized NEON assembly), there is no point in second-guessing the compiler. Step one of beating the compiler is knowing the target.

If you are willing to learn the target, my favorite introduction is Whirlwind Tour of ARM Assembly. That, plus some other references (below), was enough to let me beat the compiler by 2-3x in my particular problems. On the other hand, it was still insufficient: when I showed my code to an experienced NEON developer, he looked at it for about three seconds and said "you have a halt right there." Really good assembly is hard, but half-decent assembly can still be better than optimized C++. (Again, every year this gets less true as the compiler writers get better, but it can still be true.)

One side note: my experience with NEON intrinsics is that they are seldom worth the trouble. If you're going to beat the compiler, you're going to need to actually write full assembly. Most of the time, whatever intrinsic you would have used, the compiler already knew about. Where you get your power is more often in restructuring your loops to best manage your pipeline (and intrinsics don't help there). It's possible this has improved over the last couple of years, but I would expect the improving vector optimizer to outpace the value of intrinsics more than the other way around.

Rob Napier
  • Since gcc 4.8, the code generation for NEON intrinsics is pretty good, although with a few exceptions. Some users report that the compiler will generate faster code than their hand-written assembly, as the compiler understands the CPU pipeline better and the scheduler achieves better results than they do. – Charles Baylis Feb 17 '15 at 10:26
  • Very nice edit. Let me just make a few comments on that. Why should one avoid memory allocation at all costs? Isn't NEON performing operations on arrays of numbers stored in memory? Would you agree that NEON isn't compatible with a heavily object-oriented program? Does NEON-optimized code drastically decrease readability? – Pedro Batista Feb 18 '15 at 11:38
  • Memory allocation is one of the most expensive things you can possibly do in performance-critical code. (Disk and network are of course even more expensive, but their cost is so staggeringly high that they're not even part of the discussion in most cases, since they will dwarf everything else anyway.) In high performance code, even reading off-chip RAM is something you want to be careful about. Making a memory allocation that at best will lock the heap, might involve the kernel, and in the most horrific cases, swap to disk, is certainly not something you want to do in a loop. – Rob Napier Feb 18 '15 at 16:01
  • Assembly-level programming, including NEON, is all about eliminating as many abstractions as possible. Even a non-inline function call is a really expensive operation at this level that should be done with care. OOP-style dynamic dispatch is insane. If you're worried about readability, you definitely don't want to be hand-coding to the chip. "Maintainable assembly" is a waste of time. You do all the work of assembly, for none of the benefits. That's why you almost never build an assembly routine that you don't already have working perfectly in C. – Rob Napier Feb 18 '15 at 16:07
  • @CharlesBaylis If hand-written assembly code gets beaten by compiler-generated code, it means nothing other than that the programmer is simply incompetent. – Jake 'Alquimista' LEE Feb 18 '15 at 16:20

Here's a "mee too" with some blog posts from ARM. FIRST, start with the following to get the background information, including 32-bit ARM (ARMV7 and below), Aarch32 (ARMv8 32-bit ARM) and Aarch64 (ARMv8 64-bit ARM):

Second, check out the Coding for NEON series. It's a nice introduction with pictures, so things like interleaved loads make sense at a glance.
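As a taste of what those pictures describe, here is a hedged sketch of a de-interleaving load ('rgb' is a hypothetical pointer to packed pixel data): vld3q_u8 reads 48 interleaved bytes and separates the channels in a single instruction.

#include <arm_neon.h>

// Sketch: de-interleave 48 bytes of packed RGB (RGBRGB...) in one load.
// 'rgb' must point to at least 48 bytes of pixel data.
void split_channels(const uint8_t* rgb,
                    uint8x16_t& reds, uint8x16_t& greens, uint8x16_t& blues)
{
    uint8x16x3_t pixels = vld3q_u8(rgb); // one load, three de-interleaved registers
    reds   = pixels.val[0];              // 16 red bytes
    greens = pixels.val[1];              // 16 green bytes
    blues  = pixels.val[2];              // 16 blue bytes
}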

I also went on Amazon looking for some books on ARM assembly with a treatment of NEON. I could only find two, and neither book's treatment of NEON was impressive. Each reduced it to a single chapter with the obligatory matrix example.


I believe ARM intrinsics are a very good idea. The intrinsics allow you to write code for GCC, Clang and Visual C/C++ compilers. We have one code base that works for ARM Linux distros (like Linaro), some iOS devices (using -arch armv7) and Microsoft gadgets (like Windows Phone and Windows Store Apps).

jww

In addition to Wally's answer (this should probably be a comment, but I couldn't make it short enough): ARM has a team of compiler developers whose entire role is to improve the parts of GCC and Clang/LLVM that do code generation for ARM CPUs, including features that provide "auto-vectorization". I have not looked deeply into it, but from my experience on x86 code generation, I'd expect the compiler to do a decent job on anything that is relatively easy to vectorize. Some code is hard for the compiler to understand when it can vectorize or not, and may need some "encouragement" - such as unrolling loops or marking conditions as "likely" or "unlikely", etc.
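To give one concrete example of such "encouragement" (a hedged sketch; the function is made up): GCC will often decline to vectorize a loop when it cannot prove the pointers don't alias, and the GNU __restrict__ qualifier removes that doubt.

// Without __restrict__, GCC must assume dst may alias src and may refuse
// to vectorize; with it, -O3 (plus -mfpu=neon on ARMv7) can emit NEON code.
void scale(float* __restrict__ dst, const float* __restrict__ src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * 4.0f;
}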

Disclaimer: I work for ARM, but have very little to do with the compilers or even CPUs, as I work for the group that does graphics (where I have some involvement with compilers for the GPUs in the OpenCL part of the GPU driver).

Edit:

Performance, and the use of various instruction extensions, really depends on EXACTLY what the code is doing. I'd expect that libraries such as OpenCV already do a fair amount of clever stuff in their code (such as hand-written assembler, compiler intrinsics, and code generally designed to allow the compiler to do a good job), so it may not really give you much improvement. I'm not a computer vision expert, so I can't really comment on exactly how much such work is done in OpenCV, but I'd certainly expect the "hottest" points of the code to have been fairly well optimised already.

Also, profile your application. Don't just fiddle with optimisation flags; measure its performance and use a profiling tool (e.g. the Linux "perf" tool) to measure WHERE your code is spending time. Then see what can be done to that particular code. Is it possible to write a more parallel version of it? Can the compiler help, or do you need to write assembler? Is there a different algorithm that does the same thing but in a better way, etc, etc...

Although tweaking compiler options CAN help, and often does, it tends to give tens of percent, whereas a change in algorithm can often lead to code that is 10 or 100 times faster - assuming, of course, your algorithm can be improved!

Understanding what part of your application is taking the time, however, is KEY. There's no point in changing things to make the code that takes 5% of the time 10% faster, when a change somewhere else could make a piece of code that is 30 or 60% of the total time 20% faster. Or in optimising some math routine when 80% of the time is spent reading a file, where making the buffer twice the size would make the whole thing twice as fast...

Mats Petersson
  • I've edited my reply, but I'm not sure how much help it REALLY is, besides the old mantra of "Profile first, then optimise". – Mats Petersson Feb 17 '15 at 20:54
  • We've done extensive performance tests and are aware of which processes are the heaviest on the processor. The target of a possible optimization is one of the most resource-consuming processes in the application. – Pedro Batista Feb 18 '15 at 11:16
  • It is basically a 2-step cycle. First, iterate through all the pixels of an image and see if the pixel value satisfies a condition (is pixel > x?). Then, if it does, store those coordinates in a separate index array. After this, run through that index list again to do some math. I believe this process is a valid target for a serious optimization. – Pedro Batista Feb 18 '15 at 11:17
  • Have you already moved your tests (like pixel > x) to a vectorized routine like OpenCV's threshold()? You should exhaust the vectorized tools that OpenCV gives you before you consider writing your own NEON to handle it. – Rob Napier Feb 18 '15 at 16:11
  • My point was rather that "optimisation is very dependent on what exactly you are doing". `if (pixel > x) store(coordinate)` doesn't seem very easy in itself to parallelize. However, if the condition `(pixel > x)` is rare (e.g. you are searching a rather large image of mostly indistinct stuff for something very distinct), one could easily write code that compares a number of pixels with x in a more parallel fashion and only later figures out exactly which coordinates are the right ones. – Mats Petersson Feb 18 '15 at 21:00

If you have access to a reasonably modern GCC (GCC 4.8 and upwards) I would recommend giving intrinsics a go. The NEON intrinsics are a set of functions that the compiler knows about, which can be used from C or C++ programs to generate NEON/Advanced SIMD instructions. To gain access to them in your program, it is necessary to #include <arm_neon.h>. Exhaustive documentation of all available intrinsics is available at http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf , but you may find more user-friendly tutorials elsewhere online.
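For a flavour of the style, here is a hedged sketch (the names are made up) of a multiply-accumulate kernel, out[i] += a * in[i], four floats at a time:

#include <arm_neon.h>

// Sketch: out[i] += a * in[i] using vmlaq_f32 (d = a + b * c per lane).
void axpy(float a, const float* in, float* out, int n)
{
    float32x4_t va = vdupq_n_f32(a);           // broadcast a into all 4 lanes
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t vin  = vld1q_f32(in + i);  // load 4 inputs
        float32x4_t vout = vld1q_f32(out + i); // load 4 accumulators
        vout = vmlaq_f32(vout, vin, va);       // vout += vin * va, lane-wise
        vst1q_f32(out + i, vout);              // store 4 results
    }
    for (; i < n; ++i)                         // scalar tail
        out[i] += a * in[i];
}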

Advice on this site is generally against NEON intrinsics, and certainly there are GCC versions which have done a poor job of implementing them, but recent versions do reasonably well (and if you spot bad code generation, please do raise it as a bug: https://gcc.gnu.org/bugzilla/ )

They are an easy way to program to the NEON/Advanced SIMD instruction set, and the performance you can achieve is often rather good. They are also "portable", in that when you move to an AArch64 system, a superset of the intrinsics you can use from ARMv7-A are available. They are also portable across implementations of the ARM architecture, which can vary in their performance characteristics, but which the compiler will model for performance tuning.

The principal benefit of the NEON intrinsics over hand-written assembly is that the compiler can understand them when performing its various optimization passes. By contrast, hand-written assembler is an opaque block to GCC and will not be optimized. On the other hand, expert assembler programmers can often beat the compiler's register allocation policies, particularly when using instructions which write to or read from multiple consecutive registers.

James Greenhalgh
  • Thanks for your answer :) There is just a little confusion keeping me from completely understanding your answer. Do you call the code that is generated automatically by the compiler "intrinsics"? Basically, what exactly are the intrinsics? – Pedro Batista Feb 17 '15 at 10:41
  • It does indeed! Now I think I'm starting to understand. My main source of confusion was that when I looked at the OpenCV source code, some functions are written inside an #ifdef NEON block, and there I saw them manually implementing some loops using those functions. – Pedro Batista Feb 17 '15 at 10:59
  • The true power of NEON is unleashed by direct register allocation and smart unrolling, which are simply not possible via intrinsics. Although it is possible to write halfway-decent performing NEON code with intrinsics, it requires comprehensive knowledge about how the SIMD instructions work, and this can only be acquired by learning assembly. And once that's acquired, writing in intrinsics is very inconvenient compared to assembly. That's the reason why you won't find much good intrinsics code in the first place. – Jake 'Alquimista' LEE Feb 18 '15 at 10:22

Although a long time has passed since I submitted this question, I realize that it still gathers some interest, so I decided to tell what I ended up doing regarding this.

My main goal was to optimize a for loop which was the bottleneck of the project. So, since I don't know anything about Assembly, I decided to give NEON intrinsics a go. I ended up with a 40-50% performance gain (in this loop alone), and a significant overall improvement in the performance of the whole project.

The code does some math to transform a bunch of raw distance data into distance to a plane, in millimetres. I use some constants (like _constant05 and _fXtoZ) that are not defined here; they are just constant values defined elsewhere. As you can see, I'm doing the math for 4 elements at a time. Talk about real parallelization :)

// Requires #include <arm_neon.h> (and the OpenCV headers for cv::Mat).
unsigned short* frameData = frame.ptr<unsigned short>(_depthLimits.y, _depthLimits.x);

unsigned short step = _runWidth - _actWidth; // because a ROI is being processed, not the whole image

cv::Mat distToPlaneMat = cv::Mat::zeros(_runHeight, _runWidth, CV_32F);

float* fltPtr = distToPlaneMat.ptr<float>(_depthLimits.y, _depthLimits.x); // a pointer to the start of the data

for(unsigned short y = _depthLimits.y; y < _depthLimits.y + _depthLimits.height; y++)
{
    // Note: this assumes the ROI width is a multiple of 4.
    for (unsigned short x = _depthLimits.x; x < _depthLimits.x + _depthLimits.width - 1; x += 4)
    {
        float32x4_t projX = {(float)x, (float)(x + 1), (float)(x + 2), (float)(x + 3)};
        float32x4_t projY = {(float)y, (float)y, (float)y, (float)y};

        uint16x4_t framePixels = vld1_u16(frameData); // load 4 raw depth values

        float32x4_t floatFramePixels = {(float)framePixels[0], (float)framePixels[1], (float)framePixels[2], (float)framePixels[3]};

        // fNormalizedY = _constant05 - projY * _yResInv  (vmlsq_f32(a, b, c) computes a - b * c)
        float32x4_t fNormalizedY = vmlsq_f32(_constant05, projY, _yResInv);

        // fNormalizedX = projX * _xResInv - _constant05
        float32x4_t auxfNormalizedX = vmulq_f32(projX, _xResInv);
        float32x4_t fNormalizedX = vsubq_f32(auxfNormalizedX, _constant05);

        // Project to real-world coordinates
        float32x4_t realWorldX = vmulq_f32(fNormalizedX, floatFramePixels);
        realWorldX = vmulq_f32(realWorldX, _fXtoZ);

        float32x4_t realWorldY = vmulq_f32(fNormalizedY, floatFramePixels);
        realWorldY = vmulq_f32(realWorldY, _fYtoZ);

        float32x4_t realWorldZ = floatFramePixels;

        // Translate by the plane's reference point
        realWorldX = vsubq_f32(realWorldX, _tlVecX);
        realWorldY = vsubq_f32(realWorldY, _tlVecY);
        realWorldZ = vsubq_f32(realWorldZ, _tlVecZ);

        // Dot product with the plane normal gives the distance to the plane
        float32x4_t distAuxX, distAuxY, distAuxZ;

        distAuxX = vmulq_f32(realWorldX, _xPlane);
        distAuxY = vmulq_f32(realWorldY, _yPlane);
        distAuxZ = vmulq_f32(realWorldZ, _zPlane);

        float32x4_t distToPlane = vaddq_f32(distAuxX, distAuxY);
        distToPlane = vaddq_f32(distToPlane, distAuxZ);

        // Store the 4 results lane by lane
        *fltPtr = (float) distToPlane[0];
        *(fltPtr + 1) = (float) distToPlane[1];
        *(fltPtr + 2) = (float) distToPlane[2];
        *(fltPtr + 3) = (float) distToPlane[3];

        frameData += 4;
        fltPtr += 4;
    }
    frameData += step;
    fltPtr += step;
}
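As a side note, the four lane-by-lane stores at the end could most likely be collapsed into a single vst1q_f32, and the uint16-to-float conversion could stay in registers via vmovl_u16 and vcvtq_f32_u32. A hedged, untested sketch of just those steps:

// Untested sketch of an alternative load/convert/store (same names as above):
uint16x4_t framePixels = vld1_u16(frameData);              // 4 x u16
uint32x4_t widePixels = vmovl_u16(framePixels);            // widen to 4 x u32
float32x4_t floatFramePixels = vcvtq_f32_u32(widePixels);  // convert to 4 x f32

// ... the same math as in the loop above ...

vst1q_f32(fltPtr, distToPlane);                            // store all 4 lanes at once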
Pedro Batista

If you don't want to mess with assembly code at all, then tweak the compiler flags to maximally optimize for speed. Given the proper ARM target, gcc should do this, provided the number of loop iterations is apparent.

To check gcc code generation, request assembly output by adding the -S flag.

If after several tries (of reading the gcc documentation and tweaking flags) you still can't get it to produce the code you want, then take the assembly output and edit it to your satisfaction.
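For instance (a hedged sketch; the file and function names are made up), a loop like the following can be compiled with -S and its output inspected for NEON mnemonics:

// loop.cpp -- a candidate for auto-vectorization.
// Compile to assembly with the question's flags and inspect loop.s for
// NEON instructions (e.g. vld1.32 / vadd.f32 / vst1.32 on ARMv7):
//   g++ -O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=hard -S loop.cpp -o loop.s
void add(float* __restrict__ out, const float* __restrict__ a,
         const float* __restrict__ b, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}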


Beware of premature optimization. The proper development order is to get the code functional, then see if it needs optimization. Only when the code is stable does it make sense to do so.

wallyk
  • So, what you are saying is that if I give the right instructions to the GCC compiler, it will generate NEON code for me. Otherwise, it is really necessary to dig into assembly development, am I correct? – Pedro Batista Feb 16 '15 at 18:27
  • @PedroBatista: For well supported architectures such as ARM, gcc's code generator has been thoroughly developed and usually produces extremely good code. For oddball CPUs, there is more hit and miss in code generation. Study the [ARM compilation option flags](https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html) and you'll gain some insight into how much has been done. – wallyk Feb 16 '15 at 19:37
  • What do you consider an "oddball CPU"? – Pedro Batista Feb 17 '15 at 11:06
  • While there might be some use for NEON intrinsics, auto-vectorization is utterly useless. It's not that compilers are still immature, but it's simply an impossibility. – Jake 'Alquimista' LEE Feb 25 '15 at 11:28

Play with some minimal assembly examples on QEMU to understand the instructions

The following setup does not have many examples yet, but it serves as a neat playground:

The examples run on QEMU user mode, which dispenses with extra hardware, and GDB works just fine.

The asserts are done through the C standard library.

You should be able to easily extend that setup with new instructions as you learn them.

ARM intrinsics in particular were asked about at: Is there a good reference for ARM Neon intrinsics?

Ciro Santilli