18

I have been developing C++ code for augmented reality on ARM devices, and optimizing that code is very important for keeping a good frame rate. To push efficiency as far as possible, I think it is worth gathering general tips that make life easier for the compiler and reduce the number of cycles the program needs. Any suggestions are welcome.

1- Avoid high-cost instructions: division, square root, sin, cos

  • Use logical shifts to divide or multiply by powers of 2.
  • Multiply by the inverse when possible.
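A minimal sketch of both tricks follows. The function names are hypothetical and only for illustration. Note that compilers usually perform the shift substitution themselves when the divisor is a compile-time constant, that the shift form is only exact for unsigned (or non-negative) values, and that multiplying by a precomputed inverse can differ from a true division in the last bit of a float.

```cpp
#include <cstdint>

// Divide an unsigned value by 2^k with a logical right shift.
// For unsigned operands and k < 32 this equals x / (1u << k).
inline std::uint32_t div_pow2(std::uint32_t x, unsigned k) {
    return x >> k;
}

// Multiply an unsigned value by 2^k with a left shift (k < 32).
inline std::uint32_t mul_pow2(std::uint32_t x, unsigned k) {
    return x << k;
}

// Scale every element by 1/divisor: one division up front,
// then only multiplications inside the loop.
inline void scale_by_inverse(float* data, int n, float divisor) {
    const float inv = 1.0f / divisor;
    for (int i = 0; i < n; ++i) {
        data[i] *= inv;
    }
}
```

Whether the hand-written form beats what the compiler already emits is worth checking in the disassembly before committing to it.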

2- Optimize inner "for" loops: they are a bottleneck, so we should avoid doing many calculations inside them, especially divisions and square roots.
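As an illustration of keeping expensive operations out of the inner loop, here is a sketch of a hypothetical `normalize` helper (not taken from any particular library). The square root and the division happen once, before the loop, and the loop body is reduced to a single multiply per element.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scale v to unit length. Hypothetical helper for illustration.
inline void normalize(std::vector<float>& v) {
    float sum_sq = 0.0f;
    for (float x : v) {
        sum_sq += x * x;
    }
    if (sum_sq == 0.0f) {
        return;  // zero vector: nothing to normalize
    }
    // One sqrt and one division, both outside the inner loop.
    const float inv_len = 1.0f / std::sqrt(sum_sq);
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] *= inv_len;  // only a multiply per element
    }
}
```

An optimizing compiler may hoist loop-invariant work on its own, so the gain from doing it by hand should be measured rather than assumed.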

3- Use look-up tables for some mathematical functions (sin, cos, ...)
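A look-up table for sine could be sketched as below. The table size (256 entries), the names `sine_table` and `fast_sin`, and the choice of expressing the angle in table steps are all assumptions made for this example. The table trades accuracy and memory traffic for arithmetic, so it only pays off when profiling shows the math, not memory, is the bottleneck.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Power of two, so wrapping the index is a single AND with a mask.
constexpr std::size_t kTableSize = 256;

// Build the table once, on first use.
inline const std::array<float, kTableSize>& sine_table() {
    static const std::array<float, kTableSize> table = [] {
        std::array<float, kTableSize> t{};
        const double two_pi = 6.283185307179586;
        for (std::size_t i = 0; i < kTableSize; ++i) {
            t[i] = static_cast<float>(std::sin(two_pi * i / kTableSize));
        }
        return t;
    }();
    return table;
}

// phase is the angle in table steps: a full turn is kTableSize steps.
// No interpolation, so the result is quantized to the nearest lower step.
inline float fast_sin(unsigned phase) {
    return sine_table()[phase & (kTableSize - 1)];
}
```

With this layout `fast_sin(64)` corresponds to a quarter turn and `fast_sin(phase + 64)` gives the cosine.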

USEFUL TOOLS

  • objdump: disassembles a compiled program. This lets you compare the code generated for two versions of a function and check whether a change really optimized it.
Jav_Rock
  • 9
    **Beware**: nowadays the bottleneck is memory more often than not (and therefore LUTs are not so great...). It might differ on ARM, admittedly, but... better to check than to invest effort for nothing. – Matthieu M. May 29 '12 at 13:48
  • Yep. But in real-time applications that perform a lot of calculations per frame, believe me, optimization can save "some" frames per second. In my case "some" was 8 fps, which is why I think this question is important. – Jav_Rock May 29 '12 at 13:59
  • 3
    Do you have the possibility to check different metrics, like cache misses, memory bus accesses, etc? This is also very helpful to know if your mem bus is a bottleneck. BTW, off-topic, (donostia == San Sebastian) ? If so, I really like that city! – Brady May 29 '12 at 14:02
  • 3
    @Jav_Rock: Matthieu is saying to _profile first_. Since memory is often a bottleneck, lookup tables are sometimes _slower_ than calculating the value. – Mooing Duck May 29 '12 at 14:25
  • But look-up tables, for example in the case of a 1024-point FFT, save a lot of multiplications. – Jav_Rock May 29 '12 at 14:40
  • 2
    @Jav_Rock: Indeed, so if you're optimising for an ancient processor with a slow multiplier and a CPU speed similar to its bus speed, then a lookup table is a win. On a modern processor, with a fast multiplier and a comparatively slow bus, it may be slower. Therefore, it's very important to profile and make sure that it actually is an improvement. – Mike Seymour May 29 '12 at 16:27
  • 1
    Here is an interesting counter-argument for using look-up tables for simple mathematical functions (sin, cos), at least if it makes sense to use fixed point types: http://www.coranac.com/2009/07/sines/ - the author provides C and hand-coded assembler versions of sin, comparing accuracy and average cycles of the different versions. – FooF Jun 12 '12 at 12:38

2 Answers

18

To answer your question about general rules when optimizing C++ code for ARM, here are a few suggestions:

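1) As you mentioned, division is expensive: many ARM cores (the Cortex-A8 and Cortex-A9, for example) have no hardware integer divide instruction at all. Use logical shifts or multiply by the inverse when possible.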
2) Memory is much slower than CPU execution; use logical operations to avoid small lookup tables.
3) Try to write 32 bits at a time to make the best use of the write buffer. Writing shorts or chars will slow the code down considerably. In other words, it's faster to logical-OR the smaller values together and write them as 32-bit words.
4) Be aware of your L1/L2 cache size. As a general rule, ARM chips have much smaller caches than Intel.
5) Use SIMD (NEON) when possible. NEON instructions are quite powerful and for "vectorizable" code, can be quite fast. NEON intrinsics are available in most C++ environments and can be nearly as fast as writing hand tuned ASM code.
6) Use the cache prefetch hint (PLD) to speed up looping reads. ARM doesn't have smart precache logic the way that modern Intel chips do.
7) Don't trust the compiler to generate good code. Look at the ASM output and rewrite hotspots in ASM. For bit/byte manipulation, the C language can't specify things as efficiently as they can be accomplished in ASM. ARM has powerful 3-operand instructions, multi-load/store and "free" shifts that can outperform what the compiler is capable of generating.
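To make point 5 concrete, here is a minimal sketch of element-wise addition with NEON intrinsics. The function name `add_arrays` and the `HAVE_NEON` macro are hypothetical. The intrinsics used (`vld1q_f32`, `vaddq_f32`, `vst1q_f32`) come from `arm_neon.h`, and the code falls back to a plain scalar loop when NEON is not enabled, so it builds on any target.

```cpp
#include <cstddef>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

// out[i] = a[i] + b[i] for i in [0, n).
inline void add_arrays(const float* a, const float* b, float* out,
                       std::size_t n) {
    std::size_t i = 0;
#if HAVE_NEON
    // Process four floats per iteration in 128-bit NEON registers.
    for (; i + 4 <= n; i += 4) {
        const float32x4_t va = vld1q_f32(a + i);
        const float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    // Scalar loop: handles the tail, or everything when NEON is absent.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

How well a given compiler schedules these intrinsics varies, so it is worth comparing the generated assembly against the scalar version before relying on it.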

BitBank
  • I like multiplying by the inverse and logical shifts. I also try to use fixed-point on devices without NEON. I will update the post with your tips, thanks! – Jav_Rock May 29 '12 at 19:58
  • Regarding 5): I have seen quite a few SO discussions about using the NEON intrinsics. To sum it up, it seems a lot of people are finding that the compiler is not doing a very good job of translating the intrinsics into good assembly code. The consensus seems to be that if you want to use NEON, you are better off writing it as assembly code directly. – Leo May 30 '12 at 07:49
  • @Leo - it depends on the compiler. GCC is very bad at compiling NEON intrinsics. Apple's LLVM is so-so and Microsoft's compilers are quite good. – BitBank May 30 '12 at 13:06
  • Ah, good to know. I have mostly used GCC myself. Do you know how ARM's own compilers (ADS/RVCT) fare? I would assume they would be good, but it's hard to know without testing. – Leo May 30 '12 at 14:13
  • I haven't used ARM's compilers, but I would assume that they're good. It would be quite an embarrassment for ARM's own tools to produce bad output. – BitBank May 30 '12 at 14:32
17

The best way to optimize an application is to use a good profiler. It's always a good idea to write code with efficiency in mind, but you also want to avoid making changes where you merely "think" the code may be slow; if you're not 100% sure, this could make things worse.

Find out where the bottlenecks are and focus on those.

For me profiling is an iterative process, because usually when you fix one bottleneck, other less important ones manifest themselves.
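When a full profiler is not at hand on the target device, a crude timing wrapper can at least confirm whether a change helped. The sketch below is not a substitute for a profiler; the name `average_microseconds` is hypothetical, and it relies only on `std::chrono::steady_clock` from the standard library.

```cpp
#include <chrono>

// Call fn() `iterations` times and return the average wall-clock
// time per call in microseconds. Hypothetical helper for illustration.
template <typename Fn>
double average_microseconds(Fn&& fn, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Timing the same function before and after each change, with enough iterations to smooth out noise, fits the iterative process described above.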

In addition to profiling the SW, check what sort of HW profiling is available. Check if you can get different HW metrics, like cache misses, memory bus accesses, etc. This is also very helpful to know if your mem bus or cache is a bottleneck.

I recently asked this similar question and got some good answers: Looking for a low impact c++ profiler

Brady