-3

I know that when writing some (e.g.) real-time applications execution speed is very important. Sometimes it is possible to obtain higher speed executions by writing inline assembly.

I would like to know what would be a good way to identify:

1) where most of the time is lost executing an algorithm

2) whether writing inline assembly will really enhance execution speed

Thank you in advance.

  • This question is "extremely" platform-dependent (by *platform*, I mean both the underlying HW architecture and the designated compiler). Some manufacturers provide their platform along with an IDE, which includes (among other things) a profiler that allows you to analyze your code and identify potential areas for such optimization. Some also provide loop-unrolling pragmas, a set of intrinsics (dedicated machine operations wrapped in "readable macros"), and other tools, all specific to their platform. So there is no general "C-language standard" answer to this question. – barak manos Oct 30 '16 at 14:49
  • The classic answer, not platform-specific: Assembly optimisation: just even think about it! – chqrlie Oct 30 '16 at 14:53
  • The very definition of real-time is _late answers are wrong answers_. But the goal is _not_ fast execution of any given algorithm, it is timely production of useful results. One resource not to waste is developer time. Profiling can tell where time is _spent_; inline assembly is but one tool in the box. Before using it, have a good look at the assembly that "optimising" compilers generate for your "high level"/C code. – greybeard Oct 31 '16 at 06:20

3 Answers

4

1) where most of the time is lost executing an algorithm

Time is not "lost", merely perhaps wasted. The efficiency of any algorithm will depend on many things, such as:

  • Selection of the most appropriate algorithm for the problem in hand,
  • How well it has been coded,
  • What language it was coded in,
  • The efficiency of the compiler code generation and optimisation,
  • Selection of appropriate compiler options.

That is to say, your question is unanswerable in general. It is normally settled on a case-by-case basis by profiling the code in question, but there is plenty that can be done before jumping to assembly code. A poorly chosen or implemented algorithm may run faster in assembly code, but it is still a poor choice and/or implementation, and you might have got better results simply by getting that right.

2) whether writing inline assembly will really enhance execution speed

The first things to consider are how good you are at writing assembly code and how familiar you are with the instruction set of the specific target, or perhaps how expensive the expert you would need to employ to achieve any real benefit would be.

How much time are you prepared to spend hand-crafting assembly code before you discover that you cannot achieve any significantly useful benefit, or that the time taken to do so has caused your project to fail in any case?

Also consider that the compiler optimiser embodies a great deal of expertise in the architecture and instruction set of the target it is generating code for, and that it takes a lot of time and expertise to beat it in any significant way.

Another thing to consider is the lack of portability of assembler code. If your development moves to a different architecture, all of that expensively generated assembly code may be rendered obsolete and have to be redeveloped or ported by hand (requiring the poor maintainer to actually understand what the code does).

I have been writing hard real-time and DSP systems for a long time and have never resorted to assembler for performance reasons. I have used it only to achieve things that cannot be done in a high-level language such as C, such as manipulating core registers like the program counter and stack pointer (in a real-time scheduler, for example). In one case, an application that ran on a 200MHz DSP with large amounts of code written in assembler was ported to a 72MHz MCU and rewritten entirely in C++. This was achieved through a combination of better design and the use of DMA to capture and process signals in blocks of samples rather than on a sample-by-sample basis, significantly reducing the interrupt rate and software overhead. Another example I have experience of is an electronically commutated motor application written entirely in PIC assembler that was reworked in C; through more appropriate use of the available PWM and timer/counter hardware, the C implementation was more precise, more efficient, and smaller in code size than the 100% assembler implementation.

Real-time systems are less often about speed of execution and more frequently about deterministic behaviour and meeting deadlines. Often complex processing can be deferred, so meeting deadlines can often be achieved by careful design rather than through micro-optimisation. Often it is possible to utilise hardware features such as interrupt handling, DMA, and timer capture to achieve performance gains.

Often it is less costly and far simpler to get the performance gain you need by selecting a faster processor in the first place. I would suggest that using assembler to gain necessary performance is the last resort of the desperate, and often indicative of poor software design and/or implementation, or of inappropriate processor selection.

Clifford
2

Use a profiler to determine where the time is spent for some pertinent benchmarks.

There is no need optimizing parts of your program that do not represent a significant portion of execution time.

Assembly is inherently non-portable. It is a black art, very difficult to master and maintain. Indeed, maintenance is needed as processor architectures evolve, and compiler writers spend huge effort taking advantage of these improvements. It would take very specific circumstances to warrant the cost of assembly-level optimisation. Access to specific assembly-level instructions may be required for some operating-system tasks, but production code rarely justifies this approach.

Even vector instructions should not be written as inline assembly in C or C++ programs; processor vendors provide intrinsics to encapsulate them.

If profiling shows identifiable bottlenecks in your code, you should first try and optimize the C code in C, while thinking of a potentially better algorithm.

If, as a last resort and because you have the necessary skills available, you decide to use inline assembly, clearly identify the functions that use such non-portable implementations and keep an alternate reference implementation in pure C for comparison and for portability to other architectures. And of course, benchmark the resulting code and only use it if the gain is significant.

In short:

1) where most of the time is lost executing an algorithm

Use a profiling tool

2) whether writing inline assembly will really enhance execution speed

Maybe, but very difficult and most likely not worth the effort.

chqrlie
  • Or as a last resort (continuing your "first try and optimize the C code" statement), keep the assembly parts within specific files/modules, so that the entire code could easily be adapted when porting to a different platform. – barak manos Oct 30 '16 at 14:56
  • @barakmanos: I updated my answer with such directions. – chqrlie Oct 30 '16 at 15:14
  • I up-voted both the question and your answer (somebody here obviously has something against this post). – barak manos Oct 30 '16 at 15:18
  • upvoted for pointing out that the asm implementation needs maintenance for future CPU microarchitectures. Great point. 15 or 20 year old C still compiles to good asm with modern compilers, but old inline asm might be tuned for P5, and not be optimal for Skylake. (And definitely won't take advantage of things like newer vector instruction sets.) – Peter Cordes Oct 31 '16 at 02:11
2

1) where most of the time is lost executing an algorithm

Use a profiler to find hot spots. It's not even worth looking at the compiler's asm output for code that isn't part of an important loop.

2) whether writing inline assembly will really enhance execution speed

Look at the compiler's asm output and see if it's doing something stupid that you could do better. This requires knowing the microarchitecture you're targeting, so you know what's slow vs. fast. If you're targeting x86, see the tag wiki for perf guides (e.g. Agner Fog's optimizing assembly guide, microarchitecture guide, and instruction tables, as well as Intel's optimization manual).

As @chqrlie points out, any hand-written asm will also be tuned for some specific microarchitecture, and may not be optimal on future CPUs. Out-of-order execution often hides instruction-ordering issues, but not all ARM CPUs are out-of-order, so scheduling matters.


Your first attempt should be to tweak the C to guide the compiler into a smarter way of implementing the same logic, like I did in this answer.

If the problem is vectorizable, but the compiler doesn't auto-vectorize it, your first course of action should be to manually vectorize it with intrinsics, not with inline-asm. Compilers can do a good job optimizing code that uses intrinsics.

Writing inline asm (or whole function in asm that you call from C) should be a last resort. Besides the portability and maintainability problems, inline asm defeats compiler optimizations like constant-propagation. See https://gcc.gnu.org/wiki/DontUseInlineAsm.

If one of the inputs to your function is a compile-time-constant (after inlining and link-time optimization), a C implementation (with intrinsics) will simplify to the special case for that constant input.

But an inline-asm version won't simplify at all. The compiler will just MOV constant values into registers and run your asm as written. In GNU C, you can sometimes detect and avoid this by asking the compiler whether an input is a compile-time-constant. e.g. if(__builtin_constant_p(some_var)) { C implementation } else { asm(...); }. Unfortunately, clang doesn't propagate compile-time-constantness through function inlining, so it's always false for function args :(


And finally, if you think you can beat the compiler, make sure you actually succeeded by running a benchmark once you're done, against the best C implementation you can come up with.

Peter Cordes