(Most of this was written for the original version of the question. It was edited after).
You mean purely for performance reasons, so excluding using special instructions in an OS kernel?
What you really ultimately want is machine code that executes efficiently. And the ability to modify some text files and recompile to get different machine code. You can usually get both of those things without needing inline asm, therefore:
GNU C inline assembly is hard to use correctly, but if you do use it correctly it has very low overhead. Still, it blocks many important optimizations like constant-propagation.
See https://stackoverflow.com/tags/inline-assembly/info for guides on how to use it efficiently / safely (e.g. use constraints instead of stupid `mov` instructions as the first or last instruction in the asm template).
Inline asm is pretty much always inappropriate, unless you know exactly what you're doing and can't hand-hold the compiler into making asm that's quite as good from pure C or intrinsics. Manual vectorization with intrinsics certainly still has its place; compilers are still terrible at some things, like auto-vectorizing complex shuffles. GCC/Clang won't auto-vectorize at all for search loops like a pure C implementation of `memchr`, or any loop where the trip-count isn't known before the first iteration.
And of course any gain on current microarchitectures has to be worth the cost in maintainability and in the compiler's ability to optimize differently for future CPUs. If it's ever appropriate, it's only for small hot loops where your program spends a lot of its time, and typically ones that are CPU-bound; if memory-bound, there's usually not much to gain.
Over large scales, compilers are excellent (especially with link-time optimization). Humans can't compete on that scale, not while keeping code maintainable. The only place humans can still compete is in the small scale where you can afford the time to think about every single instruction in a loop that will run many iterations over the course of a program.
The more widely-used and performance-sensitive your code is (e.g. a video encoder like x264 or x265), the more reason there is to consider hand-tuned asm for anything. Saving a few cycles over millions of computers running your code every day starts to add up to being worth considering the maintenance / testing / portability downsides.
The one notable exception is ARM SIMD (NEON), where compilers are often still bad. I think especially for 32-bit ARM, where each 128-bit `q0..q15` register is aliased by a pair of 64-bit `d` registers (`d0..d31`), so you can avoid shuffling by accessing the two halves as separate registers. Compilers don't model this well, and can easily shoot themselves in the foot when compiling intrinsics that you'd expect to compile efficiently. Compilers are good at producing efficient asm from SIMD intrinsics for x86 (SSE/AVX) and PowerPC (AltiVec), but for some unknown reason are bad at optimizing ARM NEON intrinsics and often make sub-optimal asm.
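A sketch of what that aliasing buys you at the asm level (32-bit ARM, illustrative only):

```
@ On 32-bit ARM, q1 is architecturally the same storage as the pair d2:d3,
@ so both 64-bit halves of a 128-bit result already have register names:
vadd.f32  q1, q2, q3        @ 128-bit add, result in q1 (= d2:d3)
vmul.f32  d0, d2, d3        @ multiply low half of q1 by its high half
@ -- no vext or extract shuffle needed to split the vector.
```

A compiler that doesn't model the q/d overlap may emit real shuffle instructions to do what a hand-written version gets for free by renaming.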
Some compilers are not bad, e.g. apparently Apple clang/LLVM for AArch64 does ok more often than it used to. But still, see Arm Neon Intrinsics vs hand assembly - Jake Lee found the intrinsics version of his 4x4 float matmul was 3x slower than his hand-written version using clang, in Dec 2017. Jake is an ARM optimization expert so I'm inclined to believe that's fairly realistic.
> or `__asm` (in the case of VC++)
MSVC-style asm is usually only useful for writing whole loops because having to take inputs via memory operands destroys (some of) the benefit. So amortizing that overhead over a whole loop helps.
For wrapping single instructions, introducing extra store-forwarding latency is just dumb, and there are MSVC intrinsics for almost everything you can't easily express in pure C. See What is the difference between 'asm', '__asm' and '__asm__'? for examples with a single instruction: you get much worse asm from using MSVC inline asm than you would for pure C or an intrinsic if you look at the big picture (including compiler-generated asm outside your asm block).
C++ code for testing the Collatz conjecture faster than hand-written assembly - why? shows a concrete example where hand-written asm is faster on current CPUs than anything I was able to get GCC or clang to emit by tweaking C source. They apparently don't know how to optimize for lower-latency LEA when it's part of a loop-carried dependency chain.
(The original question there was a great example of why you shouldn't write by hand in asm unless you know exactly what you're doing and use optimized compiler output as a starting point. But my answer shows that for a long-running hot tight loop, there are significant gains that compilers are missing with just micro-optimizations, even leaving aside algorithmic improvements.)
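The micro-optimization at stake there, sketched in x86-64 asm (illustrative; the latency numbers are the ones for Sandy Bridge-family as I recall them):

```
# n = 3*n + 1 as compilers typically emit it:
lea  rcx, [rax + rax*2 + 1]   # 3-component LEA: 3-cycle latency on SnB-family

# vs. keeping the LEA "simple" and adding the 1 separately:
lea  rcx, [rax + rax*2]       # 2-component LEA: 1-cycle latency
inc  rcx                      # 2 cycles total on the loop-carried dep chain
```

Shaving a cycle off a dependency chain that every iteration waits on is exactly the kind of thing compilers miss and humans can find in a small hot loop.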
If you're considering asm, always benchmark it against the best you can get the compiler to emit. Working on a hand-written asm version may give you ideas that you can apply to your C to hand-hold compilers into making better asm. Then you can get the benefit without actually including any non-portable inline asm in your code.