(Most of this was written for the original version of the question. It was edited after).
You mean purely for performance reasons, so excluding using special instructions in an OS kernel?
What you really ultimately want is machine code that executes efficiently. And the ability to modify some text files and recompile to get different machine code. You can usually get both of those things without needing inline asm, therefore:
GNU C inline assembly is hard to use correctly, but if you do use it correctly it has very low overhead. Still, it blocks many important optimizations like constant-propagation.
See https://stackoverflow.com/tags/inline-assembly/info for guides on how to use it efficiently / safely (e.g. use constraints instead of stupid `mov` instructions as the first or last instruction in the asm template).
Inline asm is pretty much always inappropriate, unless you know exactly what you're doing and can't hand-hold the compiler into making asm that's quite as good from pure C or intrinsics. Manual vectorization with intrinsics certainly still has its place; compilers are still terrible at some things, like auto-vectorizing complex shuffles. GCC/Clang won't auto-vectorize at all for search loops like a pure C implementation of `memchr`, or any loop where the trip-count isn't known before the first iteration.
And of course any gain on current microarchitectures has to be worth the cost in maintainability and in the compiler's ability to optimize differently for future CPUs. If it's ever appropriate, it's only for small hot loops where your program spends a lot of its time, and typically ones that are CPU-bound; if memory-bound, there's usually not much to gain.
Over large scales, compilers are excellent (especially with link-time optimization). Humans can't compete on that scale, not while keeping code maintainable. The only place humans can still compete is in the small scale where you can afford the time to think about every single instruction in a loop that will run many iterations over the course of a program.
The more widely-used and performance-sensitive your code is (e.g. a video encoder like x264 or x265), the more reason there is to consider hand-tuned asm for anything. Saving a few cycles over millions of computers running your code every day starts to add up to being worth considering the maintenance / testing / portability downsides.
The one notable exception is ARM SIMD (NEON), where compilers are often still bad. I think especially for 32-bit ARM, where each 128-bit `q0..q15` register is aliased by a pair of 64-bit `d` registers (`d0..d31`), so you can avoid shuffling by accessing the two halves as separate registers. Compilers don't model this well, and can easily shoot themselves in the foot when compiling intrinsics that you'd expect to compile efficiently. Compilers are good at producing efficient asm from SIMD intrinsics for x86 (SSE/AVX) and PowerPC (AltiVec), but for some unknown reason are bad at optimizing ARM NEON intrinsics and often make sub-optimal asm.
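A sketch of what that aliasing buys you at the asm level (32-bit ARM, illustrative only):

```
@ On 32-bit ARM, q1 is architecturally the same storage as the pair d2:d3,
@ so both 64-bit halves of a 128-bit result already have register names:
vadd.f32  q1, q2, q3        @ 128-bit add, result in q1 (= d2:d3)
vmul.f32  d0, d2, d3        @ multiply low half of q1 by its high half
@ -- no vext or extract shuffle needed to split the vector.
```

A compiler that doesn't model the q/d overlap may emit real shuffle instructions to do what a hand-written version gets for free by renaming.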
Some compilers are not bad, e.g. apparently Apple clang/LLVM for AArch64 does ok more often than it used to. But still, see Arm Neon Intrinsics vs hand assembly - Jake Lee found the intrinsics version of his 4x4 float matmul was 3x slower than his hand-written version using clang, in Dec 2017. Jake is an ARM optimization expert so I'm inclined to believe that's fairly realistic.
> or `__asm` (in the case of VC++)
MSVC-style asm is usually only useful for writing whole loops because having to take inputs via memory operands destroys (some of) the benefit. So amortizing that overhead over a whole loop helps.
For wrapping single instructions, introducing extra store-forwarding latency is just dumb, and there are MSVC intrinsics for almost everything you can't easily express in pure C. See What is the difference between 'asm', '__asm' and '__asm__'? for examples with a single instruction: you get much worse asm from using MSVC inline asm than you would for pure C or an intrinsic if you look at the big picture (including compiler-generated asm outside your asm block).
C++ code for testing the Collatz conjecture faster than hand-written assembly - why? shows a concrete example where hand-written asm is faster on current CPUs than anything I was able to get GCC or clang to emit by tweaking C source. They apparently don't know how to optimize for lower-latency LEA when it's part of a loop-carried dependency chain.
(The original question there was a great example of why you shouldn't write by hand in asm unless you know exactly what you're doing and use optimized compiler output as a starting point. But my answer shows that for a long-running hot tight loop, there are significant gains that compilers are missing with just micro-optimizations, even leaving aside algorithmic improvements.)
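The micro-optimization at stake there, sketched in x86-64 asm (illustrative; the latency numbers are the ones for Sandy Bridge-family as I recall them):

```
# n = 3*n + 1 as compilers typically emit it:
lea  rcx, [rax + rax*2 + 1]   # 3-component LEA: 3-cycle latency on SnB-family

# vs. keeping the LEA "simple" and adding the 1 separately:
lea  rcx, [rax + rax*2]       # 2-component LEA: 1-cycle latency
inc  rcx                      # 2 cycles total on the loop-carried dep chain
```

Shaving a cycle off a dependency chain that every iteration waits on is exactly the kind of thing compilers miss and humans can find in a small hot loop.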
If you're considering asm, always benchmark it against the best you can get the compiler to emit. Working on a hand-written asm version may give you ideas that you can apply to your C to hand-hold compilers into making better asm. Then you can get the benefit without actually including any non-portable inline asm in your code.