Very few parts of Linux are written in asm for performance. See @Ped7g's answer for more about why kernels use inline asm for an occasional privileged instruction (like mov to/from control registers), or whole files of hand-written asm for entry points (like the interrupt and system-call handlers that dispatch to a C function).
In Linux, maybe only the RAID5 XOR parity (using SSE2 or AVX on x86) and RAID6 error correction are written in asm for performance.
Those were presumably written directly in asm, because manually vectorizing in C with intrinsics isn't easier. The looping is still done with C in those Linux functions, IIRC.
(And it uses very bad style, with multiple separate asm("") statements that use the XMM or YMM registers and rely on their contents surviving from one statement to the next. This happens to work, especially in kernel code where the compiler will never generate code that uses XMM registers, but using a single asm block, or vector output/input operands, would be safer. See Linux's lib/raid6/sse2.c for an example. There's also asm/xor.h, which has some generic block-xor functions with the looping done in asm, too, presumably used by other parts of the kernel.) These are among the few places the kernel uses SIMD vector registers at all, because saving/restoring the FPU state is expensive.
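As a rough sketch of the "vector output/input operands" alternative (user-space code, not what the kernel does; xor128_asm and xor_into are hypothetical names invented for this example), one asm statement can take its vectors as operands with the "x" constraint so the compiler picks the XMM registers itself. It assumes GNU C inline asm with the default AT&T syntax and falls back to plain C elsewhere:

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) && defined(__GNUC__)
#include <emmintrin.h>

/* One self-contained asm statement: nothing is assumed to stay in a
   particular XMM register before or after it. */
static inline __m128i xor128_asm(__m128i a, __m128i b)
{
    /* "+x": a is read and written in an XMM register of the compiler's choice.
       "x" : b is an input in another XMM register. */
    __asm__("pxor %1, %0" : "+x"(a) : "x"(b));
    return a;
}
#endif

/* dst[i] ^= src[i] for len bytes. The looping stays in C. */
void xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;
#if defined(__SSE2__) && defined(__GNUC__)
    for (; i + 16 <= len; i += 16) {
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), xor128_asm(d, s));
    }
#endif
    for (; i < len; i++)   /* tail bytes, or everything on non-SSE2 targets */
        dst[i] ^= src[i];
}
```

Because the asm statement declares its inputs and outputs, the compiler is free to use XMM registers for its own code around it. Inside the kernel, code like this would additionally have to run between kernel_fpu_begin() and kernel_fpu_end(), which is where the save/restore cost comes from.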
Linux probably uses inline asm for performance for the x86 CRC32 instruction if available; several things use the CRC32C polynomial which x86 accelerates.
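To illustrate the "if available" part, here is a minimal user-space sketch, not the kernel's implementation. It assumes GNU C on x86-64 for the hardware path, and the function names crc32c, crc32c_hw and crc32c_sw are made up for this example. It checks for SSE4.2 at run time and otherwise falls back to a slow but portable bitwise loop over the same CRC32C polynomial:

```c
#include <stddef.h>
#include <stdint.h>

/* Portable bitwise CRC32C update (reflected polynomial 0x82F63B78). */
static uint32_t crc32c_sw(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Same update using the SSE4.2 crc32 instruction, one byte at a time.
   Only call this after checking that the CPU supports SSE4.2. */
static uint32_t crc32c_hw(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len--) {
        uint8_t b = *p++;
        __asm__("crc32b %1, %0" : "+r"(crc) : "rm"(b));
    }
    return crc;
}
#endif

/* CRC32C with the usual initial value and final inversion. */
uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = 0xFFFFFFFFu;
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(crc, p, len) ^ 0xFFFFFFFFu;
#endif
    return crc32c_sw(crc, p, len) ^ 0xFFFFFFFFu;
}
```

A tuned version would feed the instruction 8 bytes at a time and probably interleave several independent CRC streams. The byte loop here only shows the dispatch and that both paths compute the same value.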
For the more general case of your question, using compiler-generated asm as a starting point for optimization is often a good idea.
But if the compiler already emits good asm, you don't need to do anything and can just use that C. That's even better than inline asm because it can optimize with constant-propagation and so on. Or maybe you can tweak the C source to help the compiler do a more efficient job.
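For example, a block XOR like the RAID5 parity case can be written as a plain C loop (xor_block_c is a hypothetical name, not a kernel function). Marking the pointers restrict is one small source tweak that tells the compiler the buffers don't overlap, which typically lets GCC or clang auto-vectorize the loop at -O3 without emitting a runtime overlap check:

```c
#include <stddef.h>
#include <stdint.h>

/* dst[i] ^= src[i] for n 64-bit words, in portable C.
   restrict promises that dst and src do not alias. */
void xor_block_c(uint64_t *restrict dst, const uint64_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}
```

It's worth looking at the generated asm to confirm that the loop actually got vectorized for your target and options before deciding anything by hand is needed.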
But if you can't get the compiler to make an optimal loop, then sure, you can take its asm and optimize it by hand. As long as you benchmark against the original, you can't lose to the compiler. (Except when your asm defeats optimizations that would otherwise apply, for example when inlining makes an input a compile-time constant.)
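One way to hedge against the compile-time-constant caveat in GNU C is to use the asm only when the inputs are not known constants. The sketch below is deliberately contrived (compilers already turn the C rotate idiom into a single instruction, so nobody needs asm for it), and rotl32 / rotl32_c are hypothetical names. It only shows the pattern, assuming GCC or clang on x86-64:

```c
#include <stdint.h>

/* Pure C rotate-left: the compiler can constant-fold and inline this. */
static inline uint32_t rotl32_c(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

static inline uint32_t rotl32(uint32_t x, unsigned n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    /* Use the asm only when the inputs are not compile-time constants,
       so constant inputs still fold through the C version. */
    if (!(__builtin_constant_p(x) && __builtin_constant_p(n))) {
        /* The count must be in CL; the CPU masks it to 5 bits. */
        __asm__("roll %%cl, %0" : "+r"(x) : "c"(n) : "cc");
        return x;
    }
#endif
    return rotl32_c(x, n);
}
```

The asm statement is still opaque to the optimizer when it is used, so this keeps constant folding but not other optimizations the compiler could have made across the operation.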
For more details about helping vs. beating the compiler, see the Q&A "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?".
You'd only consider using a hand-written asm loop for very critical portions of a piece of software, especially in a portable code-base like Linux, because you need a different implementation for every platform.
And because what's optimal on Skylake isn't what was optimal on P5 Pentium 20 years ago, and might not be optimal on some future x86 20 years from now. Sticking to portable C lets tuning options like -march=skylake do their job and make asm that's tuned for the specific microarchitecture you're compiling for. (Or lets updates to compilers' default tuning take effect over the years.)
Not to mention that most kernel developers aren't asm tuning experts who can easily write near-optimal asm by hand. It's not something that people do often. If you like doing that, work on gcc or clang to make them generate more optimal code from C.