Kernel code can only safely use FPU / SIMD between `kernel_fpu_begin()` / `kernel_fpu_end()`, which trigger an `xsave` (and an `xrstor` before returning to user-space). Or `xsaveopt` or whatever. That's a lot of overhead, and it isn't worth it outside of a few rare cases (like md RAID5 / RAID6 parity creation / use).
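If you do need SIMD in the kernel, the pattern is to bracket it with those calls. A minimal sketch (the `do_simd_parity()` helper is a made-up placeholder; `kernel_fpu_begin()` / `kernel_fpu_end()` are the real API, declared in `<asm/fpu/api.h>` on current kernels):

```c
#include <linux/types.h>
#include <asm/fpu/api.h>        /* kernel_fpu_begin() / kernel_fpu_end() */

/* Hypothetical SIMD routine, standing in for something like the md RAID6
 * P/Q syndrome generation that really does use SSE/AVX in the kernel. */
void do_simd_parity(void *dst, const void *src, size_t len);

static void compute_parity_block(void *dst, const void *src, size_t len)
{
        kernel_fpu_begin();              /* save user FPU/SIMD state; SSE/AVX now usable */
        do_simd_parity(dst, src, len);
        kernel_fpu_end();                /* done; user state restored before returning to user-space */
}
```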
Unfortunately this means only GP-integer registers are available for most kernel code. The difference between an AVX `memcpy` loop and a `rep movsb` `memcpy` is not worth an `xsave`/`xrstor` on every system call.
Background on context switches vs. just entering the kernel: the kernel handles FPU state save/restore on context switches between user-space tasks. But on a plain kernel entry (e.g. for a system call) you're about to return to the same user-space, so you want to avoid a heavy FPU save/restore every time and just save the GP-integer regs.
For known-size copies, not having SSE/AVX is not too bad, especially on CPUs with the ERMSB feature (which is when this copy function is used, hence the `enhanced_fast_string` in the name). For medium to large aligned copies, `rep movsb` is nearly as fast as a vector loop on Intel CPUs at least, and hopefully also on AMD. See Enhanced REP MOVSB for memcpy. Without ERMSB, `rep movsq` + cleanup for the last few bytes at least comes close.
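For reference, a `rep movsb` copy in GNU C inline asm is roughly this; it's a sketch, not the kernel's actual `copy_user_enhanced_fast_string`, which is hand-written asm with fault handling for user pointers:

```c
#include <stddef.h>

/* Rough sketch of an ERMSB-style copy: the whole copy is one rep movsb.
 * Real kernel copy routines wrap this with user-pointer fault handling. */
static inline void *memcpy_ermsb(void *dst, const void *src, size_t n)
{
        void *ret = dst;
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)   /* rdi, rsi, rcx */
                     :
                     : "memory");
        return ret;
}
```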
In a 64-bit kernel, GP-integer regs are half the width of XMM regs. For small copies (below the kernel's 64-byte threshold), 8x GP-integer 8-byte loads and stores should be pretty efficient compared to the overhead of a system call in general. 4x XMM load/store would be nice, but it's a tradeoff against saving FPU state.
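As a sketch of what that looks like in C (the compiler lowers each `memcpy(&tmp, ..., 8)` to a plain 8-byte `mov`; this isn't the kernel's actual code, which is asm):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: copy 64 bytes with eight 8-byte GP-integer load/store pairs.
 * memcpy() through a uint64_t temp avoids strict-aliasing and alignment
 * issues; compilers emit plain mov instructions for it. */
static inline void copy_64(void *dst, const void *src)
{
        for (int i = 0; i < 8; i++) {
                uint64_t tmp;
                memcpy(&tmp, (const char *)src + 8 * i, 8);
                memcpy((char *)dst + 8 * i, &tmp, 8);
        }
}
```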
Not having SIMD is significantly worse for `strlen` / `strcpy`, where `pcmpeqb` is very good vs. a 4- or 8-byte-at-a-time bithack. And SSE2 is baseline for x86-64, so an x86-64 kernel could depend on that without dynamic dispatch, if not for the problem of saving FPU state.
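For comparison, the GP-integer version of that check is the classic word-at-a-time zero-byte bithack, something like:

```c
#include <stdbool.h>
#include <stdint.h>

/* Classic bithack: true if any byte of v is zero.  This is the 8-bytes-at-a-time
 * test a GP-integer strlen loops on, vs. pcmpeqb checking 16 bytes per instruction. */
static inline bool has_zero_byte(uint64_t v)
{
        return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}
```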
You could in theory eat the SSE/AVX transition penalty and do what some bad Windows drivers do: just manually save/restore the low 128 bits of a vector reg with legacy SSE instructions. (This is why legacy SSE instructions don't zero the upper bytes of the full YMM / ZMM.) IDK if anyone's benchmarked doing that for a kernel-mode `strcpy`, `strlen`, or `memcpy`.
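For what it's worth, that hack would look something like this (an illustrative sketch, not anything the kernel does; it deliberately skips `kernel_fpu_begin()`, which is exactly why it's questionable):

```c
/* Spill xmm0 to the stack, use it for a 16-byte copy with legacy SSE,
 * then restore it.  Risks SSE/AVX transition penalties if the upper
 * YMM/ZMM state is dirty. */
static void copy16_xmm_hack(void *dst, const void *src)
{
        unsigned char saved[16] __attribute__((aligned(16)));

        asm volatile("movaps %%xmm0, (%0)\n\t"   /* save caller's xmm0        */
                     "movups (%2), %%xmm0\n\t"   /* 16-byte unaligned load    */
                     "movups %%xmm0, (%1)\n\t"   /* 16-byte unaligned store   */
                     "movaps (%0), %%xmm0"       /* restore xmm0              */
                     :
                     : "r"(saved), "r"(dst), "r"(src)
                     : "memory");
}
```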