You can't safely compile a whole file with `-march=sandybridge` or other options that imply `-mavx` and `-msse`. That would let GCC use AVX instructions anywhere in any function in this file, including before `kernel_fpu_begin()` or after `kernel_fpu_end()`.¹ It might for example use `vpxor xmm0,xmm0,xmm0` / `vmovups [rsp], xmm0` to zero some stack memory for local vars, especially for a struct initializer. This would silently corrupt user-space FP/SIMD state.
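For example, something like this plausibly triggers it (a hypothetical function; with `-march=sandybridge`, GCC is free to pick SIMD zeroing here):

```c
/* Hypothetical example: built with -march=sandybridge, GCC may well zero
 * this struct with vpxor xmm0,xmm0,xmm0 + vmovups stores, silently
 * clobbering xmm0 even though the source never mentions SIMD. */
struct ctx { long a, b, c, d; };

void init_ctx(struct ctx *out)
{
    struct ctx c = {0};   // struct initializer: prime bait for SIMD zeroing
    *out = c;
}
```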
But with GCC/clang, you can't use intrinsics or `__builtin_ia32` functions for instructions that you haven't told the compiler the target supports: the compiler will refuse to emit such instructions. (MSVC and ICC follow a different design philosophy, where they don't generally optimize intrinsics, but do let you use them anywhere.)
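For example, with plain `gcc -O2` (no `-mavx2`), something like this is rejected at compile time instead of silently emitting `vpaddd`; the diagnostic quoted in the comment is GCC's usual wording, and clang's is similar:

```c
#include <immintrin.h>

void add8(__m256i *dst, const __m256i *a, const __m256i *b)
{
    *dst = _mm256_add_epi32(*a, *b);
    /* error: inlining failed in call to 'always_inline'
       '_mm256_add_epi32': target specific option mismatch */
}
```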
Inline asm is different: then you're directly printing text into the compiler's asm output, which it doesn't know about except what you tell it via constraints. (This is literally true for GCC, and clang with its built-in assembler allows the same bypassing of target options.) This is why kernel code normally uses inline asm instead of futzing with target options and intrinsics. (Also because it's usually for one hand-tuned loop.)
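A minimal sketch of that difference: this compiles even under `-mno-sse`, because GCC never parses the template string. The clobber list is the only way it learns anything, and getting that right is entirely on you:

```c
/* Sketch: the compiler pastes this text into its asm output without
 * checking that pxor/movups are SSE instructions it was told not to use.
 * Only safe to run between kernel_fpu_begin()/kernel_fpu_end(). */
static void zero16(void *p)
{
    asm volatile("pxor %%xmm0, %%xmm0\n\t"
                 "movups %%xmm0, (%0)"
                 : : "r"(p) : "xmm0", "memory");
}
```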
Footnote 1: Unless this file only contains functions that are called with `kernel_fpu_begin()` already done, i.e. you put your intrinsic-using functions in a separate file from the one that does `kernel_fpu_begin(); bar_avx(); kernel_fpu_end();`, as sketched below. But that's inconvenient, and GNU C has other ways of setting target ISA-extension options that work on a per-function basis.
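A sketch of that separate-file layout, with hypothetical file and function names:

```c
/* bar_avx.c -- compiled with -mavx2.  Every function in this file relies
 * on the caller having already done kernel_fpu_begin(). */
#include <immintrin.h>
void bar_avx2_file(int *p)
{
    __m256i v = _mm256_loadu_si256((__m256i*)p);
    _mm256_storeu_si256((__m256i*)p, _mm256_add_epi32(v, v));
}

/* caller.c -- compiled with the normal kernel flags (-mno-sse etc.) */
void kernel_fpu_begin(void);
void kernel_fpu_end(void);
void bar_avx2_file(int *p);

void caller(int *p)
{
    kernel_fpu_begin();
    bar_avx2_file(p);
    kernel_fpu_end();
}
```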
The safe way: `__attribute__((target("foo")))`
What you can safely do, I think, is write a function with `__attribute__((target("avx,sse")))` and only call it between `kernel_fpu_begin()` / `kernel_fpu_end()`. A function can't inline into a caller with different target options, so the call overhead is unavoidable: make sure the function contains a loop, rather than calling it from a tight loop, as in the sketch below.
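Something along these lines (a sketch; remainder handling omitted for brevity) amortizes that call overhead over a whole array per call:

```c
#include <immintrin.h>

__attribute__((target("avx2")))
static void add_arrays_avx2(int *dst, const int *src, unsigned long len)
{
    // Process the whole array per call (assumes len is a multiple of 8),
    // so the non-inlinable call and kernel_fpu_begin/end happen only once.
    for (unsigned long i = 0; i < len; i += 8) {
        __m256i a = _mm256_loadu_si256((const __m256i*)(dst + i));
        __m256i b = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dst + i), _mm256_add_epi32(a, b));
    }
}
```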
With modern GCC/clang at least (GCC7 and later), `#include <immintrin.h>` will define all the functions and types even if you include it when `__AVX__` is not defined. So you can just include it in a file compiled with `-mno-sse` like kernel code normally is, and then use SSE / AVX2 intrinsics in an AVX2 function. You don't need `#pragma GCC target("avx2,bmi2")` before the include, even though the header needs to define some inline functions that return `__m256i`.
With GCC6 and earlier, `#include <immintrin.h>` breaks with `-mno-sse -mno-mmx`, but `#pragma GCC target("sse,mmx,avx2,bmi2")` before it fixes that, getting all the necessary AVX2 types and functions defined: https://godbolt.org/z/M4sEs61EG. With early clang, I sometimes managed to get it to emit scalar emulations of SIMD intrinsics, but that's not helpful; use clang7 or later.
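i.e. something like this sketch, using push/pop so the pragma's options don't leak into the rest of the `-mno-sse` file:

```c
#pragma GCC push_options
#pragma GCC target("sse,mmx,avx2,bmi2")  // let the header define its inline functions
#include <immintrin.h>
#pragma GCC pop_options                  // back to the file-wide -mno-sse options
```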
Example on the Godbolt compiler explorer, with some extra comments about pragmas, and `__attribute__((target("avx2,bmi2,arch=haswell")))` apparently working on clang but not GCC, IDK why. I was hoping it would set tune options as well, because tune=generic makes poor choices: many GCC versions split unaligned 256-bit loads/stores. You can safely use `-mtune=haswell` or `-mtune=intel` for the whole file if necessary. (That should be not bad for Zen 1, and probably good for Zen 2/3.)
```c
void kernel_fpu_begin(void); // stub declarations for a stand-alone test file
void kernel_fpu_end(void);
static void bar_avx(int *p, unsigned long len);

void foo(int *p, unsigned long len)
{
    kernel_fpu_begin();
    bar_avx(p, len);   // can't inline: different target options than this caller
    kernel_fpu_end();
}

#include <immintrin.h>

__attribute__((target("avx2,bmi2")))   // works with both GCC and clang
static void bar_avx(int *p, unsigned long len)
{
    __m256i v = _mm256_loadu_si256((__m256i*)p);
    v = _mm256_slli_epi32(v, 2);       // left-shift 8 ints by 2
    _mm256_storeu_si256((__m256i*)p, v);
    p[10] = _pext_u64(len, len);       // collect set bits at the bottom
}
```
Compiles to the following asm with gcc11.2 `-O2 -mno-vzeroupper -mno-avx -mno-sse -mno-mmx -Wall -mcmodel=kernel -ffreestanding`. (Not exactly what Linux uses, but that does fully disable all MMX, SSE, and AVX code-gen. `-mno-avx` is probably redundant with `-mno-sse`.)
```asm
bar_avx:
        vmovdqu ymm1, YMMWORD PTR [rdi]
        pext    rsi, rsi, rsi
        mov     DWORD PTR [rdi+40], esi
        vpslld  ymm0, ymm1, 2
        vmovdqu YMMWORD PTR [rdi], ymm0
      # a vzeroupper would be here without -mno-vzeroupper.  Not needed anyway,
      # because kernel_fpu_end is about to xrstor, replacing the current YMM state.
        ret

foo:
        push    r12
        mov     r12, rsi
        push    rbp
        mov     rbp, rdi          # save the incoming args in call-preserved regs
        sub     rsp, 8            # align the stack
        call    kernel_fpu_begin
        mov     rsi, r12
        mov     rdi, rbp
        call    bar_avx
        add     rsp, 8            # epilogue: restore stack and saved regs
        pop     rbp
        pop     r12
        jmp     kernel_fpu_end    # tailcall
```
All usage of AVX instructions is bracketed between kernel_fpu_begin/end.
Of course, I didn't do anything in `foo()` that would tempt the compiler to use SIMD instructions, like zero-initializing an array or struct. But the fact that `bar_avx()` isn't inlined is pretty clear evidence that GCC and clang keep the function separate because of its different target options. They don't know how to optimize a single function in which different blocks have different target options, so they have to not inline. `bar_avx()` is quite small and would definitely inline normally, especially with it being `static` so no stand-alone copy would need to be emitted.
Integer intrinsics:
You can safely use intrinsics that only operate on general-purpose integer registers, like `_mm_popcnt_u32` or BMI2 `_pdep_u64`, as long as you enable the appropriate CPU features, `-mpopcnt` and `-mbmi2` respectively. Just be sure not to indirectly enable SSE, as `-msse4.2` or `-march=haswell` would.
You don't even need kernel_fpu_begin/end around those, because they only use general-purpose integer registers, the same as instructions like `add` and `imul`.
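For instance, this sketch needs only `-mpopcnt -mbmi2`, with no FPU-state bracketing:

```c
#include <immintrin.h>

/* Sketch: compile with -mpopcnt -mbmi2 but NOT -msse4.2 or -march=haswell.
 * Everything stays in general-purpose registers, so no kernel_fpu_begin(). */
int count_bits(unsigned x)
{
    return _mm_popcnt_u32(x);      // popcnt r32, r32
}

unsigned long long spread_bits(unsigned long long v, unsigned long long mask)
{
    return _pdep_u64(v, mask);     // pdep: scatter low bits of v to mask positions
}
```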
It would be safe to compile your whole kernel with `-mbmi -mbmi2 -mpopcnt`, as long as you don't care about it running on CPUs before Haswell / Excavator, or on Intel Pentium/Celeron CPUs. (Intel disables VEX-prefix decoding on their low-end CPUs below i3, at least before Ice Lake, so that means disabling BMI1 and BMI2, which use VEX encodings for integer instructions.)
But if you want to use runtime CPU detection to avoid executing them on CPUs that don't support them, you'll again need `__attribute__((target("bmi2")))` on some function. If you compiled a whole file with `-mbmi2`, GCC might decide to use `shlx` or `shrx` for some variable-count shift outside the CPU-detection block, for example.
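In user-space, the dispatch might look like this sketch using `__builtin_cpu_supports` (the kernel would use its own cpufeature checks instead); `pext_fallback` is a hypothetical scalar emulation:

```c
#include <immintrin.h>

__attribute__((target("bmi2")))
static unsigned long long pext_hw(unsigned long long v, unsigned long long m)
{
    return _pext_u64(v, m);        // compiles to a single pext instruction
}

static unsigned long long pext_fallback(unsigned long long v, unsigned long long m)
{
    unsigned long long out = 0, bit = 1;
    for (; m; m &= m - 1) {        // iterate over set bits of the mask, low to high
        if (v & m & -m)            // m & -m isolates the lowest set mask bit
            out |= bit;
        bit <<= 1;
    }
    return out;
}

unsigned long long pext_dispatch(unsigned long long v, unsigned long long m)
{
    if (__builtin_cpu_supports("bmi2"))
        return pext_hw(v, m);      // only reached on CPUs that have pext
    return pext_fallback(v, m);
}
```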
Related: