You can't safely compile a whole file with `-march=sandybridge` or other options that imply `-mavx` and `-msse`. That would let GCC use AVX instructions anywhere in any function in this file, including before `kernel_fpu_begin()` or after `kernel_fpu_end()`.¹ It might for example use `vpxor xmm0,xmm0,xmm0` / `vmovups [rsp], xmm0` to zero some stack memory for local vars, especially for a struct initializer. This would silently corrupt user-space FP/SIMD state.
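For example, something like this plausibly triggers it (a hypothetical function; with `-march=sandybridge`, GCC is free to pick SIMD zeroing here):

```c
/* Hypothetical example: built with -march=sandybridge, GCC may well zero
 * this struct with vpxor xmm0,xmm0,xmm0 + vmovups stores, silently
 * clobbering xmm0 even though the source never mentions SIMD. */
struct ctx { long a, b, c, d; };

void init_ctx(struct ctx *out)
{
    struct ctx c = {0};   // struct initializer: prime bait for SIMD zeroing
    *out = c;
}
```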
But with GCC/clang, you can't use intrinsics or `__builtin_ia32` functions for instructions that you haven't told the compiler the target supports: the compiler will refuse to emit such instructions. (MSVC and ICC follow a different design philosophy, where they don't generally optimize intrinsics, but do let you use them anywhere.)
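For example, with plain `gcc -O2` (no `-mavx2`), something like this is rejected at compile time instead of silently emitting `vpaddd`; the diagnostic quoted in the comment is GCC's usual wording, and clang's is similar:

```c
#include <immintrin.h>

void add8(__m256i *dst, const __m256i *a, const __m256i *b)
{
    *dst = _mm256_add_epi32(*a, *b);
    /* error: inlining failed in call to 'always_inline'
       '_mm256_add_epi32': target specific option mismatch */
}
```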
Inline asm is different: then you're directly printing text into the compiler's asm output, which it doesn't know about except what you tell it via constraints. (This is literally true for GCC, and clang with its built-in assembler allows the same bypassing of target options.) This is why kernel code normally uses inline asm instead of futzing with target options and intrinsics. (Also because it's usually for one hand-tuned loop.)
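A minimal sketch of that difference: this compiles even under `-mno-sse`, because GCC never parses the template string. The clobber list is the only way it learns anything, and getting that right is entirely on you:

```c
/* Sketch: the compiler pastes this text into its asm output without
 * checking that pxor/movups are SSE instructions it was told not to use.
 * Only safe to run between kernel_fpu_begin()/kernel_fpu_end(). */
static void zero16(void *p)
{
    asm volatile("pxor %%xmm0, %%xmm0\n\t"
                 "movups %%xmm0, (%0)"
                 : : "r"(p) : "xmm0", "memory");
}
```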
Footnote 1: Unless this file only contains functions that are called with `kernel_fpu_begin()` already done, i.e. you put your intrinsic-using functions in a separate file from the one that does `kernel_fpu_begin(); bar_avx(); kernel_fpu_end();`, as sketched below. But that's inconvenient, and GNU C has other ways of setting target ISA-extension options that work on a per-function basis.
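A sketch of that separate-file layout, with hypothetical file and function names:

```c
/* bar_avx.c -- compiled with -mavx2.  Every function in this file relies
 * on the caller having already done kernel_fpu_begin(). */
#include <immintrin.h>
void bar_avx2_file(int *p)
{
    __m256i v = _mm256_loadu_si256((__m256i*)p);
    _mm256_storeu_si256((__m256i*)p, _mm256_add_epi32(v, v));
}

/* caller.c -- compiled with the normal kernel flags (-mno-sse etc.) */
void kernel_fpu_begin(void);
void kernel_fpu_end(void);
void bar_avx2_file(int *p);

void caller(int *p)
{
    kernel_fpu_begin();
    bar_avx2_file(p);
    kernel_fpu_end();
}
```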
The safe way: `__attribute__((target("foo")))`
What you can safely do, I think, is write a function with `__attribute__((target("avx,sse")))` and only call it between `kernel_fpu_begin()` / `kernel_fpu_end()`. A function can't inline into a caller with different target options, so the call overhead is unavoidable: make sure the function contains a loop, rather than calling it from a tight loop, as in the sketch below.
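Something along these lines (a sketch; remainder handling omitted for brevity) amortizes that call overhead over a whole array per call:

```c
#include <immintrin.h>

__attribute__((target("avx2")))
static void add_arrays_avx2(int *dst, const int *src, unsigned long len)
{
    // Process the whole array per call (assumes len is a multiple of 8),
    // so the non-inlinable call and kernel_fpu_begin/end happen only once.
    for (unsigned long i = 0; i < len; i += 8) {
        __m256i a = _mm256_loadu_si256((const __m256i*)(dst + i));
        __m256i b = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dst + i), _mm256_add_epi32(a, b));
    }
}
```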
With modern GCC/clang at least (GCC7 and later), `#include <immintrin.h>` will define all the functions and types even if you include it when `__AVX__` is not defined. So you can just include it in a file compiled with `-mno-sse` like kernel code normally is, and then use SSE / AVX2 intrinsics in an AVX2 function. You don't need `#pragma GCC target("avx2,bmi2")` before the include, even though the header needs to define some inline functions that return `__m256i`.
With GCC6 and earlier, `#include <immintrin.h>` breaks with `-mno-sse -mno-mmx`, but `#pragma GCC target("sse,mmx,avx2,bmi2")` before it fixes that, getting all the necessary AVX2 types and functions defined: https://godbolt.org/z/M4sEs61EG. With early clang, I sometimes managed to get it to emit scalar emulations of SIMD intrinsics, but that's not helpful; use clang7 or later.
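i.e. something like this sketch, using push/pop so the pragma's options don't leak into the rest of the `-mno-sse` file:

```c
#pragma GCC push_options
#pragma GCC target("sse,mmx,avx2,bmi2")  // let the header define its inline functions
#include <immintrin.h>
#pragma GCC pop_options                  // back to the file-wide -mno-sse options
```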
Example on the Godbolt compiler explorer, with some extra comments about pragmas, and `__attribute__((target("avx2,bmi2,arch=haswell")))` apparently working on clang but not GCC, IDK why. I was hoping it would set tune options as well, because tune=generic makes poor choices: many GCC versions split unaligned 256-bit loads/stores. You can safely use `-mtune=haswell` or `-mtune=intel` for the whole file if necessary. (That should be not bad for Zen 1, and probably good for Zen 2/3.)
```c
void kernel_fpu_begin(void); // stub declarations for a stand-alone test file
void kernel_fpu_end(void);
static void bar_avx(int *p, unsigned long len);

void foo(int *p, unsigned long len)
{
    kernel_fpu_begin();
    bar_avx(p, len);   // can't inline: different target options than this caller
    kernel_fpu_end();
}

#include <immintrin.h>

__attribute__((target("avx2,bmi2")))   // works with both GCC and clang
static void bar_avx(int *p, unsigned long len)
{
    __m256i v = _mm256_loadu_si256((__m256i*)p);
    v = _mm256_slli_epi32(v, 2);       // left-shift 8 ints by 2
    _mm256_storeu_si256((__m256i*)p, v);
    p[10] = _pext_u64(len, len);       // collect set bits at the bottom
}
```
Compiles to the following asm with gcc11.2 `-O2 -mno-vzeroupper -mno-avx -mno-sse -mno-mmx -Wall -mcmodel=kernel -ffreestanding`. (Not exactly what Linux uses, but that does fully disable all MMX, SSE, and AVX code-gen. `-mno-avx` is probably redundant with `-mno-sse`.)
```asm
bar_avx:
        vmovdqu ymm1, YMMWORD PTR [rdi]
        pext    rsi, rsi, rsi
        mov     DWORD PTR [rdi+40], esi
        vpslld  ymm0, ymm1, 2
        vmovdqu YMMWORD PTR [rdi], ymm0
      # a vzeroupper would be here without -mno-vzeroupper.  Not needed anyway,
      # because kernel_fpu_end is about to xrstor, replacing the current YMM state.
        ret

foo:
        push    r12
        mov     r12, rsi
        push    rbp
        mov     rbp, rdi          # save the incoming args in call-preserved regs
        sub     rsp, 8            # align the stack
        call    kernel_fpu_begin
        mov     rsi, r12
        mov     rdi, rbp
        call    bar_avx
        add     rsp, 8            # epilogue: restore stack and saved regs
        pop     rbp
        pop     r12
        jmp     kernel_fpu_end    # tailcall
```
All usage of AVX instructions is bracketed between kernel_fpu_begin/end.
Of course, I didn't do anything in `foo()` that would tempt the compiler to use SIMD instructions, like zero-initializing an array or struct. But the fact that `bar_avx()` isn't inlined is pretty clear evidence that GCC and clang keep the function separate because of its different target options. They don't know how to optimize a single function in which different blocks have different target options, so they have to not inline. `bar_avx()` is quite small and would definitely inline normally, especially with it being `static` so no stand-alone copy would need to be emitted.
Integer intrinsics:
You can safely use intrinsics that only operate on general-purpose integer registers, like `_mm_popcnt_u32` or BMI2 `_pdep_u64`, as long as you enable the appropriate CPU features, `-mpopcnt` and `-mbmi2` respectively. Just be sure not to indirectly enable SSE, as `-msse4.2` or `-march=haswell` would.
You don't even need kernel_fpu_begin/end around those, because they only use general-purpose integer registers, the same as instructions like `add` and `imul`.
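For instance, this sketch needs only `-mpopcnt -mbmi2`, with no FPU-state bracketing:

```c
#include <immintrin.h>

/* Sketch: compile with -mpopcnt -mbmi2 but NOT -msse4.2 or -march=haswell.
 * Everything stays in general-purpose registers, so no kernel_fpu_begin(). */
int count_bits(unsigned x)
{
    return _mm_popcnt_u32(x);      // popcnt r32, r32
}

unsigned long long spread_bits(unsigned long long v, unsigned long long mask)
{
    return _pdep_u64(v, mask);     // pdep: scatter low bits of v to mask positions
}
```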
It would be safe to compile your whole kernel with `-mbmi -mbmi2 -mpopcnt`, as long as you don't care about it running on CPUs before Haswell / Excavator, or on Intel Pentium/Celeron CPUs. (Intel disables VEX-prefix decoding on their low-end CPUs below i3, at least before Ice Lake, so that means disabling BMI1 and BMI2, which use VEX encodings for integer instructions.)
But if you want to use runtime CPU detection to avoid executing them on CPUs that don't support them, you'll again need `__attribute__((target("bmi2")))` on some function. If you compiled a whole file with `-mbmi2`, GCC might decide to use `shlx` or `shrx` for some variable-count shift outside the CPU-detection block, for example.
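In user-space, the dispatch might look like this sketch using `__builtin_cpu_supports` (the kernel would use its own cpufeature checks instead); `pext_fallback` is a hypothetical scalar emulation:

```c
#include <immintrin.h>

__attribute__((target("bmi2")))
static unsigned long long pext_hw(unsigned long long v, unsigned long long m)
{
    return _pext_u64(v, m);        // compiles to a single pext instruction
}

static unsigned long long pext_fallback(unsigned long long v, unsigned long long m)
{
    unsigned long long out = 0, bit = 1;
    for (; m; m &= m - 1) {        // iterate over set bits of the mask, low to high
        if (v & m & -m)            // m & -m isolates the lowest set mask bit
            out |= bit;
        bit <<= 1;
    }
    return out;
}

unsigned long long pext_dispatch(unsigned long long v, unsigned long long m)
{
    if (__builtin_cpu_supports("bmi2"))
        return pext_hw(v, m);      // only reached on CPUs that have pext
    return pext_fallback(v, m);
}
```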
Related: