It turns out I needed two changes:
First, it appears to be best practice to include x86intrin.h rather than more specific headers. Which header to use is compiler specific, and is covered in much better detail in:
Header files for x86 SIMD intrinsics
Importantly, you would need a different include if not using gcc.
Second, compiler options also need to be enabled. For gcc these are detailed in:
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
That said, the documentation for many of the flags is lacking.
As my goal is to distribute a compiled binary, I wanted to avoid -march=native, which targets whatever CPU the build machine happens to have.
Most of the "other" intrinsics I'm interested in are bit manipulation related.
Ye Olde Wikipedia has a decent writeup of important bit manipulation intrinsic groups like bmi2:
https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets
I need bmi2 for BZHI (the instruction), or _bzhi_u32 (the C intrinsic).
Thus I can get what I want with something like:
-mavx2 -mbmi2
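As a rough sketch of how the flag and the intrinsic fit together: gcc defines the __BMI2__ macro when -mbmi2 is enabled, so the code can check at compile time whether _bzhi_u32 is available. The function name low_bits and the portable fallback are my own illustration, not something from the gcc or Intel documentation.

```c
#include <stdint.h>

#ifdef __BMI2__
#include <x86intrin.h>   /* gcc's catch-all header for x86 intrinsics */
#endif

/* Hypothetical helper: keep the low n bits of x and zero the rest.
 * Like BZHI, only the low 8 bits of n are used, and x is returned
 * unchanged when that index is 32 or more. */
uint32_t low_bits(uint32_t x, uint32_t n)
{
#ifdef __BMI2__
    /* Built with -mbmi2: this compiles down to a BZHI instruction. */
    return _bzhi_u32(x, n);
#else
    /* Built without -mbmi2: plain C with the same result. */
    uint32_t idx = n & 0xFFu;
    return idx >= 32 ? x : (x & ((1u << idx) - 1u));
#endif
}
```

Building with something like gcc -O2 -mavx2 -mbmi2 example.c (example.c being a placeholder file name) should take the intrinsic path. Leaving out -mbmi2 takes the fallback, which avoids the compile error you would otherwise get from calling _bzhi_u32 without the flag.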
Using -mbmi2 seemed to be sufficient to get things like bmi1 and abm (see the linked Wikipedia page for definitions), although I don't see any mention of this in the linked gcc page, so I might be wrong about this ...
EDIT: It seems that enabling bmi2 does not enable bmi1 and abm; I might have been using a __builtin call. I later needed to add -mabm and -mbmi explicitly to get the instructions I wanted.
As Peter Cordes suggested, it is probably better to target Haswell with -march=haswell as a starting point and then add additional flags as needed. Haswell (2013) was the first Intel processor with AVX2, so in my mind -march=haswell is basically saying: I expect you to have a computer from 2013 or newer.
Also, based on some quick reading, it sounds like using a __builtin enables the necessary flags (a future question for SO), although there does not appear to be a 1:1 correspondence between intrinsics and builtins. More specifically, not all intrinsics seem to have a builtin equivalent, so setting the flags seems to be necessary; you can't just always use builtins and not worry about the flags. It is also useful to know which instruction sets are being used, for distribution purposes, as it seems like bmi2 could still be missing on a substantial portion of computers (e.g. AMD only has it from 2015 onward, I think).
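Since bmi2 may be missing on some of the machines a distributed binary ends up on, one option is to check for it at run time instead of enabling -mbmi2 for the whole program. Below is a minimal sketch, assuming gcc (or a compatible compiler such as clang) on x86. It relies on two gcc extensions: the target function attribute, which enables bmi2 for a single function, and __builtin_cpu_supports, which queries the running CPU. The function names are hypothetical, and I have not checked this against every compiler version.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <x86intrin.h>
#define HAVE_X86_DISPATCH 1

/* bmi2 is enabled for this one function only, so the rest of the
 * file can be built without -mbmi2. */
__attribute__((target("bmi2")))
static uint32_t keep_low_bits_bmi2(uint32_t x, uint32_t n)
{
    return _bzhi_u32(x, n);
}
#endif

/* Plain C fallback with the same semantics as BZHI. */
static uint32_t keep_low_bits_portable(uint32_t x, uint32_t n)
{
    uint32_t idx = n & 0xFFu;
    return idx >= 32 ? x : (x & ((1u << idx) - 1u));
}

/* Hypothetical entry point: picks an implementation at run time. */
uint32_t keep_low_bits(uint32_t x, uint32_t n)
{
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("bmi2"))
        return keep_low_bits_bmi2(x, n);
#endif
    return keep_low_bits_portable(x, n);
}
```

The check on every call is wasteful for something as small as BZHI, so in practice you would probably dispatch at a coarser level (e.g. once per hot loop), but it shows the idea.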
It's still not clear to me why just using the include specified in the Intel documentation doesn't work, but this info gets me 99% of the way to where I want to be.