(Posting an answer here because "Header files for x86 SIMD intrinsics" has out-of-date answers that suggest including individual header files.)
immintrin.h is portable across the major x86 compilers (GCC, clang, ICC, MSVC). It includes all Intel SIMD intrinsics, plus some scalar extensions like _pdep_u32 that are available with -mbmi2 or a -march= that includes it. (For AMD SSE4a and XOP (Bulldozer-family only, dropped for Zen), you need to include a different header as well.)
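As a minimal sketch of the "one header" approach: the snippet below includes only immintrin.h and uses SSE2 intrinsics, which are baseline on x86-64, so it should need no extra -m options there. The function name add4_i32 is made up for illustration.

```c
#include <immintrin.h>  /* single header for the Intel SIMD intrinsics */
#include <stdint.h>

/* Hypothetical helper: element-wise add of four 32-bit integers.
 * Uses only SSE2, which every x86-64 CPU supports. */
static void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    /* Unaligned loads, so the caller's arrays need no special alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

On 32-bit x86 with GCC or clang you would likely need -msse2 (or a suitable -march=) for this to compile.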
The only reason I can think of for including <emmintrin.h> specifically would be if you're using MSVC and want to leave intrinsics undefined for ISA extensions you don't want to depend on.
GCC's model of requiring you to enable extensions before you can use intrinsics for them means the compiler does this checking for you, so you can just #include <immintrin.h> but still get an error if you try to use _mm_shuffle_epi8 (pshufb) without -mssse3.
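One way to live with that model, sketched here under the assumption that you build with GCC or clang (which define the __SSSE3__ macro when SSSE3 is enabled, e.g. via -mssse3): include immintrin.h unconditionally and only call _mm_shuffle_epi8 when the extension is enabled at compile time. The function name reverse16 is hypothetical.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: write the 16 bytes of src to dst in reverse order.
 * src and dst must not overlap. */
static void reverse16(const uint8_t *src, uint8_t *dst)
{
#ifdef __SSSE3__
    /* _mm_set_epi8 takes the highest byte first, so control byte i
     * ends up as 15 - i: output byte i is input byte 15 - i. */
    const __m128i ctrl = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(v, ctrl));
#else
    /* Scalar fallback when SSSE3 was not enabled at compile time. */
    for (int i = 0; i < 16; i++)
        dst[i] = src[15 - i];
#endif
}
```

Without the #ifdef, GCC would reject the _mm_shuffle_epi8 call unless SSSE3 is enabled, which is exactly the check described above. MSVC does not define __SSSE3__, so there this sketch would always take the scalar path.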
Don't use compilers older than GCC 4.4. They're obsolete and will typically generate slower code, especially for modern CPUs that didn't exist when their tuning settings were being decided.
gcc/clang's x86intrin.h vs. MSVC's intrin.h are only useful if you need some extra non-SIMD intrinsics, like MSVC's _BitScanReverse(), that aren't always portable across compilers. This mostly means integer rotate / bit-scan intrinsics that are baseline (unlike lzcnt, BMI1 tzcnt, or BMI2 rorx) but hard or impossible to express in C in a way that compilers will recognize and turn back from a loop into a single instruction.
Intel documents some of those as being available in immintrin.h in their intrinsics guide, but gcc/clang and MSVC actually have them in their x86intrin.h or intrin.h headers, respectively.
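To illustrate the portability problem with a bit-scan, here is a rough sketch of a wrapper (highest_set_bit is a made-up name). It assumes MSVC's _BitScanReverse() from intrin.h on one side and the GCC/clang builtin __builtin_clz() on the other; the builtin needs no header at all, so this sketch sidesteps x86intrin.h rather than relying on a particular intrinsic name being declared there.

```c
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>   /* declares _BitScanReverse on MSVC */
#endif

/* Hypothetical wrapper: index (0..31) of the highest set bit.
 * The result is undefined for x == 0 on both paths. */
static inline int highest_set_bit(uint32_t x)
{
#ifdef _MSC_VER
    unsigned long idx;
    _BitScanReverse(&idx, x);   /* stores the bit index in idx */
    return (int)idx;
#else
    return 31 - __builtin_clz(x);  /* GCC/clang builtin: count leading zeros */
#endif
}
```

Both paths can compile to a single bsr (or an lzcnt when that is enabled), which is the kind of code a plain C loop often won't give you.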
See "How to get the CPU cycle count in x86_64 from C++?" for an example of using #ifdef _MSC_VER to choose the right header to define uint64_t __rdtsc(void) and __rdtscp().
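The header selection can be sketched roughly like this (not copied from the linked answer; read_tsc is a hypothetical wrapper name):

```c
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>      /* MSVC: declares __rdtsc / __rdtscp */
#else
#include <x86intrin.h>   /* GCC / clang: declares __rdtsc / __rdtscp */
#endif

/* Hypothetical wrapper: read the time-stamp counter.
 * The value counts reference cycles, not core clock cycles. */
static inline uint64_t read_tsc(void)
{
    return __rdtsc();
}
```

This only picks the header; serializing or fencing around the read is a separate question that this sketch does not cover.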