Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.
AVX provides a new encoding (VEX) for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width (256-bit) ymm vector registers and some new instructions for manipulating them. The floating-point vector instructions have 256b versions in AVX, but 256b integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles with finer than 128-bit granularity (such as VPERMPS and VPERMPD).
Mixing AVX (VEX-encoded) and non-AVX (old SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs, to avoid a major performance problem. This has led to several performance questions where this was the answer.
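As a hedged illustration of the VZEROUPPER point, the C sketch below uses 256-bit intrinsics and then calls _mm256_zeroupper() before returning to a caller that might run legacy-SSE code. The function name scale_floats_avx and the AVX_TARGET macro are assumptions made for this example (the macro only lets GCC/clang build the function without -mavx). Compilers that target AVX usually insert VZEROUPPER on their own, so an explicit call mostly matters for hand-written assembly or unusual build setups.

```c
#include <immintrin.h>
#include <stddef.h>

/* Assumption for this sketch: enable AVX per function on GCC/clang. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: multiply n floats in place by factor. */
AVX_TARGET
void scale_floats_avx(float *data, size_t n, float factor)
{
    const __m256 vfactor = _mm256_set1_ps(factor);
    size_t i = 0;

    /* Main loop: 8 floats per iteration in one ymm register. */
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, vfactor));
    }

    /* Scalar tail for the remaining 0..7 elements. */
    for (; i < n; ++i)
        data[i] *= factor;

    /* Clear the upper halves of all ymm registers before returning to
       code that may use the legacy (non-VEX) SSE encoding. */
    _mm256_zeroupper();
}
```

Calling scale_floats_avx(buf, 10, 2.0f) on a 10-element buffer processes the first 8 elements as one vector and the last 2 in the scalar tail.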
Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
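To make the two-lane behaviour concrete, here is a small C sketch that runs _mm256_unpacklo_ps (the intrinsic for the 256-bit form of UNPCKLPS) on two 8-float inputs. The function name unpacklo_demo and the AVX_TARGET macro are assumptions for this example, not part of any library.

```c
#include <immintrin.h>

/* Assumption for this sketch: enable AVX per function on GCC/clang. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: out = unpacklo(a, b), stored back to memory. */
AVX_TARGET
void unpacklo_demo(const float a[8], const float b[8], float out[8])
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);

    /* Interleaves the low two elements of EACH 128-bit lane:
       lane 0 -> a0 b0 a1 b1, lane 1 -> a4 b4 a5 b5. */
    _mm256_storeu_ps(out, _mm256_unpacklo_ps(va, vb));
}
```

With a = {0,1,2,3,4,5,6,7} and b = {10,11,12,13,14,15,16,17}, out becomes {0,10,1,11,4,14,5,15}. Treating the register as one long vector would instead suggest {0,10,1,11,2,12,3,13}, which is not what the instruction does.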
See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.
See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.
Interesting Q&As / FAQs:
Why does my code with AVX crash with segfault/access violation? Most likely the data is not aligned where alignment is required. 256-bit memory operands (__m256* types) require 32-byte alignment and 512-bit memory operands (__m512* types) require 64-byte alignment, except for explicitly unaligned operations.
How to solve the 32-byte-alignment issue for AVX load/store operations? explains alignas, aligned_alloc, _aligned_malloc, C++17 aligned new, etc., and use of unaligned loadu / storeu intrinsics.
Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE work. Includes in-lane vs. lane-crossing for AVX.
Do 128bit cross lane operations in AVX512 give better performance? In-lane can still be lower latency, but shuffle throughput is often the bigger problem. Tricks like unaligned / overlapping loads can reduce the number of shuffles.
Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) AVX has to be supported by the OS, not just by the CPU. Fortunately, there is a way to detect that support in an OS-independent way.
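For the alignment questions in the list above, the following C11 sketch contrasts aligned and unaligned 256-bit loads/stores. It assumes a toolchain that provides alignas (via <stdalign.h>) and aligned_alloc; on MSVC, _aligned_malloc would take the place of aligned_alloc. The names add8_aligned, add8_unaligned, alignment_demo and the AVX_TARGET macro are hypothetical.

```c
#include <immintrin.h>
#include <stdalign.h>
#include <stdlib.h>

/* Assumption for this sketch: enable AVX per function on GCC/clang. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* All three pointers must be 32-byte aligned; otherwise the aligned
   load/store forms may fault. */
AVX_TARGET
void add8_aligned(float *dst, const float *a, const float *b)
{
    _mm256_store_ps(dst, _mm256_add_ps(_mm256_load_ps(a), _mm256_load_ps(b)));
}

/* No alignment requirement on any pointer. */
AVX_TARGET
void add8_unaligned(float *dst, const float *a, const float *b)
{
    _mm256_storeu_ps(dst, _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
}

/* Returns 1 if both variants produce a[i] + b[i], 0 otherwise. */
int alignment_demo(void)
{
    alignas(32) float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    alignas(32) float b[8] = {10, 11, 12, 13, 14, 15, 16, 17};

    /* aligned_alloc: the size must be a multiple of the alignment. */
    float *heap = aligned_alloc(32, 8 * sizeof(float));
    if (heap == NULL)
        return 0;
    add8_aligned(heap, a, b);

    /* buf + 1 is generally not 32-byte aligned, so use the unaligned form. */
    float buf[9];
    add8_unaligned(buf + 1, a, b);

    int ok = 1;
    for (int i = 0; i < 8; ++i) {
        if (heap[i] != a[i] + b[i] || buf[i + 1] != a[i] + b[i])
            ok = 0;
    }
    free(heap);
    return ok;
}
```

Passing buf + 1 to add8_aligned instead would be the kind of bug that shows up as a segfault or access violation.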
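For the shuffle-control item in the list above, this C sketch shows how _MM_SHUFFLE builds an immediate and how an in-lane shuffle differs from a lane-crossing one. It reverses 8 floats in two steps: _mm256_permute_ps reverses within each 128-bit lane, then _mm256_permute2f128_ps swaps the two lanes. The function names and the AVX_TARGET macro are assumptions for this example.

```c
#include <immintrin.h>

/* Assumption for this sketch: enable AVX per function on GCC/clang. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* In-lane only: reverses the 4 floats inside each 128-bit lane. */
AVX_TARGET
void reverse_within_lanes(const float in[8], float out[8])
{
    __m256 v = _mm256_loadu_ps(in);
    /* _MM_SHUFFLE lists selectors from the highest result element to the
       lowest, so (0, 1, 2, 3) means result[0] = src[3], ..., result[3] = src[0].
       The same selectors are applied to both lanes. */
    _mm256_storeu_ps(out, _mm256_permute_ps(v, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* In-lane reverse followed by a lane-crossing swap: full 8-element reverse. */
AVX_TARGET
void reverse_all(const float in[8], float out[8])
{
    __m256 v = _mm256_loadu_ps(in);
    __m256 r = _mm256_permute_ps(v, _MM_SHUFFLE(0, 1, 2, 3));
    /* Immediate 0x01: low result lane = high source lane,
       high result lane = low source lane. */
    _mm256_storeu_ps(out, _mm256_permute2f128_ps(r, r, 0x01));
}
```

For the input {0,1,2,3,4,5,6,7}, reverse_within_lanes gives {3,2,1,0,7,6,5,4} and reverse_all gives {7,6,5,4,3,2,1,0}.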
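For the last item above, one commonly described OS-independent check combines CPUID with XGETBV: the CPU must report AVX and OSXSAVE, and the OS must have enabled both SSE and AVX state in XCR0. The sketch below assumes GCC or clang on x86 (it relies on <cpuid.h> and inline assembly); the names read_xcr0 and avx_usable are hypothetical. MSVC would use its __cpuid and _xgetbv intrinsics instead.

```c
#include <cpuid.h>   /* __get_cpuid (GCC/clang on x86 only) */
#include <stdint.h>

/* Read extended control register 0 with XGETBV.
   Only valid when CPUID reports OSXSAVE. */
static uint64_t read_xcr0(void)
{
    uint32_t eax, edx;
    __asm__ volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    return ((uint64_t)edx << 32) | eax;
}

/* Returns 1 if AVX instructions can be used, 0 otherwise. */
int avx_usable(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    const unsigned int osxsave_bit = 1u << 27; /* OS uses XSAVE/XGETBV */
    const unsigned int avx_bit     = 1u << 28; /* CPU implements AVX */
    if ((ecx & (osxsave_bit | avx_bit)) != (osxsave_bit | avx_bit))
        return 0;

    /* XCR0 bit 1 = SSE (xmm) state, bit 2 = AVX (ymm) state:
       both must be enabled by the OS. */
    return (read_xcr0() & 0x6) == 0x6;
}
```

A program would typically call avx_usable() once at startup and pick an AVX or a fallback code path based on the result.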