AVX512 is a set of instruction set extensions for x86 featuring 512-bit SIMD vectors. Relative to AVX, it widens vectors to 512 bits, doubles the number of vector registers to 32, and adds new functionality such as masking.
Wikipedia's AVX-512 article is kept up to date with lists of the sub-extensions, and a handy table of which CPUs support which extensions: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
Other resources:
- Overview: Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Instructions
- Slides from a talk by Kirill Yukhin, introducing the new features of AVX-512 like masking and embedded-rounding. (With Intel-syntax asm examples.) Includes some use-case examples like conflict-detection for histograms using gather/scatter.
- x86 tag wiki for x86 performance info, especially https://uops.info/ and https://agner.org/optimize/
- sse tag wiki for guides to x86 SIMD in general
AVX512 is broken into a number of sub-extensions, including the following. All AVX512 implementations are required to support AVX512-F; the rest are optional.
- AVX512-F (Foundation)
- AVX512-CD (Conflict Detection)
- AVX512-ER (Exponential and Reciprocal)
- AVX512-PF (Prefetch)
- AVX512-BW (Byte and Word instructions)
- AVX512-DQ (Double-word and quad-word instructions)
- AVX512-VL (Vector Length)
- AVX512-IFMA (52-bit Integer Multiply-Add)
- AVX512-VBMI (Vector Byte-Manipulation)
- AVX512-VPOPCNTDQ (Vector Population Count)
- AVX512-4FMAPS (4 x Fused Multiply-Add Single Precision)
- AVX512-4VNNIW (4 x Neural Network Instructions)
- AVX512-VBMI2 (Vector Byte-Manipulation 2)
- AVX512-VNNI (Vector Neural Network Instructions)
- AVX512-BITALG (Bit Algorithms)
- AVX512-VAES (Vector AES Instructions)
- AVX512-GFNI (Galois Field New Instructions)
- AVX512-VPCLMULQDQ (Vector Carry-less Multiply)
Supporting Processors:
- Intel Xeon Phi Knights Landing: AVX512-(F, CD, ER, PF)
- Intel Xeon Phi Knights Mill: AVX512-(F, CD, ER, PF, VPOPCNTDQ, 4FMAPS, 4VNNIW)
- Intel Skylake Xeon: AVX512-(F, CD, BW, DQ, VL)
- Intel Cannonlake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI)
- Intel Ice Lake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI, VPOPCNTDQ, VBMI2, VNNI, BITALG, VAES, GFNI, VPCLMULQDQ)
Foundation (AVX512-F):
All implementations of AVX512 are required to support AVX512-F. AVX512-F expands AVX by doubling the vector width to 512 bits and doubling the number of vector registers to 32. It also provides per-element masking by means of 8 opmask registers (k0-k7).
AVX512-F only supports operations on 32-bit and 64-bit elements, and only operates on the zmm (512-bit) registers.
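As a rough sketch of how the masking looks from C intrinsics (function and variable names here are just illustrative; compile with something like -mavx512f):

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch: add 1 to every element that is greater than 0, leaving the
   other elements unchanged. Assumes n is a multiple of 16 for brevity. */
void add_one_to_positives(int *dst, size_t n) {
    const __m512i ones = _mm512_set1_epi32(1);
    const __m512i zero = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 16) {
        __m512i v   = _mm512_loadu_si512(&dst[i]);
        __mmask16 m = _mm512_cmpgt_epi32_mask(v, zero); /* per-element predicate in an opmask register */
        v = _mm512_mask_add_epi32(v, m, v, ones);       /* merge-masked add: unselected lanes keep v */
        _mm512_storeu_si512(&dst[i], v);
    }
}
```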
Conflict Detection (AVX512-CD):
AVX512-CD aids vectorization by providing instructions to detect data conflicts, such as duplicate indices within a vector used for a gather/scatter.
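A minimal sketch of the idea with intrinsics (illustrative only): check whether a vector of scatter indices contains duplicates, which would make a naive vectorized histogram update incorrect.

```c
#include <immintrin.h>

/* Sketch: returns non-zero if any of the 16 indices collide (vpconflictd). */
int indices_have_conflict(const int idx[16]) {
    __m512i v = _mm512_loadu_si512(idx);
    /* Each element receives a bitmask of earlier lanes holding the same value. */
    __m512i conflicts = _mm512_conflict_epi32(v);
    /* Any non-zero element means at least one duplicate index. */
    return _mm512_test_epi32_mask(conflicts, conflicts) != 0;
}
```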
Exponential and Reciprocal (AVX512-ER):
AVX512-ER provides instructions for computing reciprocals, reciprocal square roots, and base-2 exponentials with increased accuracy. These aid in the fast computation of transcendental functions.
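These instructions exist only on the Xeon Phi parts (Knights Landing / Knights Mill); a sketch, assuming a compiler that still targets AVX512-ER:

```c
#include <immintrin.h>

/* Sketch: approximate 1/x to ~28 bits and 2^x to ~23 bits with single instructions
   (vrcp28ps / vexp2ps, AVX512-ER, Xeon Phi only). */
__m512 approx_recip(__m512 x) { return _mm512_rcp28_ps(x); }
__m512 approx_exp2(__m512 x)  { return _mm512_exp2a23_ps(x); }
```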
Prefetch (AVX512-PF):
AVX512-PF provides instructions for vector gather/scatter prefetching.
Byte and Word (AVX512-BW):
AVX512-BW extends AVX512-F by adding support for byte and word (8/16-bit) operations.
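For example (a sketch), saturating byte adds become available on 512-bit vectors:

```c
#include <immintrin.h>

/* Sketch: unsigned saturating add of 64 bytes at a time (vpaddusb, requires AVX512-BW). */
__m512i add_sat_bytes(__m512i a, __m512i b) {
    return _mm512_adds_epu8(a, b);
}
```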
Double-word and Quad-word (AVX512-DQ):
AVX512-DQ extends AVX512-F by providing more instructions for 32-bit and 64-bit data.
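One example is the full 64-bit integer multiply, which AVX512-F lacks; a sketch:

```c
#include <immintrin.h>

/* Sketch: low 64 bits of each 64x64-bit product (vpmullq, requires AVX512-DQ). */
__m512i mul_low_64(__m512i a, __m512i b) {
    return _mm512_mullo_epi64(a, b);
}
```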
Vector-Length (AVX512-VL):
AVX512-VL extends AVX512-F by allowing the full AVX512 functionality to operate on xmm (128-bit) and ymm (256-bit) registers, as opposed to only zmm. This includes masking as well as the increased register count of 32.
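A sketch of what this enables: the masked forms of instructions become usable on 256-bit (ymm) and 128-bit (xmm) vectors.

```c
#include <immintrin.h>

/* Sketch: merge-masked 32-bit add on a ymm register (requires AVX512-F + AVX512-VL). */
__m256i masked_add_256(__m256i src, __mmask8 k, __m256i a, __m256i b) {
    return _mm256_mask_add_epi32(src, k, a, b);
}
```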
52-bit Integer Multiply-Add (AVX512-IFMA):
AVX512-IFMA provides fused multiply-add instructions for 52-bit integers. (Speculation: likely derived from the floating-point FMA hardware)
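A sketch of the corresponding intrinsic:

```c
#include <immintrin.h>

/* Sketch: acc += low 52 bits of (a[51:0] * b[51:0]) in each 64-bit lane
   (vpmadd52luq; vpmadd52huq gives the high 52 bits). */
__m512i madd52_low(__m512i acc, __m512i a, __m512i b) {
    return _mm512_madd52lo_epu64(acc, a, b);
}
```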
Vector Byte-Manipulation (AVX512-VBMI):
AVX512-VBMI provides instructions for byte-permutation. It extends the existing permute instructions to byte-granularity.
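The headline instruction is vpermb, a byte shuffle across the whole 512-bit vector; a sketch:

```c
#include <immintrin.h>

/* Sketch: select 64 arbitrary bytes from table according to idx (vpermb, AVX512-VBMI). */
__m512i shuffle_bytes(__m512i table, __m512i idx) {
    return _mm512_permutexvar_epi8(idx, table);
}
```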
Vector Population Count (AVX512-VPOPCNTDQ)
A vectorized version of the popcnt instruction for 32-bit and 64-bit elements.
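A sketch with the corresponding intrinsic:

```c
#include <immintrin.h>

/* Sketch: per-element population count of eight 64-bit lanes (vpopcntq, AVX512-VPOPCNTDQ). */
__m512i popcnt_u64x8(__m512i v) {
    return _mm512_popcnt_epi64(v);
}
```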
4 x Fused Multiply-Add Single Precision (AVX512-4FMAPS)
AVX512-4FMAPS provides instructions that perform 4 consecutive single-precision FMAs.
Neural Network Instructions (AVX512-4VNNIW)
Specialized instructions on 16-bit integers for Neural Networks. These follow the same "4 consecutive" op instruction format as AVX512-4FMAPS.
Vector Byte-Manipulation 2 (AVX512-VBMI2)
Extends AVX512-VBMI by adding compress/expand support at byte and word granularity.
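A sketch of a byte-granular compress, which packs the selected bytes toward the low end of the vector:

```c
#include <immintrin.h>

/* Sketch: keep only the bytes selected by the mask, packed together, rest zeroed
   (vpcompressb, AVX512-VBMI2). */
__m512i keep_selected_bytes(__m512i v, __mmask64 keep) {
    return _mm512_maskz_compress_epi8(keep, v);
}
```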
Neural Network Instructions (AVX512-VNNI)
Specialized instructions for Neural Networks. This is the desktop/Xeon version of AVX512-4VNNIW on Knights Mill Xeon Phi.
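One of its core instructions is vpdpbusd, an unsigned-byte by signed-byte dot-product that accumulates into 32-bit lanes; a sketch:

```c
#include <immintrin.h>

/* Sketch: for each 32-bit lane, acc += sum of 4 products of unsigned bytes from a
   with signed bytes from b (vpdpbusd, AVX512-VNNI). */
__m512i dot_accumulate(__m512i acc, __m512i a_u8, __m512i b_s8) {
    return _mm512_dpbusd_epi32(acc, a_u8, b_s8);
}
```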
Bit Algorithms (AVX512-BITALG)
Extends AVX512-VPOPCNTDQ to 8-bit and 16-bit elements. Adds additional bit-manipulation instructions.
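A sketch of the per-byte popcount:

```c
#include <immintrin.h>

/* Sketch: population count of each of the 64 bytes (vpopcntb, AVX512-BITALG). */
__m512i popcnt_bytes(__m512i v) {
    return _mm512_popcnt_epi8(v);
}
```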
Vector AES Instructions (AVX512-VAES)
Extends the existing AES-NI instructions to 512-bit width.
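A sketch: one AES encryption round applied to four independent 128-bit blocks packed into a zmm register.

```c
#include <immintrin.h>

/* Sketch: vaesenc on a 512-bit vector processes one round for 4 blocks in parallel (VAES). */
__m512i aes_round_x4(__m512i blocks, __m512i round_keys) {
    return _mm512_aesenc_epi128(blocks, round_keys);
}
```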
Galois Field Arithmetic (AVX512-GFNI)
Byte-wise arithmetic in the finite field GF(2^8) (the field used by AES): multiplication and affine transformations.
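A sketch of the byte-wise multiply in GF(2^8):

```c
#include <immintrin.h>

/* Sketch: multiply corresponding bytes as elements of GF(2^8), the AES field
   (vgf2p8mulb with an EVEX encoding). */
__m512i gf_mul_bytes(__m512i a, __m512i b) {
    return _mm512_gf2p8mul_epi8(a, b);
}
```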
Vector Carry-less Multiply (AVX512-VPCLMULQDQ)
Vectorized version of the pclmulqdq instruction.
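A sketch: as with the legacy instruction, an immediate selects which 64-bit halves of each 128-bit lane get multiplied.

```c
#include <immintrin.h>

/* Sketch: carry-less multiply of the low 64-bit halves within each 128-bit lane
   (vpclmulqdq; imm 0x00 selects the low qwords of both operands). */
__m512i clmul_low(__m512i a, __m512i b) {
    return _mm512_clmulepi64_epi128(a, b, 0x00);
}
```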