Questions tagged [avx512]

AVX512 is Intel's next generation of SIMD instructions that widens vectors to 512-bit, and adds new functionality (masking) and more vector registers.

AVX512 is a set of instruction set extensions for x86 that features 512-bit SIMD vectors.

Wikipedia's AVX-512 article is kept up to date with lists of the sub-extensions, and a handy table of which CPUs support which extensions: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

AVX512 is divided into sub-extensions, including those listed below. All AVX512 implementations are required to support AVX512-F; the rest are optional.

  • AVX512-F (Foundation)
  • AVX512-CD (Conflict Detection)
  • AVX512-ER (Exponential and Reciprocal)
  • AVX512-PF (Prefetch)
  • AVX512-BW (Byte and Word instructions)
  • AVX512-DQ (Double-word and quad-word instructions)
  • AVX512-VL (Vector Length)
  • AVX512-IFMA (52-bit Integer Multiply-Add)
  • AVX512-VBMI (Vector Byte-Manipulation)
  • AVX512-VPOPCNT (Vector Population Count)
  • AVX512-4FMAPS (4 x Fused Multiply-Add Single Precision)
  • AVX512-4VNNIW (4 x Neural Network Instructions)
  • AVX512-VBMI2 (Vector Byte-Manipulation 2)
  • AVX512-VNNI (Vector Neural Network Instructions)
  • AVX512-BITALG (Bit Algorithms)
  • AVX512-VAES (Vector AES Instructions)
  • AVX512-VGFI (Galois Field Arithmetic; officially named GFNI)
  • AVX512-VPCLMULQ (Vector Carry-less Multiply)

Supporting Processors:

  • Intel Xeon Phi Knights Landing: AVX512-(F, CD, ER, PF)
  • Intel Xeon Phi Knights Mill: AVX512-(F, CD, ER, PF, VPOPCNT, 4FMAPS, 4VNNIW)
  • Intel Skylake Xeon: AVX512-(F, CD, BW, DQ, VL)
  • Intel Cannonlake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI)
  • Intel Ice Lake: AVX512-(F, CD, BW, DQ, VL, IFMA, VBMI, VPOPCNT, VBMI2, VNNI, BITALG, VAES, VGFI, VPCLMULQ)

Foundation (AVX512-F):

All implementations of AVX512 are required to support AVX512-F. AVX512-F expands AVX by doubling the vector width to 512 bits and doubling the number of vector registers to 32. It also provides embedded masking by means of 8 opmask registers.

AVX512-F only supports operations on 32-bit and 64-bit elements, and only operates on zmm (512-bit) registers.
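
To make the embedded masking concrete, here is a minimal scalar sketch in plain C of how a masked 32-bit add behaves across the 16 lanes of a zmm register. It is a reference model of the semantics only, not real intrinsics code, and the function names `add_merge_masked` and `add_zero_masked` are made up for this illustration.

```c
#include <stdint.h>

#define LANES 16  /* a 512-bit register holds 16 x 32-bit elements */

/* Merge-masking: lanes whose mask bit is 0 keep the old destination value. */
void add_merge_masked(uint32_t dst[LANES], uint16_t k,
                      const uint32_t a[LANES], const uint32_t b[LANES])
{
    for (int i = 0; i < LANES; i++) {
        if ((k >> i) & 1)
            dst[i] = a[i] + b[i];   /* unsigned add wraps around */
    }
}

/* Zero-masking: lanes whose mask bit is 0 are set to zero. */
void add_zero_masked(uint32_t dst[LANES], uint16_t k,
                     const uint32_t a[LANES], const uint32_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = ((k >> i) & 1) ? a[i] + b[i] : 0;
}
```

Bit i of the 16-bit mask controls lane i, which is why one opmask register is enough to cover a full vector of 32-bit elements.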

Conflict Detection (AVX512-CD):

AVX512-CD aids vectorization by providing instructions to detect data conflicts.
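
As a rough illustration of what "detecting data conflicts" means, the following scalar C sketch models the result of the 32-bit conflict-detection instruction (`vpconflictd`) on one 16-element vector: each output element is a bitmask of the earlier lanes that hold the same value. The function name `conflict_model` is hypothetical, and this is a reference model rather than vectorized code.

```c
#include <stdint.h>

#define LANES 16  /* 16 x 32-bit elements in a 512-bit vector */

/* out[i] gets bit j set for every earlier lane j < i holding the same value. */
void conflict_model(uint32_t out[LANES], const uint32_t in[LANES])
{
    for (int i = 0; i < LANES; i++) {
        uint32_t bits = 0;
        for (int j = 0; j < i; j++) {
            if (in[j] == in[i])
                bits |= (uint32_t)1 << j;
        }
        out[i] = bits;
    }
}
```

A loop such as a histogram update, where several lanes may index the same memory location, can use this information to find lanes that would collide within one vector.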

Exponential and Reciprocal (AVX512-ER):

AVX512-ER provides instructions for approximating the exponential (2^x), reciprocal, and reciprocal square root functions with increased accuracy compared to the approximations in AVX512-F. These are used to aid in the fast computation of transcendental functions.

Prefetch (AVX512-PF):

AVX512-PF provides instructions for vector gather/scatter prefetching.

Byte and Word (AVX512-BW):

AVX512-BW extends AVX512-F by adding support for byte and word (8/16-bit) operations.

Double-word and Quad-word (AVX512-DQ):

AVX512-DQ extends AVX512-F by providing more instructions for 32-bit and 64-bit data.

Vector-Length (AVX512-VL):

AVX512-VL extends AVX512-F by allowing the full AVX512 functionality to operate on xmm and ymm registers (as opposed to only zmm). This includes the masking as well as the increased register count of 32.

52-bit Integer Multiply-Add (AVX512-IFMA):

AVX512-IFMA provides fused multiply-add instructions for 52-bit integers. (Speculation: likely derived from the floating-point FMA hardware)
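
The 52-bit multiply-add can be sketched per 64-bit element in scalar C as below. This models the two halves of the operation (`vpmadd52luq` and `vpmadd52huq`): only the low 52 bits of each multiplicand are used, and either the low or the high 52 bits of the 104-bit product are added to a 64-bit accumulator. The function names are hypothetical, and `unsigned __int128` is assumed to be available (it is a GCC/Clang extension, not standard C).

```c
#include <stdint.h>

#define MASK52 ((UINT64_C(1) << 52) - 1)

/* Full 104-bit product of the low 52 bits of a and b.
   unsigned __int128 is a GCC/Clang extension. */
static unsigned __int128 mul52(uint64_t a, uint64_t b)
{
    return (unsigned __int128)(a & MASK52) * (b & MASK52);
}

/* Add the low 52 bits of the product to the 64-bit accumulator. */
uint64_t madd52lo_model(uint64_t acc, uint64_t a, uint64_t b)
{
    return acc + (uint64_t)(mul52(a, b) & MASK52);
}

/* Add the high 52 bits of the product to the 64-bit accumulator. */
uint64_t madd52hi_model(uint64_t acc, uint64_t a, uint64_t b)
{
    return acc + (uint64_t)(mul52(a, b) >> 52);
}
```

The 12 spare bits in each 64-bit accumulator leave headroom for accumulating several partial products before carries have to be propagated, which is what makes the format convenient for big-integer arithmetic.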

Vector Byte-Manipulation (AVX512-VBMI):

AVX512-VBMI provides instructions for byte-permutation. It extends the existing permute instructions to byte-granularity.
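
A byte-granularity permute on a 512-bit vector can be modeled in scalar C as follows. This is a sketch of the semantics of a full 64-byte permute (`vpermb`), where each destination byte is picked from the source by a 6-bit index; the function name `permute_bytes_model` is an assumption made for this example.

```c
#include <stdint.h>
#include <string.h>

#define BYTES 64  /* 64 x 8-bit elements in a 512-bit vector */

/* dst[i] = src[idx[i] mod 64]; a temporary lets dst alias src safely. */
void permute_bytes_model(uint8_t dst[BYTES], const uint8_t idx[BYTES],
                         const uint8_t src[BYTES])
{
    uint8_t tmp[BYTES];
    for (int i = 0; i < BYTES; i++)
        tmp[i] = src[idx[i] & (BYTES - 1)];  /* only the low 6 index bits are used */
    memcpy(dst, tmp, BYTES);
}
```

Filling `idx` with `63 - i`, for example, reverses the byte order of the whole vector in one operation.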

Vector Population Count (AVX512-VPOPCNT):

A vectorized version of the popcnt instruction for 32-bit and 64-bit elements.

4 x Fused Multiply-Add Single Precision (AVX512-4FMAPS):

AVX512-4FMAPS provides instructions that perform 4 consecutive single-precision FMAs.

Neural Network Instructions (AVX512-4VNNIW):

Specialized instructions on 16-bit integers for neural networks. These follow the same "4 consecutive operations" instruction format as AVX512-4FMAPS.

Vector Byte-Manipulation 2 (AVX512-VBMI2):

Extends AVX512-VBMI by adding support for compress/expand on 8-bit and 16-bit elements.
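
For readers unfamiliar with compress, here is a minimal scalar C sketch of a byte compress on one 64-byte vector: the bytes selected by the mask are packed contiguously into the low end of the destination. It models the zero-masking form, in which the remaining bytes are cleared. The function name `compress_bytes_model` and the returned count are conveniences invented for this example; the real instruction does not return a count.

```c
#include <stdint.h>

#define BYTES 64  /* 64 x 8-bit elements in a 512-bit vector */

/* Pack the bytes of src whose mask bit is set into the low end of dst,
   zero the rest, and return how many bytes were kept. */
int compress_bytes_model(uint8_t dst[BYTES], uint64_t k,
                         const uint8_t src[BYTES])
{
    int n = 0;
    for (int i = 0; i < BYTES; i++) {
        if ((k >> i) & 1)
            dst[n++] = src[i];      /* n <= i, so in-place use is safe */
    }
    for (int i = n; i < BYTES; i++)
        dst[i] = 0;                 /* zero-masking form */
    return n;
}
```

Expand is the inverse operation: it takes contiguous low elements and distributes them to the positions selected by the mask.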

Neural Network Instructions (AVX512-VNNI):

Specialized instructions for neural networks. This is the desktop/Xeon counterpart of AVX512-4VNNIW on Knights Mill Xeon Phi.

Bit Algorithms (AVX512-BITALG):

Extends AVX512-VPOPCNT to 8-bit and 16-bit elements. Adds additional bit manipulation instructions.

Vector AES Instructions (AVX512-VAES):

Extends the existing AES-NI instructions to 512-bit width.

Galois Field Arithmetic (AVX512-VGFI):

Arithmetic over the Galois field GF(2^8). Intel's official name for this extension is GFNI (Galois Field New Instructions).

Vector Carry-less Multiply (AVX512-VPCLMULQ):

Vectorized version of the pclmulqdq instruction.

349 questions
86
votes
1 answer

memory bandwidth for many channels x86 systems

I'm testing the memory bandwidth on a desktop and a server. Sklyake desktop 4 cores/8 hardware threads Skylake server Xeon 8168 dual socket 48 cores (24 per socket) / 96 hardware threads The peak bandwidth of the system is Peak bandwidth desktop =…
Z boson
83
votes
2 answers

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

I'm trying to optimize some matrix computations and I was wondering if it was possible to detect at compile-time if SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI[1] is enabled by the compiler ? Ideally for GCC and Clang, but I can manage with only one…
Baptiste Wicht
55
votes
2 answers

SIMD instructions lowering CPU frequency

I read this article. It talked about why AVX-512 instruction: Intel’s latest processors have advanced instructions (AVX-512) that may cause the core, or maybe the rest of the CPU to run slower because of how much power they use. I think on…
HCSF
38
votes
1 answer

Fast AVX512 modulo when same divisor

I have tried to find divisors to potential factorial primes (number of the form n!+-1) and because I recently bought Skylake-X workstation I thought that I could get some speed up using AVX512 instructions. Algorithm is simple and main step is to…
Nuutti
34
votes
0 answers

Per-element atomicity of vector load/store and gather/scatter?

Consider an array like atomic shared_array[]. What if you want to SIMD vectorize for(...) sum += shared_array[i].load(memory_order_relaxed)?. Or to search an array for the first non-zero element, or zero a range of it? It's probably…
Peter Cordes
21
votes
2 answers

Choice between aligned vs. unaligned x86 SIMD instructions

There are generally two types of SIMD instructions: A. Ones that work with aligned memory addresses, that will raise general-protection (#GP) exception if the address is not aligned on the operand size boundary: movaps xmm0, xmmword ptr…
MikeF
20
votes
1 answer

Dynamically determining where a rogue AVX-512 instruction is executing

I have a process running on an Intel machine that supports AVX-512, but this process doesn't directly use any AVX-512 instructions (asm or intrinsics) and is compiled with -mno-avx512f so that the compiler doesn't insert any AVX-512…
BeeOnRope
18
votes
3 answers

How to convert a binary integer number to a hex string?

Given a number in a register (a binary integer), how to convert it to a string of hexadecimal ASCII digits? (i.e. serialize it into a text format.) Digits can be stored in memory or printed on the fly, but storing in memory and printing all at once…
Peter Cordes
18
votes
2 answers

How to transpose a 16x16 matrix using SIMD instructions?

I'm currently writing some code targeting Intel's forthcoming AVX-512 SIMD instructions, which supports 512-bit operations. Now assuming there's a matrix represented by 16 SIMD registers, each holding 16 32-bit integers (corresponds to a row), how…
lei_z
17
votes
1 answer

How do the Conflict Detection instructions make it easier to vectorize loops?

The AVX512CD instruction families are: VPCONFLICT, VPLZCNT and VPBROADCASTM. The Wikipedia section about these instruction says: The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free…
zr.
17
votes
2 answers

In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?

While trying to answer Embedded broadcasts with intrinsics and assembly, I was trying to do something like this: __m512 mul_bcast(__m512 a, float b) { asm( "vbroadcastss %k[scalar], %q[scalar]\n\t" // want vbcast.. %xmm0, %zmm0 …
Peter Cordes
16
votes
0 answers

Costs of new AVX512 instruction - Scatter store

I'm playing around with the new AVX512 instruction sets and I try to understand how they work and how one can use them. What I try is to interleave specific data, selected by a mask. My little benchmark loads x*32 byte of aligned data from memory…
Hymir
12
votes
1 answer

When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?

When I do a writemasked AVX-512 store, like so: vmovdqu8 [rsi] {k1}, zmm0 Will the instruction fault if some portion of the memory accessed at [rsi, rsi + 63] is not mapped but the writemask is zero for all those locations (i.e., the data is not…
BeeOnRope
12
votes
3 answers

What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?

Say, I want to clear 4 zmm registers. Will the following code provide the fastest speed? vpxorq zmm0, zmm0, zmm0 vpxorq zmm1, zmm1, zmm1 vpxorq zmm2, zmm2, zmm2 vpxorq zmm3, zmm3, zmm3 On AVX2, if I wanted to clear ymm registers, vpxor was…
Maxim Masiutin
12
votes
1 answer

Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?)

So far I have managed to find out that: SSE and SSE2 are mandatory for Windows 8 and later (and of course for any 64-bit OS) AVX is only supported by Windows 7 SP1 or later Are there any caveats regarding using SSE3, SSSE3, SSE4.1, SSE 4.2, AVX2…
Alexey