Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width ymm vector registers, and some new instructions for manipulating them. The floating point vector instructions have 256b versions in AVX, but 256b integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles.

Mixing AVX (vex-encoded) and non-AVX (old SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs, to avoid a major performance problem. This has led to several performance questions where this was the answer.

Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.

Interesting Q&As / FAQs:

Why does my code with AVX crash with segfault/access violation? Most likely you don't align the data when needed. 256-bit memory operands (__m256* types) require 32 bytes alignment, 512-bit memory operands (__m512* types) require 64 bytes alignment, except for explicitly unaligned operations.
How to solve the 32-byte-alignment issue for AVX load/store operations? explains alignas, aligned_alloc, _aligned_malloc, C++17 aligned new, etc, and use of unaligned loadu / storeu intrinsics.
Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE works. , Includes in-lane vs. lane-crossing for AVX.
Do 128bit cross lane operations in AVX512 give better performance? In-lane can still be lower latency, but shuffle throughput is often the bigger problem. Tricks like unaligned / overlapping loads can reduce the number shuffles.
Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) AVX has to be supported by OS, not just by CPU. Fortunately, there's a way to detect its support in OS-independent way.

1252 questions

771

votes

11 answers

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2

I have recently installed tensorflow (Windows CPU version) and received the following message: Successfully installed tensorflow-1.4.0 tensorflow-tensorboard-0.4.0rc2 Then when I tried to run import tensorflow as tf hello = tf.constant('Hello,…

asked Nov 02 '17 at 06:10

csg

8,096
3
14
38

votes

2 answers

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

I'm trying to optimize some matrix computations and I was wondering if it was possible to detect at compile-time if SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI[1] is enabled by the compiler ? Ideally for GCC and Clang, but I can manage with only one…

gcc clang sse avx avx512

asked Mar 09 '15 at 10:23

Baptiste Wicht

7,472
7
45
110

votes

1 answer

C# and SIMD: High and low speedups. What is happening?

Introduction of the problem I am trying to speed up the intersection code of a (2d) ray tracer that I am writing. I am using C# and the System.Numerics library to bring the speed of SIMD instructions. The problem is that I am getting strange…

c# performance x86-64 simd avx

asked Jul 09 '19 at 11:42

Willem124

votes

7 answers

How to check if a CPU supports the SSE3 instruction set?

Is the following code valid to check if a CPU supports the SSE3 instruction set? Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP. bool CheckSSE3() { int CPUInfo[4] = {-1}; //-- Get number of valid info…

c++ sse instruction-set avx cpuid

asked May 25 '11 at 08:49

Stiefel

2,677
3
31
42

votes

2 answers

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

I've been trying to figure out a performance problem in an application and have finally narrowed it down to a really weird problem. The following piece of code runs 6 times slower on a Skylake CPU (i5-6500) if the VZEROUPPER instruction is commented…

performance x86 intel sse avx

asked Dec 23 '16 at 15:09

Olivier

1,144
1
8
15

votes

10 answers

Optimizations for pow() with const non-integer exponent?

I have hot spots in my code where I'm doing pow() taking up around 10-20% of my execution time. My input to pow(x,y) is very specific, so I'm wondering if there's a way to roll two pow() approximations (one for each exponent) with higher…

c++ math optimization avx exponent

asked Jun 25 '11 at 02:17

Cory Nelson

29,236
5
72
110

votes

2 answers

FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

I'm confused on how many flops per cycle per core can be done with Sandy-Bridge and Haswell. As I understand it with SSE it should be 4 flops per cycle per core for SSE and 8 flops per cycle per core for AVX/AVX2. This seems to be verified…

cpu intel cpu-architecture avx flops

asked Mar 27 '13 at 09:48

user2088790

votes

2 answers

Using AVX CPU instructions: Poor performance without "/arch:AVX"

My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX commands. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX. To use AVX, it is necessary to…

c++ performance visual-studio-2010 sse avx

asked Oct 20 '11 at 17:40

Mike

1,717
2
15
19

votes

2 answers

How to choose AVX compare predicate variants

In the Advanced Vector Extensions (AVX) the compare instructions like _m256_cmp_ps, the last argument is a compare predicate. The choices for the predicate overwhelm me. They seem to be a tripple of type, ordering, signaling. E.g. _CMP_LE_OS is…

simd avx

asked Jun 07 '13 at 15:52

Bram

7,440
3
52
94

votes

5 answers

How to tell if a Linux machine supports AVX/AVX2 instructions?

I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires AVX/AVX2 instruction support. I get an Illegal…

linux unix avx suse avx2

asked May 27 '16 at 09:40

user4979733

3,181
4
26
41

votes

2 answers

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

I have learned that some Intel/AMD CPUs can do simultanous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I like to know how to do this best in code and I also want to know how it's done internally in the…

c sse cpu-architecture avx fma

asked Apr 10 '13 at 18:02

user2088790

votes

4 answers

Using AVX intrinsics instead of SSE does not improve speed -- why?

I've been using Intel's SSE intrinsics for quite some time with good performance gains. Hence, I expected the AVX intrinsics to further speed-up my programs. This, unfortunately, was not the case until now. Probably I am doing a stupid mistake, so I…

c++ performance gcc sse avx

asked Jan 19 '12 at 10:47

user1158218

votes

4 answers

Intel SSE and AVX Examples and Tutorials

Is there any good C/C++ tutorials or examples for learning Intel SSE and AVX instructions? I found few on Microsoft MSDN and Intel sites, but it would be great to understand it from the basics..

intel sse vectorization avx

asked Nov 27 '12 at 04:03

veda

6,416
15
58
78

votes

2 answers

Are different mmx, sse and avx versions complementary or supersets of each other?

I'm thinking I should familiarize myself with x86 SIMD extensions. But before I even began I ran into trouble. I can't find a good overview on which of them are still relevant. The x86 architecture has accumulated a lot of math/multimedia extensions…

x86 sse avx mmx

asked Jul 18 '15 at 11:50

snoukkis

votes

0 answers

Per-element atomicity of vector load/store and gather/scatter?

Consider an array like atomic shared_array[]. What if you want to SIMD vectorize for(...) sum += shared_array[i].load(memory_order_relaxed)?. Or to search an array for the first non-zero element, or zero a range of it? It's probably…

x86 atomic sse avx avx512

asked Sep 02 '17 at 09:56

Peter Cordes

328,167
45
605
847

2 3

…

83 84 Next