Questions tagged [avx]

Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.

AVX provides a new (VEX) encoding for all previous Intel SSE instructions, giving 3-operand non-destructive operation. It also introduces double-width (256-bit) ymm vector registers, and some new instructions for manipulating them. The floating-point vector instructions have 256-bit versions in AVX, but 256-bit integer instructions require AVX2. AVX2 also introduced lane-crossing floating-point shuffles with finer than 128-bit granularity (VPERMPS, VPERMPD); AVX's own lane-crossing shuffles work in whole 128-bit units (e.g. VPERM2F128).
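As a rough illustration of the AVX vs. AVX2 split described above, here is a minimal C sketch. The function names `add8_float` and `add8_int32` are made up for this example, and the `target` attribute is a GCC/Clang extension (assumed here so the file compiles without `-mavx`/`-mavx2`); other compilers need a different mechanism.

```c
#include <immintrin.h>
#include <stdint.h>

/* 256-bit floating-point add: only needs AVX. */
__attribute__((target("avx")))
void add8_float(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);              /* unaligned load of 8 floats */
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
}

/* 256-bit integer add: the arithmetic instruction (VPADDD ymm) needs AVX2. */
__attribute__((target("avx2")))
void add8_int32(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi32(va, vb));
}
```

Callers should check CPU support before calling either function, for example with the GCC/Clang builtins `__builtin_cpu_supports("avx")` and `__builtin_cpu_supports("avx2")`.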

Mixing AVX (VEX-encoded) and non-AVX (legacy SSE encoding) instructions in the same program requires careful use of VZEROUPPER on Intel CPUs to avoid a major performance problem. Several performance questions under this tag turned out to have exactly this answer.
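A minimal sketch of where VZEROUPPER typically goes when writing intrinsics by hand: after the last 256-bit operation and before returning to code that may use legacy SSE encodings. The function name `sum_avx` is hypothetical, and the `target` attribute is a GCC/Clang extension assumed for this example. Compilers often insert VZEROUPPER automatically when building with AVX enabled, so the explicit call mainly matters for hand-written assembly or unusual build setups.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sum n floats using 256-bit AVX adds, then clear the upper ymm halves. */
__attribute__((target("avx")))
float sum_avx(const float *p, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* Main loop: 8 floats per iteration. */
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);

    /* Done with 256-bit registers: emit VZEROUPPER so that later
       legacy-SSE code does not pay a transition penalty. */
    _mm256_zeroupper();

    float total = 0.0f;
    for (int k = 0; k < 8; k++)
        total += lanes[k];
    for (; i < n; i++)          /* scalar tail for leftover elements */
        total += p[i];
    return total;
}
```

As with any function compiled for AVX, check `__builtin_cpu_supports("avx")` (or an equivalent CPUID test) before calling it.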

Another pitfall for beginners is that most 256b instructions operate on two 128b lanes, rather than treating a ymm register as one long vector. Carefully study which element moves where when using UNPCKLPS and other shuffle / horizontal instructions.
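To make the in-lane behaviour concrete, the sketch below pairs the 256-bit unpack intrinsic with a scalar model of what it does. Both function names (`unpacklo_model`, `unpacklo_avx`) are invented for this illustration, and the `target` attribute is a GCC/Clang extension assumed here.

```c
#include <immintrin.h>

/* Scalar model of 256-bit VUNPCKLPS: each 128-bit lane (4 floats)
   is interleaved independently of the other lane. */
void unpacklo_model(const float a[8], const float b[8], float out[8])
{
    for (int lane = 0; lane < 2; lane++) {
        const int base = lane * 4;
        out[base + 0] = a[base + 0];
        out[base + 1] = b[base + 0];
        out[base + 2] = a[base + 1];
        out[base + 3] = b[base + 1];
    }
}

/* The same operation using the AVX intrinsic. */
__attribute__((target("avx")))
void unpacklo_avx(const float a[8], const float b[8], float out[8])
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_unpacklo_ps(va, vb));
}
```

With a = {0, 1, ..., 7} and b = {10, 11, ..., 17}, the result is {0, 10, 1, 11, 4, 14, 5, 15}, not the {0, 10, 1, 11, 2, 12, 3, 13} one might expect if the ymm register were treated as a single 8-element vector.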

See the x86 tag page for guides and other resources for programming and optimising programs using AVX. See the SSE tag wiki for some guides to SIMD programming techniques, rather than just instruction-set references.

See also Crunching Numbers with AVX and AVX2 for an intro to using AVX intrinsics, with simple examples.



1252 questions
771 votes, 11 answers

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2

I have recently installed tensorflow (Windows CPU version) and received the following message: Successfully installed tensorflow-1.4.0 tensorflow-tensorboard-0.4.0rc2 Then when I tried to run import tensorflow as tf hello = tf.constant('Hello,…
csg
83 votes, 2 answers

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

I'm trying to optimize some matrix computations and I was wondering if it was possible to detect at compile-time whether SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI[1] is enabled by the compiler? Ideally for GCC and Clang, but I can manage with only one…
Baptiste Wicht
75 votes, 1 answer

C# and SIMD: High and low speedups. What is happening?

Introduction of the problem I am trying to speed up the intersection code of a (2d) ray tracer that I am writing. I am using C# and the System.Numerics library to bring the speed of SIMD instructions. The problem is that I am getting strange…
Willem124
74 votes, 7 answers

How to check if a CPU supports the SSE3 instruction set?

Is the following code valid to check if a CPU supports the SSE3 instruction set? Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP. bool CheckSSE3() { int CPUInfo[4] = {-1}; //-- Get number of valid info…
Stiefel
63 votes, 2 answers

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

I've been trying to figure out a performance problem in an application and have finally narrowed it down to a really weird problem. The following piece of code runs 6 times slower on a Skylake CPU (i5-6500) if the VZEROUPPER instruction is commented…
Olivier
61 votes, 10 answers

Optimizations for pow() with const non-integer exponent?

I have hot spots in my code where I'm doing pow() taking up around 10-20% of my execution time. My input to pow(x,y) is very specific, so I'm wondering if there's a way to roll two pow() approximations (one for each exponent) with higher…
Cory Nelson
60 votes, 2 answers

FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

I'm confused on how many flops per cycle per core can be done with Sandy-Bridge and Haswell. As I understand it with SSE it should be 4 flops per cycle per core for SSE and 8 flops per cycle per core for AVX/AVX2. This seems to be verified…
user2088790

59 votes, 2 answers

Using AVX CPU instructions: Poor performance without "/arch:AVX"

My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX commands. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX. To use AVX, it is necessary to…
Mike
59 votes, 2 answers

How to choose AVX compare predicate variants

In Advanced Vector Extensions (AVX) compare intrinsics such as _mm256_cmp_ps, the last argument is a compare predicate. The choices for the predicate overwhelm me. They seem to be a triple of type, ordering, signaling. E.g. _CMP_LE_OS is…
Bram
53 votes, 5 answers

How to tell if a Linux machine supports AVX/AVX2 instructions?

I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires AVX/AVX2 instruction support. I get an Illegal…
user4979733
48 votes, 2 answers

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I'd like to know how to do this best in code and I also want to know how it's done internally in the…
user2088790

45 votes, 4 answers

Using AVX intrinsics instead of SSE does not improve speed -- why?

I've been using Intel's SSE intrinsics for quite some time with good performance gains. Hence, I expected the AVX intrinsics to further speed up my programs. This, unfortunately, was not the case until now. Probably I am making a stupid mistake, so I…
user1158218
42 votes, 4 answers

Intel SSE and AVX Examples and Tutorials

Are there any good C/C++ tutorials or examples for learning Intel SSE and AVX instructions? I found a few on the Microsoft MSDN and Intel sites, but it would be great to understand it from the basics.
veda
41 votes, 2 answers

Are different mmx, sse and avx versions complementary or supersets of each other?

I'm thinking I should familiarize myself with x86 SIMD extensions. But before I even began I ran into trouble. I can't find a good overview on which of them are still relevant. The x86 architecture has accumulated a lot of math/multimedia extensions…
snoukkis
34 votes, 0 answers

Per-element atomicity of vector load/store and gather/scatter?

Consider an array like atomic shared_array[]. What if you want to SIMD vectorize for(...) sum += shared_array[i].load(memory_order_relaxed)?. Or to search an array for the first non-zero element, or zero a range of it? It's probably…
Peter Cordes