Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of integer instructions (where AVX only provided 256-bit floating-point instructions).

AVX2 adds support for 256-bit integer SIMD. Most existing 128-bit SSE integer instructions are extended to 256 bits. AVX2 uses the same VEX encoding scheme as AVX.

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

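As with AVX, common problems are a missing VZEROUPPER (which can cause penalties when mixing 256-bit code with legacy SSE code) and non-obvious data movement in shuffles, because most 256-bit shuffles operate independently within each 128-bit lane.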
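The 128-bit lane behaviour is easiest to see by reversing eight 32-bit integers in two ways. The following is a minimal sketch, assuming GCC or Clang on x86-64; the function names `reverse_in_lane` and `reverse_full` are hypothetical and chosen only for illustration. The `target("avx2")` attribute lets these functions compile without `-mavx2`.

```c
#include <immintrin.h>
#include <stdint.h>

/* In-lane shuffle (VPSHUFD): the same 4-element pattern is applied
 * separately to the low and the high 128-bit lane, so elements never
 * move from one lane to the other. */
__attribute__((target("avx2")))
void reverse_in_lane(const int32_t in[8], int32_t out[8])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    /* 0x1B == _MM_SHUFFLE(0, 1, 2, 3): picks elements 3,2,1,0 of each lane */
    __m256i r = _mm256_shuffle_epi32(v, 0x1B);
    _mm256_storeu_si256((__m256i *)out, r);
}

/* Lane-crossing permute (VPERMD, new in AVX2): every output element
 * can be taken from any of the eight input elements. */
__attribute__((target("avx2")))
void reverse_full(const int32_t in[8], int32_t out[8])
{
    __m256i v   = _mm256_loadu_si256((const __m256i *)in);
    __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i r   = _mm256_permutevar8x32_epi32(v, idx);
    _mm256_storeu_si256((__m256i *)out, r);
}
```

For the input 0..7, `reverse_in_lane` produces 3,2,1,0,7,6,5,4 while `reverse_full` produces 7,6,5,4,3,2,1,0. Callers should confirm AVX2 support first, for example with the GCC/Clang builtin `__builtin_cpu_supports("avx2")`.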

AVX2 also adds the following new functionality:

  • Scalar -> Vector register broadcast
  • Gather loads for loading a vector from different memory locations.
  • Masked memory loads/stores
  • New permute instructions
  • Element-wise bit-shifting that allows each element of a vector to be shifted by a different amount.
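Three of the features listed above can be sketched with intrinsics as follows. This is an illustrative example only, assuming GCC or Clang on x86-64; the function names `shift_each`, `gather8` and `load_first_n` are hypothetical helpers, not part of any library.

```c
#include <immintrin.h>
#include <stdint.h>

/* Per-element shift (VPSLLVD): out[i] = values[i] << counts[i]. */
__attribute__((target("avx2")))
void shift_each(const uint32_t values[8], const uint32_t counts[8],
                uint32_t out[8])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)values);
    __m256i c = _mm256_loadu_si256((const __m256i *)counts);
    _mm256_storeu_si256((__m256i *)out, _mm256_sllv_epi32(v, c));
}

/* Gather load (VGATHERDPS): out[i] = table[indices[i]]. */
__attribute__((target("avx2")))
void gather8(const float *table, const int32_t indices[8], float out[8])
{
    __m256i idx = _mm256_loadu_si256((const __m256i *)indices);
    /* The last argument is the byte scale applied to each index. */
    __m256 g = _mm256_i32gather_ps(table, idx, 4);
    _mm256_storeu_ps(out, g);
}

/* Masked load (VPMASKMOVD): read only the first n (0..8) ints from src.
 * Masked-off elements are not accessed and are returned as zero. */
__attribute__((target("avx2")))
void load_first_n(const int32_t *src, int n, int32_t out[8])
{
    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    /* All-ones in element i where i < n, zero elsewhere. */
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);
    _mm256_storeu_si256((__m256i *)out, _mm256_maskload_epi32(src, mask));
}
```

The masked load is a common way to handle the tail of an array whose length is not a multiple of eight. As with any AVX2 code, check for CPU support at run time (for instance with `__builtin_cpu_supports("avx2")` on GCC/Clang) before calling these functions.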

The AVX2 instruction set was introduced together with FMA3 (3-operand Fused Multiply-Add) in 2013 with Intel's Haswell processor line. (AMD CPUs from Piledriver onwards support FMA3, but AVX2 support was not introduced then.)

683 questions
53
votes
5 answers

How to tell if a Linux machine supports AVX/AVX2 instructions?

I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires AVX/AVX2 instruction support. I get an Illegal…
user4979733
  • 3,181
  • 4
  • 26
  • 41
48
votes
6 answers

AVX2 what is the most efficient way to pack left based on a mask?

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2? I've seen in SSE where it was done like…
Froglegs
  • 1,095
  • 1
  • 11
  • 21
37
votes
2 answers

Why is Intel Haswell XEON CPU sporadically miscomputing FFTs and ART?

During the last days I observed a behaviour of my new workstation I couldn't explain. Doing some research on this problem, there might be a possible bug in the INTEL Haswell architecture as well as in the current Skylake Generation. Before writing…
semm0
  • 947
  • 7
  • 18
34
votes
2 answers

In what situation would the AVX2 gather instructions be faster than individually loading the data?

I have been investigating the use of the new gather instructions of the AVX2 instruction set. Specifically, I decided to benchmark a simple problem, where one floating point array is permuted and added to another. In c, this can be implemented…
infinitesimal
  • 343
  • 3
  • 4
26
votes
2 answers

How are the gather instructions in AVX2 implemented?

Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices. What happens when the data to be loaded exists in different cache-lines? Is the instruction implemented as a hardware loop which fetches…
Anuj Kalia
  • 803
  • 8
  • 16
26
votes
5 answers

How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?

The intrinsic: int mask = _mm256_movemask_epi8(__m256i s1) creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2 for example) I would like to perform…
Satya Arjunan
  • 575
  • 4
  • 11
23
votes
2 answers

Fastest way to multiply an array of int64_t?

I want to vectorize the multiplication of two memory aligned arrays. I didn't find any way to multiply 64*64 bit in AVX/AVX2, so I just did loop-unroll and AVX2 loads/stores. Is there a faster way to do this? Note: I don't want to save the…
23
votes
2 answers

AVX 256-bit code performing slightly worse than equivalent 128-bit SSSE3 code

I am trying to write very efficient Hamming-distance code. Inspired by Wojciech Muła's extremely clever SSSE3 popcount implementation, I coded an AVX2 equivalent solution, this time using 256-bit registers. I was expecting at least a 30%-40%…
BlueStrat
  • 2,202
  • 17
  • 27
22
votes
5 answers

Transpose an 8x8 float using AVX/AVX2

Transposing an 8x8 matrix can be achieved by making four 4x4 matrices, and transposing each of them. This is not what I'm going for. In another question, one answer gave a solution that would only require 24 instructions for an 8x8 matrix. However,…
DavidS
  • 1,660
  • 1
  • 12
  • 26
19
votes
4 answers

How to find the horizontal maximum in a 256-bit AVX vector

I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value; My attempts all ended up using a lot of shuffling of the…
Luigi Castelli
  • 676
  • 2
  • 6
  • 13
19
votes
1 answer

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports the AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product…
cyrusbehr
  • 1,100
  • 1
  • 12
  • 32
19
votes
2 answers

Haswell memory access

I was experimenting with AVX -AVX2 instruction sets to see the performance of streaming on consecutive arrays. So I have below example, where I do basic memory read and store. #include #include #include #include…
edorado
  • 275
  • 2
  • 10
18
votes
3 answers

Get sum of values stored in __m256d with SSE/AVX

Is there a way to get sum of values stored in __m256d variable? I have this code. acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec)); //acc in this point contains {2.0, 8.0, 18.0, 32.0} acc = _mm256_hadd_pd(acc, acc); result[i] = ((double*)&acc)[0] +…
Peter
  • 435
  • 1
  • 3
  • 11
15
votes
1 answer

Is it possible to use SIMD instructions in Rust?

In C/C++, you can use intrinsics for SIMD (such as AVX and AVX2) instructions. Is there a way to use SIMD in Rust?
pythonic
  • 20,589
  • 43
  • 136
  • 219
15
votes
3 answers

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies, but nothing with 64-bit sources. Let's say I need an unsigned multiply with inputs larger…
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386