Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256-bit versions of the integer SIMD instructions (AVX provided 256-bit operations only for floating point).

AVX2 adds support for 256-bit integer SIMD: most existing 128-bit SSE integer instructions are extended to 256 bits. AVX2 uses the same VEX encoding scheme as AVX.
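As a minimal sketch of what "extended to 256-bit" means in practice, the code below adds two int32_t arrays eight elements at a time with _mm256_add_epi32 (VPADDD), the 256-bit counterpart of SSE2's _mm_add_epi32. The names add_i32, add_i32_avx2 and add_i32_scalar are made up for this illustration, and the code assumes GCC or Clang on x86, since it relies on __attribute__((target("avx2"))) and __builtin_cpu_supports.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Plain C reference; also handles leftovers and CPUs without AVX2.
   The unsigned cast makes overflow wrap, matching VPADDD. */
static void add_i32_scalar(const int32_t *a, const int32_t *b,
                           int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
}

/* The target attribute lets GCC/Clang compile this one function for
   AVX2 without passing -mavx2 for the whole file. */
__attribute__((target("avx2")))
static void add_i32_avx2(const int32_t *a, const int32_t *b,
                         int32_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* Unaligned loads/stores: no alignment requirement on the arrays. */
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(va, vb));
    }
    /* Fewer than 8 elements remain. */
    add_i32_scalar(a + i, b + i, out + i, n - i);
}

/* Hypothetical entry point: pick the AVX2 path only if the CPU has it. */
void add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    assert(a && b && out);
    if (__builtin_cpu_supports("avx2"))
        add_i32_avx2(a, b, out, n);
    else
        add_i32_scalar(a, b, out, n);
}
```

Calling add_i32(a, b, out, n) gives the same result on either path; the runtime check is only there so the binary does not fault with an illegal instruction on CPUs without AVX2.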

See the x86 tag page for guides and other resources for programming and optimising programs using AVX2.

As with AVX, common problems are a missing VZEROUPPER (which can cause SSE/AVX transition penalties) and non-obvious data movement in shuffles, because most 256-bit shuffles operate within two independent 128-bit lanes.
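The lane behaviour is easiest to see on a concrete shuffle. The sketch below uses _mm256_unpacklo_epi32 (VPUNPCKLDQ) next to a scalar model of what the instruction does; the function names are hypothetical, and GCC or Clang on x86 is assumed for __attribute__((target("avx2"))) and __builtin_cpu_supports.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Scalar model of 256-bit VPUNPCKLDQ: each 128-bit lane (4 dwords) is
   interleaved on its own; nothing moves between the two lanes. */
static void unpacklo_epi32_model(const int32_t a[8], const int32_t b[8],
                                 int32_t out[8])
{
    for (int lane = 0; lane < 2; lane++) {
        const int base = lane * 4;
        out[base + 0] = a[base + 0];
        out[base + 1] = b[base + 0];
        out[base + 2] = a[base + 1];
        out[base + 3] = b[base + 1];
    }
}

/* The real instruction, compiled for AVX2 via the target attribute. */
__attribute__((target("avx2")))
static void unpacklo_epi32_avx2(const int32_t a[8], const int32_t b[8],
                                int32_t out[8])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_unpacklo_epi32(va, vb));
}

/* Hypothetical wrapper: AVX2 when the CPU reports it, else the model. */
void interleave_low(const int32_t a[8], const int32_t b[8], int32_t out[8])
{
    assert(a && b && out);
    if (__builtin_cpu_supports("avx2"))
        unpacklo_epi32_avx2(a, b, out);
    else
        unpacklo_epi32_model(a, b, out);
}
```

With a = {0,...,7} and b = {10,...,17} the result is {0,10,1,11, 4,14,5,15}, not the {0,10,1,11, 2,12,3,13} that a simple widening of the SSE2 instruction might suggest. Getting the full-width interleave needs an extra lane-crossing step such as _mm256_permute4x64_epi64 or _mm256_permute2x128_si256. On the VZEROUPPER side, compilers normally insert it themselves in intrinsics code, so it mostly needs manual attention in hand-written assembly.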

AVX2 also adds the following new functionality:

Scalar -> Vector register broadcast
Gather loads for loading a vector from different memory locations.
Masked memory loads/stores
New permute instructions
Element-wise bit-shifting that allows each element of a vector to be shifted by a different amount.
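Two of the additions listed above, gather loads and per-element shifts, can be sketched together. The example computes out[i] = table[idx[i]] << shift[i] with _mm256_i32gather_epi32 (VPGATHERDD) and _mm256_sllv_epi32 (VPSLLVD). The function names and the scalar fallback are illustrative assumptions, and GCC or Clang on x86 is assumed.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Scalar model: shift counts above 31 give 0, as VPSLLVD does. */
static void gather_shift_model(const uint32_t *table, const int32_t idx[8],
                               const uint32_t shift[8], uint32_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = shift[i] > 31 ? 0u : table[idx[i]] << shift[i];
}

__attribute__((target("avx2")))
static void gather_shift_avx2(const uint32_t *table, const int32_t idx[8],
                              const uint32_t shift[8], uint32_t out[8])
{
    __m256i vidx   = _mm256_loadu_si256((const __m256i *)idx);
    __m256i vshift = _mm256_loadu_si256((const __m256i *)shift);
    /* Gather: eight independent 32-bit loads from table[idx[i]].
       The scale must be a compile-time constant; 4 = sizeof(uint32_t). */
    __m256i v = _mm256_i32gather_epi32((const int *)table, vidx, 4);
    /* Variable shift: every element is shifted by its own count. */
    v = _mm256_sllv_epi32(v, vshift);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Hypothetical wrapper: AVX2 when the CPU reports it, else the model. */
void gather_shift(const uint32_t *table, const int32_t idx[8],
                  const uint32_t shift[8], uint32_t out[8])
{
    assert(table && idx && shift && out);
    if (__builtin_cpu_supports("avx2"))
        gather_shift_avx2(table, idx, shift, out);
    else
        gather_shift_model(table, idx, shift, out);
}
```

Before AVX2, a shift by a vector count applied one shared count to all elements, so per-element counts needed a workaround such as multiplying by powers of two. Whether a gather beats separate scalar loads depends on the microarchitecture, so it is worth benchmarking rather than assuming.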

The AVX2 instruction set was introduced together with FMA3 (3-operand fused multiply-add) in 2013 with Intel's Haswell processor line. (AMD CPUs from Piledriver onwards support FMA3, but AVX2 support was not introduced then.)

683 questions

5 answers

How to tell if a Linux machine supports AVX/AVX2 instructions?

I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires AVX/AVX2 instruction support. I get an Illegal…

asked May 27 '16 at 09:40

user4979733

6 answers

AVX2 what is the most efficient way to pack left based on a mask?

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2? I've seen in SSE where it was done like…

c++ vectorization sse simd avx2

asked Apr 29 '16 at 07:30

Froglegs

2 answers

Why is Intel Haswell XEON CPU sporadically miscomputing FFTs and ART?

During the last days I observed a behaviour of my new workstation I couldn't explain. Doing some research on this problem, there might be a possible bug in the INTEL Haswell architecture as well as in the current Skylake Generation. Before writing…

intel cpu-architecture processor avx2

asked Jan 19 '16 at 09:34

semm0

2 answers

In what situation would the AVX2 gather instructions be faster than individually loading the data?

I have been investigating the use of the new gather instructions of the AVX2 instruction set. Specifically, I decided to benchmark a simple problem, where one floating point array is permuted and added to another. In C, this can be implemented…

assembly optimization x86 vectorization avx2

asked Jul 15 '14 at 11:02

infinitesimal

2 answers

How are the gather instructions in AVX2 implemented?

Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices. What happens when the data to be loaded exists in different cache-lines? Is the instruction implemented as a hardware loop which fetches…

intel ram simd avx avx2

asked Feb 14 '14 at 08:39

Anuj Kalia

5 answers

How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?

The intrinsic: int mask = _mm256_movemask_epi8(__m256i s1) creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2 for example) I would like to perform…

c x86 simd avx avx2

asked Feb 07 '14 at 07:55

Satya Arjunan

2 answers

Fastest way to multiply an array of int64_t?

I want to vectorize the multiplication of two memory aligned arrays. I didn't find any way to multiply 64*64 bit in AVX/AVX2, so I just did loop-unroll and AVX2 loads/stores. Is there a faster way to do this? Note: I don't want to save the…

c vectorization multiplication avx avx2

asked May 18 '16 at 10:01

Hélder Gonçalves

2 answers

AVX 256-bit code performing slightly worse than equivalent 128-bit SSSE3 code

I am trying to write very efficient Hamming-distance code. Inspired by Wojciech Muła's extremely clever SSE3 popcount implementation, I coded an AVX2 equivalent solution, this time using 256 bit registers. I was expecting at least a 30%-40%…

c++ performance sse avx2

asked Jul 17 '15 at 00:51

BlueStrat

5 answers

Transpose an 8x8 float using AVX/AVX2

Transposing an 8x8 matrix can be achieved by making four 4x4 matrices, and transposing each of them. This is not what I'm going for. In another question, one answer gave a solution that would only require 24 instructions for an 8x8 matrix. However,…

simd avx avx2

asked Sep 02 '14 at 11:51

DavidS

4 answers

How to find the horizontal maximum in a 256-bit AVX vector

I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My attempts all ended up using a lot of shuffling of the…

x86 simd avx vector-processing avx2

asked Mar 20 '12 at 21:48

Luigi Castelli

1 answer

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports the AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product…

c++ simd avx2 dot-product fma

asked Dec 27 '19 at 00:23

cyrusbehr

2 answers

Haswell memory access

I was experimenting with the AVX/AVX2 instruction sets to see the performance of streaming on consecutive arrays. So I have the example below, where I do a basic memory read and store. #include #include #include #include…

performance x86 cpu-architecture avx2 intel-pmu

asked Oct 27 '13 at 18:08

edorado

3 answers

Get sum of values stored in __m256d with SSE/AVX

Is there a way to get sum of values stored in __m256d variable? I have this code. acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec)); //acc in this point contains {2.0, 8.0, 18.0, 32.0} acc = _mm256_hadd_pd(acc, acc); result[i] = ((double*)&acc)[0] +…

c++ optimization sse avx avx2

asked Apr 20 '18 at 12:27

Peter

1 answer

Is it possible to use SIMD instructions in Rust?

In C/C++, you can use intrinsics for SIMD (such as AVX and AVX2) instructions. Is there a way to use SIMD in Rust?

rust simd avx avx2

asked Mar 21 '17 at 21:58

pythonic

3 answers

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies, but nothing with 64-bit sources. Let's say I need an unsigned multiply with inputs larger…

floating-point x86 simd avx2 fma

asked Dec 30 '16 at 22:54

BeeOnRope
