Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly-named vector extensions to the x86 instruction set. At this point, SSE is more often used as a catch-all for x86 vector instructions in general, and not as a reference to SSE without SSE2, SSE3, etc. (For Server-Sent Events, use the [server-sent-events] tag instead.)

See the x86 tag wiki for guides and other resources for programming and optimising programs using x86 vector extensions.

SIMD / SSE basics: What are the 128-bit to 512-bit registers used for? with links to many examples.

SSE/SIMD vector programming guides, focused on the SIMD aspect rather than general x86:

Agner Fog's Optimizing Assembly guide has a chapter on vectors, including tables of data movement instructions: broadcasts within a vector, combining data between two vectors, different kinds of shuffles, etc. It's great for finding the right instruction (or intrinsic) for the data movement you need.
Crunching Numbers with AVX and AVX2: An intro with examples of using C++ intrinsics
Slides + text: SIMD at Insomniac Games (GDC 2015): intro to SIMD, and some specific examples, such as checking all doors against all characters in a level. Advanced tricks: filtering an array into a smaller array (using left-packing based on a compare mask), with an SSSE3 pshufb solution and an SSE2 move-distance solution. Also: generating N-bit masks for variable-per-element N, including a clever float-exponent-based SSE2 version.

Instruction-set / intrinsics reference guides (see the x86 tag wiki for more links)

Intel's vector intrinsics finder/search (very good): search by asm mnemonic or C intrinsic name. Filter by type and/or by instruction-set extension family (e.g. exclude AVX512 and later). Occasionally buggy, esp. the performance info. (Look at Agner Fog's tables for performance info, although they have occasional errors or typos, too.)
Intel's manuals, including instruction set reference manual. Very detailed description of what every instruction does, using pseudo-code. These manuals are accurate much more often than the intrinsics guide.
x86/x64 SIMD Instruction List (SSE to AVX512) Beta: A nice compact table listing instruction mnemonics and their intrinsics, broken down by type and element-size. Detailed pages with graphical data-movement diagrams for each instruction.

Miscellaneous specific things:

Shuffling by mask with Intel AVX explains how shuffle-control vectors and _MM_SHUFFLE work, including in-lane vs. lane-crossing for AVX.
SSE interleave/merge/combine 2 vectors using a mask, per-element conditional move? Blends, especially variable blends (blendvps)
What are the best instruction sequences to generate vector constants on the fly?. In C/C++, almost always prefer _mm_set or _mm_set1 to initialize local variables (not globals), rather than defining arrays and loading from them.
print a __m128i variable: How to safely and portably access the elements of a vector, and how to debug-print them.
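As a minimal sketch of the last two items: the helpers below build vector constants with _mm_set_epi32 / _mm_set1_epi32 and read the elements back by storing to a plain array, which is one portable way to inspect a __m128i. The function names (store_epi32, print_epi32, make_ramp, make_splat) are hypothetical, chosen only for this illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

/* Copy the four 32-bit lanes of v into out[0..3], lowest lane first.
   Storing to an array avoids compiler-specific element access. */
void store_epi32(__m128i v, int32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, v);
}

/* Debug-print a vector as four signed 32-bit integers. */
void print_epi32(const char *label, __m128i v)
{
    int32_t lanes[4];
    store_epi32(v, lanes);
    printf("%s: %d %d %d %d\n", label,
           (int)lanes[0], (int)lanes[1], (int)lanes[2], (int)lanes[3]);
}

/* _mm_set_epi32 takes its arguments from the highest lane down to the
   lowest, so this produces lanes {0, 1, 2, 3} in memory order. */
__m128i make_ramp(void)
{
    return _mm_set_epi32(3, 2, 1, 0);
}

/* _mm_set1_epi32 broadcasts one value to all four lanes. */
__m128i make_splat(int32_t x)
{
    return _mm_set1_epi32(x);
}
```

Calling print_epi32("ramp", make_ramp()) prints the lanes in memory order, lowest element first.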

Streaming SIMD Extensions (SSE) basics

Together, the various SSE extensions allow working with 128b vectors of float, double, or integer (from 8b to 64b) elements. There are instructions for arithmetic, bitwise operations, shuffles, blends (conditional moves), compares, and some more-specialized operations (e.g. SAD for multimedia, carryless-multiply for crypto/finite-field math, strings (for strstr() and so on)). FP sqrt is provided, but unlike the x87 FPU, math library functions like sin must be implemented by software. SSE for scalar FP math has replaced x87 floating point, now that hardware support is near-universal.
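To make the "128b vectors of float" idea concrete, here is a small sketch that adds two float arrays four elements at a time using original-SSE intrinsics. The function name add_arrays is hypothetical, and unaligned loads/stores are used so the sketch makes no assumption about how the caller allocated the arrays.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* original-SSE intrinsics */

/* dst[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Main loop: one 128-bit vector holds four packed floats. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }

    /* Scalar cleanup for the last n % 4 elements. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```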

Efficient use usually requires programs to store their data in contiguous chunks, so it can be loaded in chunks of 16B and used without too much shuffling. SSE doesn't offer loads / stores with a stride, only packed. (SoA vs. AoS: structs-of-arrays vs. arrays-of-structs). Alignment requirements on memory operands can also be a hurdle, even though modern hardware has fast unaligned loads/stores.
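A rough sketch of why the SoA layout helps: with separate x, y and z arrays, four points can be loaded with three packed loads and processed without any shuffling. The function name length_squared_soa and the array layout are assumptions made for this example only.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Struct-of-arrays layout: x[i], y[i], z[i] describe point i.
   Writes x*x + y*y + z*z for each point into out[i]. */
void length_squared_soa(const float *x, const float *y, const float *z,
                        float *out, size_t n)
{
    size_t i = 0;

    /* Each load fetches the same component of four consecutive points,
       so the arithmetic is purely "vertical" (lane i only uses lane i). */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 vz = _mm_loadu_ps(z + i);
        __m128 r  = _mm_add_ps(_mm_add_ps(_mm_mul_ps(vx, vx),
                                          _mm_mul_ps(vy, vy)),
                               _mm_mul_ps(vz, vz));
        _mm_storeu_ps(out + i, r);
    }

    /* Scalar cleanup for leftover points. */
    for (; i < n; i++)
        out[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
}
```

With an AoS layout (struct { float x, y, z; } points[n]), the same computation would first need shuffles to separate the interleaved components.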

While there are many instructions available, the instruction set is not very orthogonal. It's not uncommon to find the operation you need, but only available for elements of a different size than you're working with. Another good example is that floating point shuffles (SHUFPS) have different semantics than 32b-integer shuffles (PSHUFD).
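The SHUFPS / PSHUFD difference can be seen by applying the same shuffle constant through both intrinsics. This is only an illustrative sketch; the function names shufps_demo and pshufd_demo are hypothetical.

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE1 intrinsics */
#include <stdint.h>

/* SHUFPS (_mm_shuffle_ps) has two sources: the low two result lanes are
   selected from a, the high two result lanes are selected from b. */
void shufps_demo(float out[4])
{
    __m128 a = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);
    __m128 b = _mm_setr_ps(10.0f, 11.0f, 12.0f, 13.0f);
    /* Lane selectors, lowest lane first: 3, 2, 1, 0
       -> { a[3], a[2], b[1], b[0] } */
    _mm_storeu_ps(out, _mm_shuffle_ps(a, b, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* PSHUFD (_mm_shuffle_epi32) has one source: every result lane is
   selected from a, so the same constant simply reverses the vector. */
void pshufd_demo(int32_t out[4])
{
    __m128i a = _mm_setr_epi32(0, 1, 2, 3);
    /* -> { a[3], a[2], a[1], a[0] } */
    _mm_storeu_si128((__m128i *)out,
                     _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 1, 2, 3)));
}
```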

Details

SSE added new architectural registers: xmm0-xmm7, 128b each (xmm0-xmm15 in 64bit mode). These require OS support to save/restore them on context switches. The previous MMX extensions (for integer SIMD) reused the x87 FP registers.

Intel introduced MMX, original-SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2. AMD's XOP (a revision of their SSE5 plans) was never picked up by Intel, and will be dropped even by future AMD designs. The instruction-set war between Intel and AMD has led to many sub-optimal results, which make instruction decoders in CPUs require more power and transistors (and limit the opportunity for further extensions).

"SSE" commonly refers to the whole family of extensions. Writing programs that make sure to only use instructions supported by the machine they run on is necessary, and implied, and not worth cluttering our language with. (Detecting what's supported once at startup and setting function pointers accordingly is a good approach: it avoids a branch to select an appropriate function every time one is needed.)

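One possible shape for startup-time dispatch through a function pointer is sketched below. It assumes GCC or Clang on x86, since it relies on their __builtin_cpu_supports builtin and target attribute; other compilers would need CPUID-based detection instead. All function names here (max_i32, max_i32_init, and the two implementations) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Portable fallback: largest element, or INT32_MIN for an empty array. */
static int32_t max_i32_scalar(const int32_t *p, size_t n)
{
    int32_t m = INT32_MIN;
    for (size_t i = 0; i < n; i++)
        if (p[i] > m)
            m = p[i];
    return m;
}

/* SSE4.1 version (PMAXSD). The target attribute lets this one function
   use SSE4.1 without enabling it for the whole translation unit. */
__attribute__((target("sse4.1")))
static int32_t max_i32_sse41(const int32_t *p, size_t n)
{
    __m128i vmax = _mm_set1_epi32(INT32_MIN);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        vmax = _mm_max_epi32(vmax,
                             _mm_loadu_si128((const __m128i *)(p + i)));

    /* Reduce the four lanes, then handle the leftover elements. */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, vmax);
    int32_t m = max_i32_scalar(lanes, 4);
    for (; i < n; i++)
        if (p[i] > m)
            m = p[i];
    return m;
}

typedef int32_t (*max_i32_fn)(const int32_t *, size_t);
static max_i32_fn max_i32_impl;

/* Call once at startup, before the first call to max_i32. */
void max_i32_init(void)
{
    __builtin_cpu_init();
    max_i32_impl = __builtin_cpu_supports("sse4.1") ? max_i32_sse41
                                                    : max_i32_scalar;
}

/* Callers pay one indirect call instead of a feature check per call. */
int32_t max_i32(const int32_t *p, size_t n)
{
    return max_i32_impl(p, n);
}
```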
Further SSE extensions are not expected: AVX introduced a new 3-operand version of all SSE instructions, as well as some new features (including dropping alignment requirements, except for explicitly-aligned moves like vmovdqa). Further vector extensions will be called AVX-something, until Intel comes up with something different enough to change the name again.

History

SSE, first introduced with the Pentium III in 1999, was Intel's reply to AMD's 3DNow! extension of 1998.

The original-SSE added vector single-precision floating point math. Integer instructions to operate on xmm registers (instead of 64bit mmx regs) didn't appear until SSE2.

Original-SSE can be considered somewhat half-hearted insofar as it only covered the most basic operations and suffered from severe limitations both in functionality and performance, making it mostly useful for a few select applications, such as audio or raster image processing.

Most of original-SSE's limitations were ameliorated by the SSE2 instruction set. The only notable limitation remaining to date is the lack of a horizontal-add or dot-product operation that is both efficient and widely available. While SSE3 and SSE4.1 added horizontal add and dot product instructions, they're usually slower than a manual shuffle+add. Keep horizontal operations out of the inner loop: accumulate vertically, and reduce only once at the end of the loop.
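A minimal sketch of the manual shuffle+add approach, using only original-SSE intrinsics, with the horizontal step kept outside the loop. The function names hsum_ps and sum_array are hypothetical, and this is one of several reasonable shuffle sequences rather than a definitive best choice.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Horizontal sum of the four floats in v using shuffles and adds. */
float hsum_ps(__m128 v)
{
    /* Swap neighbours within each pair: {v1, v0, v3, v2}. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* {v0+v1, v0+v1, v2+v3, v2+v3} */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Move the high half of sums into the low half: lane 0 = v2+v3. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Lane 0 = (v0+v1) + (v2+v3). */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Sum an array: vertical adds inside the loop, one reduction at the end. */
float sum_array(const float *p, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));

    float total = hsum_ps(acc);

    /* Scalar cleanup for the last n % 4 elements. */
    for (; i < n; i++)
        total += p[i];
    return total;
}
```

Note that the vectorized loop adds the elements in a different order than a plain scalar loop, so results can differ slightly in the last bits for inputs that are not exactly representable sums.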

The lack of cross-manufacturer support made software development with SSE a challenge during the initial years. With AMD's adoption of SSE2 into its 64bit processors during 2003/2004, this problem gradually disappeared. As of today, there exist virtually no processors without SSE/SSE2 support. SSE2 is part of x86-64 baseline, with twice as many vector registers available in 64bit mode.

2314 questions

171

votes

5 answers

Header files for x86 SIMD intrinsics

Which header files provide the intrinsics for the different x86 SIMD instruction set extensions (MMX, SSE, AVX, ...)? It seems impossible to find such a list online. Correct me if I'm wrong.

x86 header-files sse simd intrinsics

asked Jun 27 '12 at 14:44

fredoverflow

256,549
94
388
662

157

votes

3 answers

What is the meaning of "non temporal" memory accesses in x86

This is a somewhat low-level question. In x86 assembly there are two SSE instructions: MOVDQA xmmi, m128 and MOVNTDQA xmmi, m128 The IA-32 Software Developer's Manual says that the NT in MOVNTDQA stands for Non-Temporal, and that otherwise…

x86 sse assembly

asked Aug 31 '08 at 20:18

Nathan Fellman

122,701
101
260
319

112

votes

6 answers

Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?

I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get…

performance assembly floating-point x86 sse

asked Oct 06 '09 at 23:45

Crashworks

40,496
12
101
170

100

votes

9 answers

Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?

Let's say the bottleneck of my Java program really is some tight loops to compute a bunch of vector dot products. Yes I've profiled, yes it's the bottleneck, yes it's significant, yes that's just how the algorithm is, yes I've run Proguard to…

java floating-point jit sse vectorization

asked May 28 '12 at 12:48

Sean Owen

66,182
23
141
173

votes

2 answers

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

I'm trying to optimize some matrix computations and I was wondering if it was possible to detect at compile-time if SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI[1] is enabled by the compiler ? Ideally for GCC and Clang, but I can manage with only one…

gcc clang sse avx avx512

asked Mar 09 '15 at 10:23

Baptiste Wicht

7,472
7
45
110

votes

5 answers

Fastest way to do horizontal SSE vector sum (or other reduction)

Given a vector of three (or four) floats. What is the fastest way to sum them? Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions in SSE3 worth it? What's the cost to moving to the FPU, then faddp, faddp?…

assembly optimization floating-point sse simd

asked Aug 09 '11 at 13:16

FeepingCreature

3,648
2
26
25

votes

7 answers

How to check if a CPU supports the SSE3 instruction set?

Is the following code valid to check if a CPU supports the SSE3 instruction set? Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP. bool CheckSSE3() { int CPUInfo[4] = {-1}; //-- Get number of valid info…

c++ sse instruction-set avx cpuid

asked May 25 '11 at 08:49

Stiefel

2,677
3
31
42

votes

11 answers

Fast method to copy memory with translation - ARGB to BGR

Overview I have an image buffer that I need to convert to another format. The origin image buffer is four channels, 8 bits per channel, Alpha, Red, Green, and Blue. The destination buffer is three channels, 8 bits per channel, Blue, Green, and…

c x86 rgb sse micro-optimization

asked Jul 24 '11 at 00:07

Adam Davis

91,931
60
264
330

votes

2 answers

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

I've been trying to figure out a performance problem in an application and have finally narrowed it down to a really weird problem. The following piece of code runs 6 times slower on a Skylake CPU (i5-6500) if the VZEROUPPER instruction is commented…

performance x86 intel sse avx

asked Dec 23 '16 at 15:09

Olivier

1,144
1
8
15

votes

8 answers

How is a vector's data aligned?

If I want to process data in a std::vector with SSE, I need 16 byte alignment. How can I achieve that? Do I need to write my own allocator? Or does the default allocator already align to 16 byte boundaries?

c++ vector sse memory-alignment allocator

asked Dec 10 '11 at 11:38

fredoverflow

256,549
94
388
662

votes

2 answers

Using AVX CPU instructions: Poor performance without "/arch:AVX"

My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX commands. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX. To use AVX, it is necessary to…

c++ performance visual-studio-2010 sse avx

asked Oct 20 '11 at 17:40

Mike

1,717
2
15
19

votes

1 answer

Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)

I'm a newbie at instruction optimization. I did a simple analysis on a simple function dotp which is used to get the dot product of two float arrays. The C code is as follows: float dotp( const float x[], const float y[],…

c assembly x86 sse micro-optimization

asked Jul 15 '17 at 01:14

Forward

votes

1 answer

How are denormalized floats handled in C#?

Just read this fascinating article about the 20x-200x slowdowns you can get on Intel CPUs with denormalized floats (floating point numbers very close to 0). There is an option with SSE to round these off to 0, restoring performance when such…

c# .net performance intel sse

asked Apr 07 '14 at 14:38

Robin Rodricks

110,798
141
398
607

votes

5 answers

SSE SSE2 and SSE3 for GNU C++

Is there a simple tutorial for me to get up to speed in SSE, SSE2 and SSE3 in GNU C++? How can you do code optimization in SSE?

c++ optimization simd sse sse2

asked Mar 19 '09 at 07:32

yoitsfrancis

4,278
14
44
73

votes

5 answers

Where can I find an official reference listing the operation of SSE intrinsic functions?

Is there an official reference listing the operation of the SSE intrinsic functions for GCC, i.e. the functions in the <*mmintrin.h> header files?

c++ c gcc sse simd

asked Aug 23 '11 at 06:07

NGaffney

1,542
1
15
16
