Is there a simple tutorial for me to get up to speed in SSE, SSE2 and SSE3 in GNU C++? How can you do code optimization in SSE?
I wrote a library to make SSE2 optimizations easy - https://github.com/LiraNuna/glsl-sse2 – LiraNuna Apr 18 '11 at 02:53
I found this [blog post](http://minchechiu.blogspot.com/2009/05/sse.html) with lots of high quality links on SSE. – Philip Apr 18 '11 at 02:44
I found an interesting document here: http://ds9a.nl/gcc-simd/index.html – Charles Brunet Apr 14 '11 at 12:32
5 Answers
Sorry, I don't know of a tutorial.
Your best bet (IMHO) is to use SSE via the "intrinsic" functions Intel provides, each of which (generally) wraps a single SSE instruction. These are made available via a set of include files named *mmintrin.h, e.g. xmmintrin.h covers the original SSE instruction set.
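To give a feel for what intrinsics look like, here is a minimal sketch that adds two float arrays four elements at a time. The function name `add_arrays` is made up for illustration; the intrinsics themselves (`_mm_loadu_ps`, `_mm_add_ps`, `_mm_storeu_ps`) come from xmmintrin.h. It uses unaligned loads and stores so it makes no assumption about how the arrays are aligned.

```cpp
#include <xmmintrin.h>  // original SSE intrinsics: __m128, _mm_add_ps, ...
#include <cstddef>

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    // Main loop: process four floats per iteration in one 128-bit register.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // packed single-precision add
    }
    // Scalar tail for the remaining 0..3 elements.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Calling `add_arrays(a, b, out, 6)` on two six-element arrays handles the first four elements with one packed add and the last two in the scalar tail. On 32-bit x86 you would likely need -msse for this to compile; on x86-64 SSE and SSE2 are part of the baseline.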
Becoming familiar with the contents of Intel's Optimization Reference Manual is a good idea (see section 4.3.1.2 for an example of intrinsics), and the SIMD sections are essential reading. The instruction set reference manuals are pretty helpful too, in that each instruction's documentation includes the "intrinsic" function it corresponds to.
Do spend some time inspecting the assembler produced by the compiler from intrinsics (you'll learn a lot) and on profiling/performance measurement (you'll avoid wasting time SSE-ing code for little return on the effort).
Update 2011-05-31: There is some very nice coverage of intrinsics and vectorization in Agner Fog's optimization PDFs (thanks) although it's a bit spread about (e.g section 12 of the first one and section 5 of the second one). These aren't exactly tutorial material (in fact there's a "these manuals are not for beginners" warning) but they do rightly treat SIMD (whether used via asm, intrinsics or compiler vectorization) as just one part of the larger optimization toolbox.
Update 2012-10-04: A nice little Linux Journal article on gcc vector intrinsics deserves a mention here. It is more general than just SSE (it covers PPC and ARM extensions too). There's a good collection of references on the last page, which drew my attention to Intel's "intrinsics guide".

The simplest optimization is to let gcc emit SSE code itself.
Flags: -msse, -msse2, -msse3, -march=, -mfpmath=sse
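As a rough illustration of the kind of code those flags can help with, here is a plain loop with no intrinsics at all. The function name `saxpy` and the file name in the comment are just for illustration. Whether gcc actually vectorizes it depends on the compiler version and optimization level, so it is worth checking the generated assembly rather than assuming.

```cpp
#include <cstddef>

// Possible compile line (file name is hypothetical):
//   g++ -O3 -msse2 -mfpmath=sse -S saxpy.cpp
// then look for packed instructions such as mulps/addps in saxpy.s.

// y[i] = a * x[i] + y[i]. __restrict__ is a GCC extension that promises
// the compiler x and y do not overlap, which makes vectorizing easier.
void saxpy(float a, const float* __restrict__ x, float* __restrict__ y,
           std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The point of writing it this way is that the source stays portable: the same loop compiles for any target, and the flags only change what instructions gcc is allowed to pick.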
For a list of the i386 and x86-64 options, see http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options; the exact documentation for your specific compiler version is at http://gcc.gnu.org/onlinedocs/.
For optimization, always check out Agner Fog's site: http://agner.org/optimize/. I don't think he has SSE tutorials for intrinsics, but he has some really neat standard C++ tricks and also provides lots of information about coding SSE assembly (which can often be transcribed to intrinsics).

Not necessarily. From the manual: "Perform loop vectorization on trees. This flag is enabled by default at -O3." So the flag itself does not mention a specific platform (it could also be processing multiple bytes in standard 32-bit registers). – Sebastian Mach Feb 01 '10 at 09:11
@SebastianMach: Interesting, yes if vector regs aren't available, GCC will sometimes vectorize bitwise stuff with SWAR, for `char` or `short` elements with 64-bit integer registers, even for `+` when that means more work than for bitwise `^` or something: https://godbolt.org/z/T69WMP36b – Peter Cordes Feb 17 '22 at 21:07
Check out the -mtune and -march options, -msse*, and -mfpmath of course. All of those enable GCC to do SSE-specific optimizations.
Anything beyond that is the realm of Assembler, I am afraid.

No assembler needed. GCC has extensions to support special datatypes and "function calls" for using MMX/SSE. – Zan Lynx Mar 19 '09 at 07:40
Admittedly, these are thinly disguised wrappers for the assembly so if you cannot program SSE in asm, the extensions won't help you much. – Zan Lynx Mar 19 '09 at 07:42
Actually, intrinsics are more than just wrappers around assembly. They allow the compiler to rearrange your code for maximum performance. But you do need to have a good understanding of how SIMD works. – Dave Van den Eynde Mar 19 '09 at 08:17
I realized that when I read your answer, Zan. Didn't know that GCC offered them, so thanks for the hint. (Though I don't actually need SSE for anything I do, I like collecting obscure knowledge. :-D ) – DevSolar Mar 19 '09 at 10:26
Also the compiler does register allocation for you when using intrinsics, which is a major relief. – Axel Gneiting Apr 14 '11 at 12:38
MSDN has a pretty good description of the SSE compiler built-ins (and those built-ins are a de facto standard; they even work in clang/Xcode).
- https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics
- https://learn.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2010/kcwz153a(v=vs.100)
The nice thing about that reference is that it shows equivalent pseudocode, so e.g. you can learn that the ADDPD instruction is:
r0 := a0 + b0
r1 := a1 + b1
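In intrinsic form that pseudocode corresponds to `_mm_add_pd` from emmintrin.h (SSE2). A small sketch, where `add_pd_demo` is a made-up name for illustration:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_add_pd, ...

// Hypothetical helper: r[0] = a[0] + b[0], r[1] = a[1] + b[1],
// computed with a single packed double-precision add.
void add_pd_demo(const double a[2], const double b[2], double r[2]) {
    __m128d va = _mm_loadu_pd(a);          // load a0, a1 (no alignment assumed)
    __m128d vb = _mm_loadu_pd(b);          // load b0, b1
    _mm_storeu_pd(r, _mm_add_pd(va, vb));  // r0 := a0 + b0, r1 := a1 + b1
}
```

Element 0 of the arrays lands in the low half of the register, so the array order matches the r0/r1 numbering in the pseudocode.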
And here's a good description of the cryptic shuffle instruction: http://www.songho.ca/misc/sse/sse.html
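For a concrete taste of the shuffle, here is a small sketch using `_mm_shuffle_ps` and the `_MM_SHUFFLE` macro from xmmintrin.h. The helper names `reverse4` and `low_halves` are invented for this example. The thing that trips people up is that the macro's arguments are listed from the highest result element down to the lowest, and that the two low result elements are picked from the first operand while the two high ones are picked from the second.

```cpp
#include <xmmintrin.h>  // _mm_shuffle_ps, _MM_SHUFFLE

// Hypothetical helper: out = {in[3], in[2], in[1], in[0]}.
void reverse4(const float in[4], float out[4]) {
    __m128 v = _mm_loadu_ps(in);
    // _MM_SHUFFLE(0, 1, 2, 3): result[3] takes index 0, result[2] takes
    // index 1, result[1] takes index 2, result[0] takes index 3.
    __m128 r = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_ps(out, r);
}

// Hypothetical helper: out = {a[0], a[1], b[0], b[1]}.
void low_halves(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    // Low two result elements come from va, high two from vb.
    __m128 r = _mm_shuffle_ps(va, vb, _MM_SHUFFLE(1, 0, 1, 0));
    _mm_storeu_ps(out, r);
}
```

Passing the same register as both operands, as in `reverse4`, is the usual way to get an arbitrary permutation of a single vector.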
A simple tutorial? Not that I know of.
But any information about using MMX or any version of SSE will be useful for learning, whether for GCC or for ICC or VC.
To learn about GCC's vector extensions, type "info gcc" and go to Node: Vector Extensions.
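As a quick sketch of what those vector extensions look like, the snippet below declares a 16-byte vector of four floats and multiplies two of them with the ordinary `*` operator. The type name `v4sf` and the function `sum_of_products` are just illustrative names. Indexing a vector with `[]` needs a reasonably recent GCC (or clang); with an older compiler you may have to copy the vector back into an array first.

```cpp
#include <cstring>  // std::memcpy

// GCC vector extension: four floats packed into 16 bytes.
typedef float v4sf __attribute__((vector_size(16)));

// Hypothetical helper: returns a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3].
float sum_of_products(const float a[4], const float b[4]) {
    v4sf va, vb;
    std::memcpy(&va, a, sizeof va);  // copy in without assuming alignment
    std::memcpy(&vb, b, sizeof vb);
    v4sf p = va * vb;                // element-wise multiply on all four lanes
    return p[0] + p[1] + p[2] + p[3];
}
```

The appeal over intrinsics is that the same source is not tied to SSE: the compiler picks whatever vector instructions the target offers, or falls back to scalar code.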
