41

I'm thinking I should familiarize myself with x86 SIMD extensions. But before I even began I ran into trouble. I can't find a good overview on which of them are still relevant.

The x86 architecture has accumulated a lot of math/multimedia extensions over decades:

  • MMX
  • 3DNow!
  • SSE
  • SSE2
  • SSE3
  • SSSE3
  • SSE4
  • AVX
  • AVX2
  • AVX512
  • Did I forget something?

Are the newer ones supersets of the older ones and vice versa? Or are they complementary?

Are some of them deprecated? Which of these are still relevant? I've heard references to "legacy SSE".

Are some of them mutually exclusive? I.e. do they share the same hardware parts?

Which should I use together to maximize hardware utilization on modern Intel / AMD CPUs? For sake of argument, let's assume I can find appropriate uses for the instructions... heating my house with the CPU if nothing else.

snoukkis

2 Answers

25

They are complementary.

Each new instruction set extension adds new instructions and sometimes a new programming model (new registers, for example).

None are deprecated; deprecating instructions is almost impossible for compatibility reasons. However, optional extensions that never became widespread may be absent from, or removed in, newer models (AMD's FMA4, for example).
Some are vestigial, though: almost everything that can be done with the x87 FPU and MMX can be done more efficiently with SSE and later.
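
Because optional extensions can be missing on a given model, code usually checks for them at run time before using them. A minimal sketch, assuming GCC or Clang on x86: `__builtin_cpu_init` and `__builtin_cpu_supports` are compiler builtins rather than standard C, and the helper names `report_simd_support` and `has_avx2_fma` are made up for illustration.

```c
#include <stdio.h>

/* Assumption: GCC or Clang targeting x86. __builtin_cpu_supports() is a
   compiler builtin and requires a string literal, hence the macro. */
#define HAS(feature) (__builtin_cpu_supports(feature) ? 1 : 0)

/* Print which of a few extensions the running CPU reports (1 = present). */
void report_simd_support(void)
{
    __builtin_cpu_init();
    printf("sse2     %d\n", HAS("sse2"));
    printf("sse4.2   %d\n", HAS("sse4.2"));
    printf("avx      %d\n", HAS("avx"));
    printf("avx2     %d\n", HAS("avx2"));
    printf("fma      %d\n", HAS("fma"));
    printf("avx512f  %d\n", HAS("avx512f"));
}

/* Returns 1 only if AVX, AVX2 and FMA are all present, otherwise 0. */
int has_avx2_fma(void)
{
    __builtin_cpu_init();
    return HAS("avx") && HAS("avx2") && HAS("fma");
}
```

Calling `report_simd_support()` once at startup is enough to see what a given machine offers; the output depends on the CPU, so none is shown here.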

They are not mutually exclusive: you don't have to choose one or the other, because they are instructions, not modes of operation (like real vs. protected mode, for example).
The only possible "conflict" is between MMX and the x87 FPU: the MMX registers alias the low 64 bits of the x87 registers, but the two use different programming models.
The vector registers have grown from 128 bits to 256 bits and then to 512 bits; each time, the previous registers became the low part of the new ones (XMMi is the low part of YMMi, which in turn is the low part of ZMMi).
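
The register aliasing can be observed from C with intrinsics. This is a sketch under a few assumptions: GCC or Clang on x86, with the `target("avx")` function attribute used so the file does not need `-mavx`. The function names are hypothetical. `_mm256_castps256_ps128` reinterprets the low 128 bits of a 256-bit value as a 128-bit value without moving any data.

```c
#include <immintrin.h>
#include <string.h>

/* Assumption: GCC or Clang on x86. The target attribute lets this one
   function use AVX even if the rest of the file is built without -mavx. */
__attribute__((target("avx")))
static int low_half_matches_avx(void)
{
    const float src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float low[4];

    __m256 y = _mm256_loadu_ps(src);       /* eight floats in a YMM register */
    __m128 x = _mm256_castps256_ps128(y);  /* the XMM view: low 128 bits of y */
    _mm_storeu_ps(low, x);

    /* The XMM view should hold exactly the first four floats. */
    return memcmp(low, src, sizeof low) == 0;
}

/* Returns 1 if the XMM view equals the low half, -1 if the CPU lacks AVX. */
int xmm_is_low_half_of_ymm(void)
{
    __builtin_cpu_init();
    if (!__builtin_cpu_supports("avx"))
        return -1;
    return low_half_matches_avx();
}
```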

You can use all of them together; each offers hardware support for specific simple operations.

They are like Lego bricks, you are only limited by your imagination (or the imagination of the designers).


Here is a simple list of these instruction set extensions.
Only some features are listed; for the complete reference see the Intel manual, Volume 1, chapters 9 to 14.

See also https://hjlebbink.github.io/x86doc/ for a table of contents of Intel's volume 2 (instruction set reference) manual, with a list of extensions that added instructions to that manual entry.

  • MMX
    Introduces eight 64-bit registers (MM0-MM7) and instructions to work on eight signed/unsigned bytes, four signed/unsigned words, or two signed/unsigned dwords.

  • 3DNow!
    Adds support for single-precision floating-point operands to MMX. Few operations are supported, for example addition, subtraction, and multiplication.

  • SSE
    Introduces eight/sixteen 128-bit registers (XMM0-XMM7/15; eight in 32-bit mode, sixteen in 64-bit mode) and instructions to work with four single-precision floating-point operands. Adds integer operations on MMX registers too. (The MMX-integer part of SSE is sometimes called MMXEXT, and was implemented on a few non-Intel CPUs without xmm registers and the floating-point part of SSE.)

  • SSE2
    Introduces instructions to work with two double-precision floating-point operands, and with packed byte/word/dword/qword integers in 128-bit xmm registers.

  • SSE3
    Adds a few varied instructions (mostly floating point), including a special kind of unaligned load (lddqu) that was better on Pentium 4, synchronization instructions, and horizontal add/sub.

  • SSSE3
    Again a varied set of instructions, mostly integer. The first shuffle that takes its control operand from a register instead of a hard-coded immediate (pshufb). More horizontal processing, shuffle, packing/unpacking, mul+add on bytes, and some specialized integer add/mul stuff.

  • SSE4 (SSE4.1, SSE4.2)
    Adds a lot of instructions, filling in many of the gaps by providing min, max, and other operations for all integer data types (32-bit integer especially had been lacking); previously, integer min was only available for unsigned bytes and signed 16-bit words. Also scaling, FP rounding, blending, linear algebra operations, text processing, and comparisons. Also a non-temporal load for reading video memory, or copying it back to main memory. (Previously only NT stores were available.)

  • AESNI
    Adds support for accelerating AES symmetric encryption/decryption.

  • AVX
    Adds eight/sixteen 256-bit registers (YMM0-YMM7/15). Supports all previous floating-point data types. Three-operand instructions.

  • FMA
    Adds fused multiply-add and related instructions.

  • AVX2
    Extends AVX to integer data types: 256-bit versions of most integer instructions.

  • AVX512F
    Adds eight/thirty-two 512-bit registers (ZMM0-ZMM7/31) and eight mask registers (k0-k7; 16 bits wide in AVX512F, widened to 64 bits by AVX512BW). Promotes most previous instructions to 512 bits wide. Optional parts of AVX512 add instructions for exponentials and reciprocals (AVX512ER), scatter/gather prefetching (AVX512PF), scatter conflict detection (AVX512CD), compress, and expand.

  • IMCI (Intel Xeon Phi)
    Early development of AVX512 for the first-gen Intel Xeon Phi (Knights Corner) coprocessor.
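
To make the SSE and SSE2 entries in the list above concrete, here is a small sketch using compiler intrinsics. It assumes an x86-64 target, where SSE and SSE2 are always available, so no extra compiler flags are needed. The function names are illustrative only. All three functions operate on one 128-bit xmm register's worth of data.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE ones */
#include <stdint.h>

/* SSE: add four single-precision floats with one packed add (addps). */
void add4_floats(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* SSE2: add two double-precision floats (addpd). */
void add2_doubles(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}

/* SSE2: add four packed 32-bit integers (paddd) in an xmm register. */
void add4_int32(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

The unaligned load/store intrinsics are used so the caller does not have to guarantee 16-byte alignment.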

Peter Cordes
  • Nice explanation. I would just also mention that in a similar vein to the MMX/FP overlap, XMMi is actually the lower portion of YMMi and that, in turn, is the lower portion of ZMMi. – hayesti Jul 18 '15 at 16:20
  • @hayesti, yep. I'm updating the answer. –  Jul 18 '15 at 17:40
  • 4
    Not sure it's correct to say anything that can be done with the FPU can be done with SSE more efficiently. Try doing 80-bit floats with SSE... – user541686 Jul 18 '15 at 19:05
  • Hmm, knm241 & Peter Cordes, both of you have very good answers which complement each other. Care to merge the information? Thank you both! – snoukkis Jul 18 '15 at 23:27
  • You forgot SSE4a and XOP. XOP is particularly important because it adds for example 64-bit compare instructions that Intel only gets with AVX512. It completes the integer set of operations which the scalar instructions have except for `adc` and `mulx`. – Z boson Jul 20 '15 at 10:54
23

I recently updated the tag wikis for SSE, AVX, and x86 (and SSE2, avx2). They cover a lot of this. tl;dr summary: AVX rolls up all the previous SSE versions, and provides 3-operand versions of those instructions. Also 256b versions of most FP (AVX) and int (AVX2) insns.

For summaries of the various SSE versions, see Wikipedia, or knm241's more-detailed answer.

We don't really think of that as making SSE obsolete. Rather, think of AVX as a new and better version of the same old SSE instructions. They're still in the ref manual under their non-AVX names (PSHUFB, not VPSHUFB, for example). You can mix AVX and SSE code, as long as you use VZEROUPPER when needed to avoid the performance problem from mixing VEX with non-VEX insns (on Intel). So there is some annoyance in dealing with cases where you have to call into libraries that might run non-VEX SSE instructions, or where your code uses SSE FP math but also has some AVX code to be run only if the CPU supports it.
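
The last case, baseline code plus an AVX path that runs only when the CPU supports it, could look roughly like this. It is a sketch, assuming GCC or Clang on x86-64; the function names are hypothetical. `_mm256_zeroupper()` is the intrinsic for VZEROUPPER. Compilers typically insert it on their own at the end of AVX functions, so the explicit call is mainly there to show where it belongs.

```c
#include <immintrin.h>
#include <stddef.h>

/* Baseline path: plain C, which an x86-64 compiler turns into SSE/SSE2 code. */
static void add_arrays_baseline(const float *a, const float *b,
                                float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* AVX path: eight floats per iteration. The target attribute (GCC/Clang)
   enables AVX for this function only. */
__attribute__((target("avx")))
static void add_arrays_avx(const float *a, const float *b,
                           float *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)  /* scalar tail for the leftover elements */
        out[i] = a[i] + b[i];

    /* Clear the upper halves of the YMM registers before returning to
       code that may use legacy (non-VEX) SSE encodings. */
    _mm256_zeroupper();
}

/* Pick the AVX path only if the running CPU reports AVX support. */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        add_arrays_avx(a, b, out, n);
    else
        add_arrays_baseline(a, b, out, n);
}
```

Callers only ever see `add_arrays`, so the dispatch decision stays in one place.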

If CPU compatibility were a non-issue, the legacy-SSE versions of vector instructions would be truly obsolete, like MMX is now. AVX/AVX2 is at least slightly better in every way, if you count the VEX-encoded 128b version of an insn as AVX, not SSE. Sometimes you'd still use 128b registers because your data only comes in chunks that big, but more often you'd work with 256b registers to do the same op on twice as much data at once.

SSE/AVX/x87-FP/integer instructions all use the same execution ports, so you can't get more done in parallel by mixing them (except on Haswell, where one of the 4 ALU ports can only handle non-vector insns, like GP reg ops and branches).

Peter Cordes
  • Hmm, knm241 & Peter Cordes, both of you have very good answers which complement each other. Care to merge the information? Thank you both! – snoukkis Jul 18 '15 at 23:29
  • I'll mark this one correct because it's shorter and so the other answer will still be visible after this one. – snoukkis Jul 18 '15 at 23:35
  • I'm just going to reference knm's answer in my 2nd paragraph. I want to keep my answer as short as possible, so it's on target. A list of what was in each SSE version is available elsewhere. Maybe not summarized as nicely as knm241's answer. I did add a paragraph about how if CPU compat was a non issue, we really would never use the non-VEX encoding of vector instructions in new code. I think the VEX encoding is sometimes 1 byte longer, but usually not. The only reason not to would be to avoid `vzeroupper` when calling SSE code you can't recompile/reassemble. – Peter Cordes Jul 19 '15 at 00:48
  • thanks, that paragraph made the question of deprecation more clear – snoukkis Jul 19 '15 at 09:28