Are different mmx, sse and avx versions complementary or supersets of each other?

Question

I'm thinking I should familiarize myself with x86 SIMD extensions. But before I even began I ran into trouble. I can't find a good overview on which of them are still relevant.

The x86 architecture has accumulated a lot of math/multimedia extensions over decades:

MMX
3DNow!
SSE
SSE2
SSE3
SSSE3
SSE4
AVX
AVX2
AVX512
Did I forget something?

Are the newer ones supersets of the older ones and vice versa? Or are they complementary?

Are some of them deprecated? Which of these are still relevant? I've heard references to "legacy SSE".

Are some of them mutually exclusive? I.e. do they share the same hardware parts?

Which should I use together to maximize hardware utilization on modern Intel / AMD CPUs? For the sake of argument, let's assume I can find appropriate uses for the instructions... heating my house with the CPU if nothing else.

Off the top of my head: none are deprecated; SSE is a series, as is AVX; mixing SSE and AVX is not a good idea. I'm sure Wikipedia or official docs can resolve the details. — Jeff Hammond, Jul 18 '15 at 15:26
One could argue that the aligned load instructions are deprecated. There is no reason to use them since Nehalem, at least on paper. — Z boson, Jul 20 '15 at 10:56

score 25 · Answer 1 · edited Sep 11 '17 at 17:00

They are complementary.

Each new instruction set extension adds new instructions and sometimes a new programming model (new registers, for example).

None are deprecated; deprecating instructions is almost impossible for compatibility reasons. However, optional extensions that are not very widespread may be absent from, or removed in, newer models (like AMD's FMA4).
Some are vestigial, though: everything that can be done with the FPU and MMX, for example, can be done more efficiently with SSE+.

They are not mutually exclusive: you are not forced to pick one or the other. After all, they are instructions, not modes of operation (like real vs protected mode, for example).
The only possible "conflict" is between MMX and the FPU, as they share the lower part of the same set of registers but have different programming models.
The vector registers have grown from 128 bits to 256 bits and then to 512 bits; each time, the previous registers became the low part of the newer ones.
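
To make the "low part" relationship concrete, here is a toy Python model (not real SIMD code). A 64-byte buffer plays the role of one ZMM register; the variable names `zmm0`, `ymm0` and `xmm0` are only illustrative, and the sketch says nothing about performance.

```python
# Toy model only: one 64-byte buffer stands in for a 512-bit ZMM register.
# x86 is little-endian, so the "low part" is the first bytes of the buffer.
zmm0 = bytearray(64)            # 512 bits
view = memoryview(zmm0)
ymm0 = view[:32]                # low 256 bits of zmm0
xmm0 = view[:16]                # low 128 bits of zmm0 (and of ymm0)

# Writing through the narrow view changes the low bytes of the wider ones.
xmm0[:] = bytes(range(1, 17))

low_of_ymm = bytes(ymm0[:16])   # same 16 bytes, seen through ymm0
low_of_zmm = bytes(zmm0[:16])   # same 16 bytes, seen through zmm0
upper_of_zmm = bytes(zmm0[16:]) # the remaining 384 bits
```

One thing the model glosses over: a legacy-SSE write to an XMM register leaves the upper bits of the wider register alone (as in this sketch), while a VEX-encoded 128-bit write zeroes them.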

You can use all of them together; each offers specific hardware support for simple operations.

They are like Lego bricks, you are only limited by your imagination (or the imagination of the designers).

Here is a simple list of these instruction set extensions.
Only some features are listed; for the complete reference see the Intel manual, Vol. 1, chapters 9 to 14.

See also https://hjlebbink.github.io/x86doc/ for a table of contents of Intel's volume 2 (instruction set reference) manual, with a list of extensions that added instructions to that manual entry.

MMX
Introduces eight 64-bit registers (MM0-MM7) and instructions to work with eight signed/unsigned bytes, four signed/unsigned words, or two signed/unsigned dwords.
3DNow!
Adds support for single-precision floating-point operands to MMX. Few operations are supported, for example addition, subtraction, and multiplication.
SSE
Introduces eight 128-bit registers (XMM0-XMM7; sixteen, XMM0-XMM15, in 64-bit mode) and instructions to work with four single-precision floating-point operands. Also adds integer operations on MMX registers. (The MMX-integer part of SSE is sometimes called MMXEXT, and was implemented on a few non-Intel CPUs without xmm registers and the floating-point part of SSE.)
SSE2
Introduces instructions to work with two double-precision floating-point operands, and with packed byte/word/dword/qword integers in 128-bit xmm registers.
SSE3
Adds a few varied instructions (mostly floating point), including a special kind of unaligned load (lddqu) that was better on Pentium 4, synchronization instructions, and horizontal add/sub.
SSSE3
Again a varied set of instructions, mostly integer. Includes the first shuffle that takes its control operand from a register instead of a hard-coded immediate (pshufb). More horizontal processing, shuffle, packing/unpacking, mul+add on bytes, and some specialized integer add/mul stuff.
SSE4 (SSE4.1, SSE4.2)
Adds a lot of instructions, filling in many of the gaps by providing min, max and other operations for all integer data types (32-bit integers especially had been lacking); previously integer min was only available for unsigned bytes and signed 16-bit words. Also scaling, FP rounding, blending, linear algebra operations, text processing, and comparisons. Also a non-temporal load for reading video memory, or copying it back to main memory. (Previously only NT stores were available.)
AESNI
Adds support for accelerating AES symmetric encryption/decryption.
AVX
Adds eight/sixteen 256-bit registers (YMM0-YMM7/15). Supports all previous floating-point data types. Three-operand instructions.
FMA
Adds fused multiply-add and related instructions.
AVX2
Adds 256-bit support for integer data types.
AVX512F
Adds eight/thirty-two 512-bit registers (ZMM0-ZMM7/31) and eight 64-bit mask registers (k0-k7). Promotes most previous instructions to 512 bits wide. Optional parts of AVX512 add instructions for exponentials & reciprocals (AVX512ER), scatter/gather prefetching (AVX512PF), scatter conflict detection (AVX512CD), compress, expand.
IMCI (Intel Xeon Phi)
Early development of AVX512 for the first-gen Intel Xeon Phi (Knights Corner) coprocessor.
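
The "eight bytes, four words, two dwords" wording in the list is easier to see with actual bits. The following is a rough Python sketch using the standard `struct` module, not real SIMD code: the same register contents are simply reinterpreted as different packed lane types. The helper name `lanes` and the sample byte values are made up for illustration.

```python
import struct

# Eight arbitrary bytes standing in for the contents of one 64-bit MMX register.
reg = bytes([0xFF, 0x01, 0x80, 0x00, 0x10, 0x20, 0x30, 0x40])

def lanes(data: bytes, fmt_char: str) -> tuple:
    """Reinterpret `data` as packed little-endian lanes of one struct type."""
    size = struct.calcsize("<" + fmt_char)
    return struct.unpack(f"<{len(data) // size}{fmt_char}", data)

unsigned_bytes = lanes(reg, "B")   # eight unsigned bytes
signed_bytes = lanes(reg, "b")     # eight signed bytes
words = lanes(reg, "H")            # four unsigned 16-bit words
dwords = lanes(reg, "I")           # two unsigned 32-bit dwords

# Same idea at 128 bits: SSE treats an xmm register as four floats,
# SSE2 can treat the very same bits as two doubles.
xmm = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
floats = lanes(xmm, "f")
doubles = lanes(xmm, "d")          # same bits, different interpretation
```

The register itself carries no type; it is the instruction you pick that decides how the bits are split into lanes.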

Nice explanation. I would just also mention that in a similar vein to the MMX/FP overlap, XMMi is actually the lower portion of YMMi and that, in turn, is the lower portion of ZMMi. — hayesti, Jul 18 '15 at 16:20
Not sure it's correct to say anything that can be done with the FPU can be done with SSE more efficiently. Try doing 80-bit floats with SSE... — user541686, Jul 18 '15 at 19:05
Hmm, knm241 & Peter Cordes, both of you have very good answers which complement each other. Care to merge the information? Thank you both! — snoukkis, Jul 18 '15 at 23:27
You forgot SSE4a and XOP. XOP is particularly important because it adds for example 64-bit compare instructions that Intel only gets with AVX512. It completes the integer set of operations which the scalar instructions have except for `adc` and `mulx`. — Z boson, Jul 20 '15 at 10:54

score 23 · Accepted Answer · edited May 23 '17 at 12:06

I recently updated the tag wikis for SSE, AVX, and x86 (and SSE2, avx2). They cover a lot of this. tl;dr summary: AVX rolls up all the previous SSE versions, and provides 3-operand versions of those instructions. Also 256b versions of most FP (AVX) and int (AVX2) insns.

For summaries of the various SSE versions, see wikipedia, or knm241's more-detailed answer.

We don't really think of that as making SSE obsolete. More like, think of AVX as a new and better version of the same old SSE instructions. They're still in the ref manual under their non-AVX names (PSHUFB, not VPSHUFB, for example). You can mix AVX and SSE code, as long as you use VZEROUPPER when needed to avoid the performance problem from mixing VEX with non-VEX insns (on Intel). So there is some annoyance in dealing with cases where you have to call into libraries that might run non-VEX SSE instructions, or where your code uses SSE FP math but also has some AVX code to be run only if the CPU supports it.
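
As a rough illustration of "run the AVX code only if the CPU supports it", here is a small Python sketch that picks the newest usable level from a CPU flags line. It assumes the flag spellings Linux reports in /proc/cpuinfo (where SSE3 shows up as `pni`); the names `SIMD_LEVELS`, `parse_cpu_flags` and `best_simd_level` are hypothetical. In C you would more likely query CPUID, for instance through GCC/Clang's `__builtin_cpu_supports`.

```python
# Flag names as spelled on the "flags" line of Linux's /proc/cpuinfo
# (assumption: Linux; SSE3 is reported as "pni"). Ordered oldest to newest.
SIMD_LEVELS = [
    ("sse2", "SSE2"),
    ("pni", "SSE3"),
    ("ssse3", "SSSE3"),
    ("sse4_1", "SSE4.1"),
    ("sse4_2", "SSE4.2"),
    ("avx", "AVX"),
    ("avx2", "AVX2"),
    ("avx512f", "AVX512F"),
]

def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()

def best_simd_level(flags: set) -> str:
    """Walk the list in order and stop at the first missing extension."""
    best = "none"
    for flag, name in SIMD_LEVELS:
        if flag not in flags:
            break
        best = name
    return best

# Hypothetical flags line for a CPU with AVX but without AVX2.
sample = "flags\t\t: fpu mmx sse sse2 pni ssse3 sse4_1 sse4_2 avx"
level = best_simd_level(parse_cpu_flags(sample))
```

On a real Linux machine you could pass `open("/proc/cpuinfo").read()` instead of `sample`. Stopping at the first gap is a deliberately conservative choice: it relies on each level implying the older ones, which is how AVX rolls up the SSE versions.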

If CPU compatibility were a non-issue, the legacy-SSE versions of vector instructions would be truly obsolete, like MMX is now. AVX/AVX2 is at least slightly better in every way, if you count the VEX-encoded 128b version of an insn as AVX, not SSE. Sometimes you'd still use 128b registers because your data only comes in chunks that big, but more often you'd work with 256b registers to do the same op on twice as much data at once.

SSE/AVX/x87-FP/integer instructions all use the same execution ports. You can't get more done in parallel by mixing them (except on Haswell, where one of the 4 ALU ports can only handle non-vector insns, like GP reg ops and branches).

Hmm, knm241 & Peter Cordes, both of you have very good answers which complement each other. Care to merge the information? Thank you both! — snoukkis, Jul 18 '15 at 23:29
I'll mark this one correct because it's shorter and so the other answer will still be visible after this one. — snoukkis, Jul 18 '15 at 23:35
I'm just going to reference knm's answer in my 2nd paragraph. I want to keep my answer as short as possible, so it's on target. A list of what was in each SSE version is available elsewhere. Maybe not summarized as nicely as knm241's answer. I did add a paragraph about how if CPU compat was a non issue, we really would never use the non-VEX encoding of vector instructions in new code. I think the VEX encoding is sometimes 1 byte longer, but usually not. The only reason not to would be to avoid `vzeroupper` when calling SSE code you can't recompile/reassemble. — Peter Cordes, Jul 19 '15 at 00:48
thanks, that paragraph made the question of deprecation more clear — snoukkis, Jul 19 '15 at 09:28
