On the Intel Intrinsics Guide there are some functions that are not mapped to assembly instructions: they are marked with ... on the right and have no performance (latency and throughput) information. These are marked as Sequence instructions, so my guess is that they are actual functions provided by immintrin.h that can be used instead of manual implementations of those tasks.

But what I don't understand is: are these just useful functions that Intel couldn't or wouldn't implement as real instructions, or are they wrappers for possible future versions of the AVX/AVX2/AVX-512 extensions?

I couldn't find any official information or other sources on this topic (my main sources are the Intel Intrinsics Guide and the Intel Compiler User Guide, although I work with gcc).

1 Answer
Things like _mm_set_epi32(int, int, int, int) would make no sense as a single machine instruction. It would need four r/m32 or register-only source operands (plus an XMM destination), but x86 machine code only ever has at most 3 operands including the destination (although for FMA all 3 are inputs). The only exceptions are vblendvps/pd and vpblendvb, where an immediate byte encodes a 4th register operand, but that's still only 4 total, not 4 reg/mem sources and a separate destination.
See also Why isn't movl from memory to memory allowed? and What kind of address instruction does the x86 cpu have?
And often you _mm_set with constants, and want the compiler to do constant-propagation to make a single vector constant. If you wanted a gather-load from pointers, you'd use _mm_i32gather_epi32 with a vector of indices.
So generally no, they're not placeholders for planned future instructions; they're basically just convenience functions whose implementation can vary a lot depending on whether the input operands are in memory or registers (e.g. for vector shuffles), and on what feature level is available: e.g. SSE4.1 pinsrd can be useful as part of _mm_set_epi32(0,0,b,a).
Or they're prototypes for SVML math functions like _mm_sin_ps, which aren't really intrinsics at all. Intel uses the same _mm naming scheme and includes those in a section of the Intrinsics Guide, partly as a convenience for people using Intel's own compiler (which comes with SVML), and perhaps partly to trick / trap people into depending on Intel APIs that make it harder to port their code to other compilers that have intrinsics but not SVML.
Or they're casts like _mm256_castsi256_si128, which are free in asm: just use the low XMM half of the YMM register. The C intrinsics API doesn't even have a way to ask for a __m128 where the low element is a scalar float and the upper elements are don't-care; you only have _mm_set1_ps (broadcast) or _mm_set_ss (zero-extend), and not all compilers can optimize that away if you only use the __m128 with operations that don't care about the upper elements. (Clang's shuffle optimizer can see what's going on.) That's annoying because a scalar float in a register is already just the low element of an XMM, but there's no equivalent of _mm256_castps128_ps256 (which gives you a vector with a don't-care upper half).
It would be plausible for some future CPU to introduce an instruction like vsinps to do in hardware what the SVML library function does, but it's unlikely: sin is too much work for one pipelined execution unit of reasonable length. x87 fsin is microcoded as 53-105 uops on Skylake, for example (https://agner.org/optimize/, https://uops.info/), and it's not generally faster than a well-optimized software implementation. A full hardware / microcoded fsin is convenient for toy programs written in asm, but it's not a big deal either way for real-world code that leaves that up to a compiler / math library.
Also, doing sin in hardware / microcode nails down speed vs. precision tradeoffs in ways that might not be what people want. Also related: Intel Underestimates Error Bounds by 1.3 quintillion - fsin is quite inaccurate for inputs very near Pi, and Intel only recently fixed their documentation. (It's not like software has an easier time, although software can use extended precision to gain accuracy.) Presumably a SIMD float version of the same instruction would have the same accuracy problems.
