
On the Intel Intrinsics Guide there are some functions that are not mapped to assembly instructions, and so are marked with ... on the right and have no performance (latency and throughput) information. These are marked as "Sequence", so my guess is that they are actual functions provided by immintrin.h that can be used instead of manual implementations of those tasks.
But what I don't understand is: are these just useful functions that Intel couldn't/wouldn't implement as real instructions, or are they wrappers for possible future versions of the AVX/AVX2/AVX-512 extensions?
I couldn't find any official information or other sources on this topic (my main sources are the Intel Intrinsics Guide and the Intel Compiler User Guide, although I work with gcc).

Nonoreve

1 Answer


Things like _mm_set_epi32(int, int, int, int) would make no sense as a single machine instruction. It would need four r/m32 or register-only source operands (and an XMM destination), but x86 machine code only ever has at most 3 operands including the destination. (Although for FMA all 3 are inputs.) The only exception is vblendvps/pd and vpblendvb, where an immediate byte encodes a 4th register operand, but that's still only 4 total, not 4 reg/mem sources plus a separate destination.

See also Why isn't movl from memory to memory allowed? and What kind of address instruction does the x86 cpu have?


And often you use _mm_set with constants, and want the compiler to do constant propagation to make a single vector constant. If you wanted a runtime gather load, you'd use _mm_i32gather_epi32 with a base pointer and a vector of indices.

So generally no, they're not placeholders for planned future instructions; they're basically just convenience functions whose implementation can vary a lot depending on whether the input operands are in memory or in registers (e.g. vector shuffles), and on what feature level is available. For example, SSE4.1 pinsrd can be useful as part of _mm_set_epi32(0,0,b,a).

Or prototypes for SVML math functions like _mm_sin_ps, not really intrinsics at all. Intel uses the same _mm naming scheme and includes those in a section of the intrinsics guide partly as a convenience for people using Intel's own compiler (which comes with SVML), and perhaps partly to trick / trap people into depending on Intel APIs that make it harder to port their code to other compilers that have intrinsics but not SVML.

Or they're casts like _mm256_castsi256_si128 which is free in asm, just use the XMM low half of the register.

The C intrinsics API doesn't even have a way to ask for a __m128 where the low element is a scalar float and the upper elements are don't-care: you only have the _mm_set1_ps broadcast or the _mm_set_ss zero-extend, and not all compilers can optimize those away if you only use the __m128 with operations that don't care about the upper elements. (Clang's shuffle optimizer can see what's going on.) That's annoying because a scalar float in a register is just the low element of an XMM, but there's no equivalent to _mm256_castps128_ps256 (which gives you a vector with a don't-care upper half).


It would be plausible for some future CPU to introduce an instruction like vsinps to do in hardware what the SVML library function does, but it's unlikely. sin is too much work for a pipelined execution unit of reasonable length: x87 fsin is microcoded as 53-105 uops on Skylake, for example (https://agner.org/optimize/ / https://uops.info/), and it's not generally faster than a well-optimized software implementation. Full hardware/microcode fsin is convenient for toy programs written in asm, but it's not a big deal either way for real-world code that leaves that up to a compiler / math library.

Also, doing sin in hardware / microcode nails down speed vs. precision tradeoffs in ways that might not be what people want. Also related: Intel Underestimates Error Bounds by 1.3 quintillion - fsin is quite inaccurate for inputs very near Pi, and Intel only recently fixed their documentation. (It's not like software has an easier time, although you can use extended-precision to gain accuracy.) Presumably, a SIMD float version of the same instruction would have the same accuracy problems.

Peter Cordes