Why does SSE/AVX lack loading an immediate value?

Question

As far as I know, there is no instruction in SSE/AVX for loading an immediate. One workaround is to load the value into a general-purpose register and then movd it into a vector register, but compilers seem to think this is more costly than loading from memory, even for a single scalar value.

This makes a memory access necessary for every operation that uses a common constant such as 1, 0x80000000, 0x7fffffff, 0x3f800000, 0x3f000000, etc. Granted, encoding these values in the machine code would take 4 bytes each, but so does a 32-bit absolute or rip-relative address, and I believe an immediate load is cheaper than any sort of memory load.

I always thought something like movss xmm, imm32 or broadcastss xmm, imm32 would be nice to have, but there must be a reason for not making such instructions. Why was it designed this way?
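
For illustration, the register-plus-movd workaround described above might look like this in C with SSE2 intrinsics. This is only a sketch: the helper names (`scalar_from_bits`, `broadcast_from_bits`, `lane_bits`, `low_lane`) are hypothetical, and an optimizing compiler is free to replace the whole sequence with a load from a memory constant, which is the behavior observed above.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: put a 32-bit pattern in the low lane of an XMM
   register via a general-purpose register (mov r32, imm32 + movd xmm, r32).
   movd zero-extends, so the upper three lanes are 0. */
static inline __m128 scalar_from_bits(uint32_t bits)
{
    return _mm_castsi128_ps(_mm_cvtsi32_si128((int)bits));
}

/* Hypothetical helper: the same pattern in all four lanes.
   This costs one extra shuffle (pshufd) after the movd. */
static inline __m128 broadcast_from_bits(uint32_t bits)
{
    __m128i lo = _mm_cvtsi32_si128((int)bits);
    return _mm_castsi128_ps(_mm_shuffle_epi32(lo, 0));
}

/* Hypothetical helper: read back the raw bits of lane i (0..3). */
static inline uint32_t lane_bits(__m128 v, int i)
{
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(v));
    return out[i];
}

/* Hypothetical helper: read the low lane as a float. */
static inline float low_lane(__m128 v)
{
    return _mm_cvtss_f32(v);
}
```

For example, `scalar_from_bits(0x3f800000u)` yields 1.0f in the low lane, and `broadcast_from_bits(0x3f000000u)` yields 0.5f in every lane.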

By contrast, ARM NEON does have instructions that broadcast an immediate value into a vector. Reasons posted as answers won't be convincing if they apply equally to NEON. — harold, May 06 '22 at 17:18
This is likely to be unanswerable unless somebody from the SSE/AVX design team sees the question and is willing to discuss what they were thinking. — Raymond Chen, May 06 '22 at 17:25
The standard solution for this is to load a constant from memory. This is how the instruction set was designed and it's the same on MMX and the x87 floating point unit. — fuz, May 06 '22 at 17:30
Several of those constants (where all the set bits are contiguous at one end of the register) can be generated in 2 instructions, starting with `pcmpeqd xmm0,xmm0` (all-ones). See [What are the best instruction sequences to generate vector constants on the fly?](https://stackoverflow.com/q/35085059) and Agner Fog's guide. But 2 instructions is still worse than 1, or a memory source operand, so compilers generally don't do that. — Peter Cordes, May 06 '22 at 17:51
@fuz I don't know much about the history of x86, but I think x87 was designed to load a constant from memory because it was originally a stack-machine-like coprocessor, and MMX was built on top of x87. SSE was a totally new design, so it doesn't have to follow x87 and MMX. — xiver77, May 06 '22 at 17:51
AVX-512 has `vpbroadcastd x/y/zmm, eax`, so you can construct any set1_epi32() constant with a mov-immediate + that. (Strangely compilers do sometimes use that, but not pcmpeqd / psrld). — Peter Cordes, May 06 '22 at 17:56
I've wondered if lack of mov-immediate to vector reg was a matter of never decoding more than a 1-byte immediate for vector instructions. Or some other quirk of convenience / inconvenience for existing Intel microarchitectures. Intel has definitely gimped their ISA for short-term convenience in the past, like SSE1 with `cvtsi2ss` and `sqrtss` merging into an XMM (false dep) instead of zero extending, because P3 handles 128-bit vectors as two 64-bit halves, so zero-extending would take 2 uops to write a full reg. GCC spends extra dep-breaking `pxor` instructions to work around it. — Peter Cordes, May 06 '22 at 17:57
@PeterCordes: But even a one-byte immediate could have been very useful. The NEON move-immediate only includes an 8-bit immediate (with a few different options for how to decode it), and that probably covers 95% of use cases. — Nate Eldredge, May 06 '22 at 18:12
@NateEldredge: Right, yes, 1 byte might actually be a better design choice than 32-bit. (Although ARM already has complex decoding for immediates for Thumb mode, while x86 at most does sign-extension. Except for bitfields for control operands for stuff like `roundps`, but that's kinda different.) Still, not a great argument. Possibly something microarchitectural about getting immediates used as values, as opposed to shuffle or other control operands for SIMD instructions. — Peter Cordes, May 06 '22 at 18:23
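
As a rough sketch of the `pcmpeqd`-plus-shift trick mentioned in the comments, the constants listed in the question can be built from an all-ones register using C with SSE2 intrinsics. The function names here are hypothetical. The two floating-point patterns (0x3f800000 and 0x3f000000) have their set bits in the middle of the word, so they need two shifts, i.e. three instructions in total. A compiler may also constant-fold any of these back into a memory load, so this shows the arithmetic rather than guaranteed code generation.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* All bits set: comparing a register with itself (pcmpeqd x, x). */
static inline __m128i all_ones(void)
{
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}

/* 0x7fffffff per lane: pcmpeqd + psrld 1 (float abs mask). */
static inline __m128i const_abs_mask(void)
{
    return _mm_srli_epi32(all_ones(), 1);
}

/* 0x80000000 per lane: pcmpeqd + pslld 31 (float sign mask). */
static inline __m128i const_sign_mask(void)
{
    return _mm_slli_epi32(all_ones(), 31);
}

/* Integer 1 per lane: pcmpeqd + psrld 31. */
static inline __m128i const_int_one(void)
{
    return _mm_srli_epi32(all_ones(), 31);
}

/* 0x3f800000 (1.0f) per lane: pcmpeqd + psrld 25 + pslld 23. */
static inline __m128i const_float_one_bits(void)
{
    return _mm_slli_epi32(_mm_srli_epi32(all_ones(), 25), 23);
}

/* 0x3f000000 (0.5f) per lane: pcmpeqd + psrld 26 + pslld 24. */
static inline __m128i const_float_half_bits(void)
{
    return _mm_slli_epi32(_mm_srli_epi32(all_ones(), 26), 24);
}

/* Read back the low 32-bit lane for inspection. */
static inline uint32_t low32(__m128i v)
{
    return (uint32_t)_mm_cvtsi128_si32(v);
}
```

To use one of the bit patterns as a float vector, reinterpret it with `_mm_castsi128_ps`, for example `_mm_and_ps(x, _mm_castsi128_ps(const_abs_mask()))` to compute an absolute value.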

0 Answers