Is there a register that contains the number of elements in a vector/array that I must load prior to using an operation like `mulss` or `addss`, or perhaps do I have to push that number on the stack? How do SSE instructions know the length of the vector without running past the end?

-
the SSE registers are of fixed bit length. Instructions like `mulss` and similar use all the bits of the register. Whether some bits contain meaningful value or not, and whether following code will use such result is of no concern to the `mulss` itself, it will simply multiply all of them. I.e. the simple answer is "no", the number of elements is specified by the instruction used, it's fixed per particular opcode. Running past the end is sometimes safe (when the code and memory usage is designed like that), so in some cases the code will intentionally run past and calculate junk too. – Ped7g Jul 26 '18 at 13:49
-
Prior to AVX512, each vector has the same length (16 bytes with SSE, 32 bytes with AVX). You can do masked loads/stores to read/write fewer elements, but typically the code is written to simply process the tail of the array sequentially. This is one of the major annoyances when writing vectorised code. – fuz Jul 26 '18 at 13:49
-
MULSS and ADDSS instructions are scalar operations and only multiply and add respectively the lowest single-precision value of their vector operands. The other elements of the destination operand are left unchanged. (The VMULSS and VADDSS instructions work slightly differently, they clear all the bits numbered 128 and higher in the destination, bits 32 to 127 inclusive are left unchanged.) – Ross Ridge Jul 26 '18 at 14:11
1 Answer
x86 SIMD instructions operate on vector registers, not memory directly. (Or with a memory source operand of fixed width, e.g. `addss xmm, xmm/m32`, meaning the source operand is either another XMM register or a 32-bit memory operand, vs. `addps xmm, xmm/m128` (packed single instead of scalar single), which takes a 128-bit source operand that can be a register or memory.)
Some historical vector machines (like Cray) had vector instructions that were more like x86's `rep movsd`, where you supplied pointers + a length and let the hardware sort it out (see footnote 1).
But modern short-vector SIMD instruction sets are not like that at all. The code has to be compiled (or hand-written) for a specific vector length.
You have to write loops that avoid going past the end of your arrays. See "How to avoid the error of AVX2 when the matrix dimension isn't multiples of 4?" for an example of handling inputs that aren't a multiple of the vector width.
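As a minimal sketch of that loop structure (not taken from the linked question), here is the usual "vector body plus scalar tail" pattern in C with SSE intrinsics. The function name `fma_arrays` is hypothetical, and unaligned loads/stores are assumed so the arrays need no special alignment:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, ... */

/* c[i] += a[i] * b[i] for i in [0, n). Hypothetical helper for illustration. */
void fma_arrays(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Vector body: 4 floats per iteration, only while a full
       16-byte vector still fits inside the arrays. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        vc = _mm_add_ps(vc, _mm_mul_ps(va, vb));  /* mulps + addps */
        _mm_storeu_ps(c + i, vc);
    }

    /* Scalar tail: the remaining 0..3 elements, one at a time. */
    for (; i < n; i++)
        c[i] += a[i] * b[i];
}
```

The element count lives in an ordinary loop counter that the code itself compares against `n`; the SIMD instructions never see it.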
Also note that `mulss` is a scalar instruction that uses only the low element of an XMM register. The "Operation" section of its entry in Intel's instruction-set reference manual describes exactly what it does:
DEST[31:0]←DEST[31:0] * SRC[31:0]
DEST[MAXVL-1:32] (Unmodified)
Where MAXVL is the maximum vector width supported by the hardware. (Legacy SSE instructions leave the upper lanes unmodified for annoying backwards-compat reasons, unlike the AVX encoding (`vmulss`), which is three-operand: it copies bits 127:32 from the first source and zeros the bits above 127, avoiding a false dependency on the old value of the destination.)
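To make the scalar/packed difference concrete, here is a small sketch using the C intrinsics that correspond to `mulss` and `mulps`. The helper name `scalar_vs_packed` is hypothetical; whether the compiler emits the legacy SSE or the VEX encoding depends on build flags, but the intrinsic results are the same either way:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Multiply a and b both ways and store the two 4-float results.
   Hypothetical helper for illustration only. */
void scalar_vs_packed(const float a[4], const float b[4],
                      float out_ss[4], float out_ps[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    /* mulss: only element 0 is multiplied; elements 1..3 come from va. */
    _mm_storeu_ps(out_ss, _mm_mul_ss(va, vb));

    /* mulps: all four elements are multiplied. */
    _mm_storeu_ps(out_ps, _mm_mul_ps(va, vb));
}
```

With `a = {1, 2, 3, 4}` and `b = {10, 10, 10, 10}`, `out_ss` holds `{10, 2, 3, 4}` and `out_ps` holds `{10, 20, 30, 40}`. In both cases the width is fixed by the instruction, not by any length register.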
Footnote 1: Agner Fog proposed an ISA with variable-length vector registers that would maybe be something like what you imagine. A design goal is to let existing binaries take advantage of future hardware with wider vectors without needing to be recompiled / rewritten for wider vector lengths, for simple "vertical" problems like vector dot product or `c[i] += a[i] * b[i]`. See the discussion about it on his blog, including some comparison of it vs. x86 SSE / ARM NEON / PowerPC AltiVec and other modern short-vector SIMD ISAs.

-
Note that these variable-length vector registers do exist in silicon already. They are part of SVE, an ARM extension. – fuz Jul 26 '18 at 16:44