SIMD and difference between packed and scalar double precision

Question

I am reading Intel's intrinsics guide while implementing SIMD support. A few things confuse me; my questions are below.

__m128 _mm_cmpeq_ps (__m128 a, __m128 b) documentation says it is used to compare packed single precision floating points. What does "packed" mean? Do I need to pack my float values somehow before I can use them?
For double precision there are intrinsics like _mm_cmpeq_sd, which is described as comparing the "lower" double-precision floating-point elements. What do "lower" and "upper" double-precision elements mean? Can I use these intrinsics to compare a vector of C++ double elements, or do I need to process the data in some way before comparing?

score 38 · Accepted Answer · edited Jul 13 '22 at 11:45

In SSE, a 128-bit register can be treated as 4 elements of 32 bits or 2 elements of 64 bits.

SSE defines two types of operations: scalar and packed. A scalar operation operates only on the least-significant data element (bits 0-31 or 0-63), while a packed operation computes all elements in parallel.
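To make the scalar/packed distinction concrete, here is a minimal sketch using the addition intrinsics _mm_add_pd and _mm_add_sd. It assumes an x86 compiler with SSE2 enabled (the default on x86-64); the helper names add_packed and add_scalar are made up for illustration. The scalar form copies the upper element from the first operand unchanged.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE ones)
#include <array>

// Packed add: both 64-bit elements are added in one instruction.
// (Hypothetical helper name, for illustration only.)
std::array<double, 2> add_packed(double a_lo, double a_hi,
                                 double b_lo, double b_hi) {
    __m128d a = _mm_set_pd(a_hi, a_lo);  // _mm_set_pd takes the HIGH element first
    __m128d b = _mm_set_pd(b_hi, b_lo);
    std::array<double, 2> out;
    _mm_storeu_pd(out.data(), _mm_add_pd(a, b));  // out[0] = low, out[1] = high
    return out;
}

// Scalar add: only the low elements are added; the high element of the
// result is copied from the first operand (a).
std::array<double, 2> add_scalar(double a_lo, double a_hi,
                                 double b_lo, double b_hi) {
    __m128d a = _mm_set_pd(a_hi, a_lo);
    __m128d b = _mm_set_pd(b_hi, b_lo);
    std::array<double, 2> out;
    _mm_storeu_pd(out.data(), _mm_add_sd(a, b));
    return out;
}
```

With inputs a = (low 1.0, high 2.0) and b = (low 10.0, high 20.0), the packed version yields (11.0, 22.0) while the scalar version yields (11.0, 2.0).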

_mm_cmpeq_sd works on double-precision (64-bit) floating-point elements and compares only the least-significant element (the low 64 bits) of the two operands (scalar).

_mm_cmpeq_pd also works on double-precision (64-bit) floating-point elements, but compares both pairs of 64-bit elements in parallel (packed).

_mm_cmpeq_ss works on single-precision (32-bit) floating-point elements and compares only the least-significant element (the low 32 bits) of the two operands (scalar).

_mm_cmpeq_ps works on single-precision (32-bit) floating-point elements and compares all four pairs of 32-bit elements in parallel (packed).
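The comparison intrinsics do not return a bool: each compared element of the result is set to all ones when the comparison holds and all zeros otherwise. One common way to read that result is a movemask, as in the sketch below. It assumes SSE2 is available; the function names eq_mask_pd, eq_low_sd and eq_mask_ps are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE ones)

// Packed double compare. Returns a 2-bit mask: bit i is set when
// a[i] == b[i]. (Hypothetical helper name.)
int eq_mask_pd(const double a[2], const double b[2]) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    __m128d eq = _mm_cmpeq_pd(va, vb);  // per element: all ones if equal, else zeros
    return _mm_movemask_pd(eq);         // gathers the top bit of each 64-bit element
}

// Scalar double compare. Only the low elements are compared; the upper
// element of the result is just copied from va, so it is masked away here.
bool eq_low_sd(const double a[2], const double b[2]) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    __m128d eq = _mm_cmpeq_sd(va, vb);
    return (_mm_movemask_pd(eq) & 1) != 0;
}

// Packed float compare. Returns a 4-bit mask, one bit per 32-bit element.
int eq_mask_ps(const float a[4], const float b[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    return _mm_movemask_ps(_mm_cmpeq_ps(va, vb));
}
```

Bit 0 of each mask corresponds to the lowest element, which is the one loaded from index 0 of the array.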

If you're using 32-bit floats, you can pack them in groups of four to fill the 128-bit register. That way, _mm_cmpeq_ps performs 4 comparisons in parallel.

If you're using 64-bit doubles, you can pack them in pairs to fill the 128-bit register. That way, _mm_cmpeq_pd performs 2 comparisons in parallel.
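For the case in the question (a vector of C++ doubles), no special preparation is needed beyond having the values contiguous in memory, which std::vector already guarantees. The following is a minimal sketch, not a tuned implementation: all_equal is a hypothetical helper, unaligned loads are used so no alignment is assumed, and an odd trailing element is handled with a plain scalar comparison.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <vector>

// Returns true when x and y have the same length and every pair of
// elements compares equal (IEEE equality, so NaN never equals NaN).
// (Hypothetical helper name, for illustration only.)
bool all_equal(const std::vector<double>& x, const std::vector<double>& y) {
    if (x.size() != y.size()) return false;
    const std::size_t n = x.size();
    std::size_t i = 0;

    // Main loop: two doubles per iteration.
    for (; i + 2 <= n; i += 2) {
        __m128d a = _mm_loadu_pd(x.data() + i);  // loads x[i] (low) and x[i+1] (high)
        __m128d b = _mm_loadu_pd(y.data() + i);
        // Both mask bits must be set for both elements to be equal.
        if (_mm_movemask_pd(_mm_cmpeq_pd(a, b)) != 0x3) return false;
    }

    // Tail: at most one leftover element when n is odd.
    for (; i < n; ++i) {
        if (x[i] != y[i]) return false;
    }
    return true;
}
```

The same pattern works for float with _mm_loadu_ps, _mm_cmpeq_ps and _mm_movemask_ps, stepping by 4 and checking the mask against 0xF.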

If you want to make only one comparison at a time, you can use _mm_cmpeq_sd to compare two 64-bit doubles or _mm_cmpeq_ss to compare two 32-bit floats.

Note that _mm_cmpeq_sd and _mm_cmpeq_pd are SSE2 intrinsics, while _mm_cmpeq_ss and _mm_cmpeq_ps are SSE.

This answer is essentially OK except for "`_mm_cmpeq_sd` would only compare the least-significant data element (first 32 bits)". `_mm_cmpeq_sd ` is designed to work on `double`s (hence the letter `d` in the command name), so the correction is needed: "`_mm_cmpeq_sd` would only compare the least-significant data element (first 64 bits)". Similar misunderstanding is the next paragraph (only 2 doubles can fit into a 128 bit-long register, and the function's name should end with the letter "d"). — zkoza, Jan 12 '21 at 12:55
@zkoza yes there was a mixup between double and float operations, thanks for pointing it out. I've fixed it in the last edit and added all four scalar/packed and single/double operations to avoid any confusion. — zakinster, Jan 13 '21 at 09:37
Normally you get packed data into `__m128` and `__m128d` vectors by *loading it from contiguous memory*, you don't actually manually do any "packing". (You *can* `_mm_unpacklo_ps` to shuffle together two vectors, or 3 total shuffles to implement `_mm_set_ps(d,c,b,a)`, but it's much more efficient if your data is contiguous in the first place.) — Peter Cordes, Jul 13 '22 at 11:51

score 19 · Answer 2 · answered Apr 25 '13 at 15:26

In this context, "packed" means "several of the same type put into one lump" - so "packed single precision floating point" means 4 * 32 bit floating point numbers stored as a 128-bit value.

You either need to "pack" each value into the register using various PACK* instructions, or have the data already "packed" in memory, e.g. an array of (multiples of) 4 floating point values [that are suitably aligned].

Scalar means "one value" in the lower n bits of the register (e.g. a double would be the low 64 bits of a 128-bit SSE register).
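As a rough illustration of "lower" and "upper", the sketch below loads two contiguous doubles into a register and reads each element back. It assumes SSE2; the helper names load_pair, low_element and high_element are invented for this example.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Loads two contiguous doubles: p[0] becomes the low element (bits 0..63),
// p[1] becomes the high element (bits 64..127). (Hypothetical helper name.)
__m128d load_pair(const double* p) {
    return _mm_loadu_pd(p);
}

// Reads the low element of the vector.
double low_element(__m128d v) {
    return _mm_cvtsd_f64(v);
}

// Reads the high element: first shuffle it down into the low position,
// then extract it the same way.
double high_element(__m128d v) {
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}
```

For an array holding {3.0, 7.0}, low_element returns 3.0 and high_element returns 7.0. A vector built with _mm_set_sd(5.0) has 5.0 in the low element and 0.0 in the high one.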

Mats Petersson


If you have multiple scalar floats in XMM regs to shuffle into one register, you actually want to use shuffles like `unpcklps`. `pack` instructions like `packssdw` are narrowing integer operations. (So unpacking *with zero* is kind of the inverse of pack (widening integer elements), and this may be the source of this strange naming convention. Remember that Intel's integer SIMD (MMX) existed before fp `ps` SSE1 and `pd` SSE2.) – Peter Cordes Jan 13 '21 at 11:09
