
Suppose you have a __m128 variable holding 4 single-precision values and you want the minimum one: is there an intrinsic function for that, or anything other than a naive linear comparison among the values?

Right now my solution is the following (suppose the input __m128 variable is x):

x = _mm_min_ps(x, _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(x), 4)));
min = _mm_cvtss_f32(_mm_min_ss(x, _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(x), 8))));

This is quite horrible, but it works (by the way, is there anything like _mm_srli_si128 but for the __m128 type?)

Filippo Bistaffa
    Duplicate of [Fastest way to do horizontal float vector sum on x86](https://stackoverflow.com/q/6996764), but replace `add` with `min`. – Peter Cordes Jul 14 '18 at 00:25

2 Answers


There is no single instruction/intrinsic, but you can do it with two shuffles and two mins:

__m128 _mm_hmin_ps(__m128 v)
{
    // min of v and a copy of v rotated by one element
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)));
    // min of that result and a copy of it rotated by two elements
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
    return v;
}

The output vector will contain the min of all the elements in the input vector, replicated throughout the output vector.
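
For example, to read the result back out as a scalar, you can use _mm_cvtss_f32 on the low element (the input values here are just for illustration):

__m128 v = _mm_set_ps(7.0f, 6.0f, 9.0f, 5.0f); // elements 3..0 = 7, 6, 9, 5
__m128 m = _mm_hmin_ps(v);                     // (5, 5, 5, 5)
float minval = _mm_cvtss_f32(m);               // 5.0f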

Paul R
  • are you sure this is right? I tested with this input vector (from `v[0]` to `v[3]`) `0.109375 0.096875 0.093750 0.096875` and this is the output: `0.096875 0.096875 0.093750 0.093750`. What do you mean by replicated? – Filippo Bistaffa Jul 14 '13 at 13:18
  • Sorry - I messed up the permute constants - it's fixed and tested now. The output vector has all 4 elements equal to the min element of the input vector. – Paul R Jul 14 '13 at 13:49
  • If you don't need the result broadcasted, you can let the compiler save a `movaps` or two with SSE1 or SSE3 by using `_mm_movehl_ps` (`movhlps`) and optionally SSE3 `_mm_movehdup_ps`. See [Fastest way to do horizontal float vector sum on x86](https://stackoverflow.com/q/6996764). (Just replace `add` with `min`; the shuffling is the same.) – Peter Cordes Jul 14 '18 at 00:23

Paul R's answer is great! (@Paul R - if you read this thank you!) I just wanted to try to explain how it actually works for anyone new to SSE stuff like me. Of course I might be wrong somewhere, so any corrections are welcome!

How does _mm_shuffle_ps work?

First of all, SSE registers have indexes that go in reverse to what you might expect, like this:

[6, 9, 8, 5] // values
 3  2  1  0  // indexes

This order of indexing makes vector left-shifts move data from low to high indices, just like left-shifting the bits in an integer. The most-significant element is at the left.
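
A quick way to see this order in practice (a throwaway example with arbitrary values): _mm_set_ps takes its arguments from index 3 down to index 0, so the leftmost argument becomes the highest element:

#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
    __m128 v = _mm_set_ps(6.0f, 9.0f, 8.0f, 5.0f); // indexes 3, 2, 1, 0
    float out[4];
    _mm_storeu_ps(out, v); // out[i] holds the element at index i
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); // prints: 5 8 9 6
    return 0;
}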


_mm_shuffle_ps can mix the contents of two registers:

// __m128 a : (a3, a2, a1, a0)
// __m128 b : (b3, b2, b1, b0)
// In _MM_SHUFFLE(3, 2, 1, 0), the first two indexes (3, 2) select from the
// second operand (a), and the last two (1, 0) select from the first operand (b):
__m128 two_from_a_and_two_from_b = _mm_shuffle_ps(b, a, _MM_SHUFFLE(3, 2, 1, 0));
// two_from_a_and_two_from_b : (a3, a2, b1, b0)
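
To make that concrete with made-up values:

__m128 a = _mm_set_ps(13.0f, 12.0f, 11.0f, 10.0f); // a = (13, 12, 11, 10)
__m128 b = _mm_set_ps(23.0f, 22.0f, 21.0f, 20.0f); // b = (23, 22, 21, 20)
__m128 r = _mm_shuffle_ps(b, a, _MM_SHUFFLE(3, 2, 1, 0));
// r = (a3, a2, b1, b0) = (13, 12, 21, 20)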

Here, we only want to shuffle the values of one register, not two. We can do that by passing v as both parameters, like this (you can see this in Paul R's function):

// __m128 v : (v3, v2, v1, v0)
__m128 v_rotated_left_by_1 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3));
// v_rotated_left_by_1 : (v2, v1, v0, v3) // i.e. move all elements left by 1 with wraparound

I'm going to wrap it in a macro for readability though:

#define mm_shuffle_one(v, pattern)  _mm_shuffle_ps(v, v, pattern)

(It can't be a function because the pattern argument to _mm_shuffle_ps must be constant at compile time.)

Here's a slightly modified version of the actual function – I added intermediate names for readability, as the compiler optimizes them out anyway:

inline __m128 _mm_hmin_ps(__m128 v){
    __m128  v_rotated_left_by_1 = mm_shuffle_one(v,  _MM_SHUFFLE(2, 1, 0, 3));
    __m128 v2 = _mm_min_ps(v,   v_rotated_left_by_1);

    __m128 v2_rotated_left_by_2 = mm_shuffle_one(v2, _MM_SHUFFLE(1, 0, 3, 2));
    __m128 v3 = _mm_min_ps(v2, v2_rotated_left_by_2);

    return v3;
}

Why are we shuffling the elements the way we are? And how do we find the smallest of four elements with just two min operations?

I had some trouble following how you can min 4 floats with just two vectorized min operations, but I understood it once I manually traced which values get min'd together, step by step. (Though it's probably more fun to work it out on your own than to read it.)

Say we've got v:

[7,6,9,5] v

First, we min the values of v and v_rotated_left_by_1:

[7,6,9,5] v
 3 2 1 0  // (just the indices of the elements)
[6,9,5,7] v_rotated_left_by_1
 2 1 0 3  // (the indexes refer to v, and we rotated it left by 1, so the indices are shifted)
--------- min
[6,6,5,5] v2
 3 2 1 0 // (explained
 2 1 0 3 //  below    )

Each column under an element of v2 tracks which indexes of v were min'd together to get that element. So, going column-wise left to right:

v2[3] == 6 == min(v[3], v[2])
v2[2] == 6 == min(v[2], v[1])
v2[1] == 5 == min(v[1], v[0])
v2[0] == 5 == min(v[0], v[3])

Now the second min:

[6,6,5,5] v2
 3 2 1 0
 2 1 0 3
[5,5,6,6] v2_rotated_left_by_2
 1 0 3 2
 0 3 2 1
--------- min
[5,5,5,5] v3
 3 2 1 0
 2 1 0 3
 1 0 3 2
 0 3 2 1

Voila! Each column under v3 contains (3,2,1,0) - each element of v3 has been min'd with all the elements of v - so each element contains the minimum of the whole vector v.

After using the function, you can extract the minimum value with float _mm_cvtss_f32(__m128):

__m128 min_vector = _mm_hmin_ps(my_vector);
float minval = _mm_cvtss_f32(min_vector);
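
As a quick sanity check, you can compare the result against a plain scalar loop (a throwaway test, assuming the _mm_hmin_ps from above is in scope):

#include <assert.h>
#include <xmmintrin.h>

void test_hmin(void)
{
    float vals[4] = { 7.0f, 6.0f, 9.0f, 5.0f };

    // scalar reference: linear scan
    float ref = vals[0];
    for (int i = 1; i < 4; i++)
        if (vals[i] < ref) ref = vals[i];

    // vectorized version
    float minval = _mm_cvtss_f32(_mm_hmin_ps(_mm_loadu_ps(vals)));

    assert(minval == ref); // both are 5.0f
}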

***

This is just a tangential thought, but what I found interesting is that this approach could be extended to sequences of arbitrary length, rotating the result of the previous step by 1, 2, 4, 8, ..., up to 2**(ceil(log2(len(v))) - 1) (I think) at each step. That's cool from a theoretical perspective - if you can compare two sequences element-wise simultaneously, you can find the minimum/maximum1 of a sequence in logarithmic time!

1 This extends to all horizontal folds/reductions, like sum. Same shuffles, different vertical operation.
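
For example, the horizontal sum is the same skeleton with _mm_add_ps swapped in for _mm_min_ps (a sketch, not from the original answer):

inline __m128 _mm_hsum_ps(__m128 v)
{
    v = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)));
    v = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
    return v; // every element now holds v0 + v1 + v2 + v3
}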

However, AVX (256-bit vectors) makes 128-bit boundaries special, and harder to shuffle across. If you only want a scalar result, extract the high half so that every step halves the vector width. (Like in [Fastest way to do horizontal float vector sum on x86](https://stackoverflow.com/q/6996764), which has more efficient shuffles than 2x shufps for 128-bit vectors, avoiding some movaps instructions when compiling without AVX.)
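
Based on that link and the comments below, the narrowing scalar-result version for 128-bit vectors might look like this (a sketch, I haven't benchmarked it):

static inline float hmin_ps_scalar(__m128 v)
{
    __m128 hi   = _mm_movehl_ps(v, v);           // (v3, v2, v3, v2)
    __m128 mins = _mm_min_ps(v, hi);             // low two elements: (min(v1,v3), min(v0,v2))
    // broadcast element 1 down to element 0's position, then one scalar min
    __m128 odd  = _mm_shuffle_ps(mins, mins, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_min_ss(mins, odd)); // min of all four elements
}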

But if you want the result broadcast to every element like in @PaulR's answer, you'd want to do in-lane shuffles first (i.e. rotate within the 4 elements of each 128-bit lane), then swap halves, or rotate 128-bit lanes.
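
I can't test AVX on my machine, but following that idea (and Peter Cordes's comments below), a broadcast version for __m256 might look something like this (untested sketch):

#include <immintrin.h>

static inline __m256 mm256_hmin_ps(__m256 v)
{
    // in-lane: broadcast each 128-bit lane's min within that lane
    v = _mm256_min_ps(v, _mm256_permute_ps(v, _MM_SHUFFLE(2, 1, 0, 3)));
    v = _mm256_min_ps(v, _mm256_permute_ps(v, _MM_SHUFFLE(1, 0, 3, 2)));
    // cross-lane: swap the two 128-bit halves (vperm2f128) and min them in
    v = _mm256_min_ps(v, _mm256_permute2f128_ps(v, v, 0x01));
    return v;
}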

uryga
  • *vectors of arbitrary length*: With AVX / AVX512, it's usually best to narrow in half down to 128 bit vectors. So extract the high half, then min high/low halves (like [Horizontal sum of 32-bit floats in 256-bit AVX vector](https://stackoverflow.com/q/23189488) but with min instead of add). Or if you really want the result broadcast at the end instead of narrowing to scalar, you still want to minimize lane-crossing shuffles. e.g. `vperm2f128` to swap high/low, then `vminps` and in-lane shuffles. – Peter Cordes Jul 14 '18 at 00:18
  • But for the 128-bit case without AVX, see [Fastest way to do horizontal float vector sum on x86](https://stackoverflow.com/q/6996764) for shuffles that save the compiler a `movaps`. (e.g. use `_mm_movehl_ps` (`movhlps`)). Also, the intrinsic for getting the low element of a vector as a scalar float is `_mm_cvtss_f32`. `_mm_store_ss` will probably optimize away, but taking the address of a local is overcomplicated. – Peter Cordes Jul 14 '18 at 00:20
  • Thanks for your comment. That's not really what I was talking about, and I changed my answer a bit to reflect that, but it's still good to know! Although my laptop doesn't support anything beyond SSE, so I won't be testing that anyway... – uryga Jul 14 '18 at 00:27
  • Since you said "any corrections are welcome", I made some edits to improve / clarify your answer. Feel free to re-edit to put in your own words, or remove anything you think is too much of a tangent. – Peter Cordes Jul 14 '18 at 00:46
  • Thank you! Good to see a correction to the bit about `_mm_store_ss` - tbh it was simply the first method I found. btw, which one would you recommend if I'm writing the result to an array? – uryga Jul 14 '18 at 00:55
  • I'd probably still use `*ptr = _mm_cvtss_f32(v)`; it will still compile to a `movss [mem], xmm0`. Of course, it would probably be even better to arrange things so you calculate a whole vector of values and `_mm_storeu_ps(ptr, v)`. i.e. in a nested loop where the outer loop is over destination elements and the inner loop is a reduction, SIMDing the outer loop to do 4 reductions in parallel can be best. (Depending on how the source data is laid out, you might still need shuffles at the end, but you could use `_mm_shuffle_ps` with 2 different inputs to be doing 4 useful `min`s at each step.) – Peter Cordes Jul 14 '18 at 01:01
  • I'll try it – it just looks nicer ;) the function I'm writing is basically min+max over subranges of an array, and the resulting `min`s and `max`es are stored interleaved: `[min_range1, max_range1, min_range2, max_range2, min_range3, max_range3, ...]`. I guess I can try accumulating (min1, max1, min2, max2) in an intermediate __m128 and then writing them out at once with `_mm_storeu_ps` instead of outputting directly to the array - sounds like it could be faster? I'll see if that improves anything. – uryga Jul 14 '18 at 01:32
  • Do you need to interleave min/max? It's normally more SIMD-friendly to have an array of `min[i]` and a separate array of `max[i]`. That way you could clamp or check 4 values in parallel for being out of bounds. But IDK how you're using the array of pairs you create. If it's only used for scalar ops, then cache locality (and only needing one base address) favours keeping pairs together. Don't go overboard with shuffles: modern Intel only has 1 shuffle per clock throughput, and 1 store per clock. So if you need to shuffle anyway, doing it in a way that saves stores is good. Otherwise meh. – Peter Cordes Jul 14 '18 at 02:47
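
A sketch of the single-store idea from the last few comments (hypothetical helper, layout as described by uryga; min1/max1/min2/max2 are assumed to be the already-reduced results for two ranges):

#include <xmmintrin.h>

static inline void store_minmax_pairs(float *dst,
                                      float min1, float max1,
                                      float min2, float max2)
{
    // _mm_set_ps takes elements high-to-low, so after the store
    // dst[0..3] = min1, max1, min2, max2 - one store instead of four
    _mm_storeu_ps(dst, _mm_set_ps(max2, min2, max1, min1));
}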