1

I found the following solution for `__m128i`:

int horizontal_max_Vec4i(__m128i x) {
    __m128i max1 = _mm_shuffle_epi32(x, _MM_SHUFFLE(0,0,3,2));
    __m128i max2 = _mm_max_epi32(x,max1);
    __m128i max3 = _mm_shuffle_epi32(max2, _MM_SHUFFLE(0,0,0,1));
    __m128i max4 = _mm_max_epi32(max2,max3);
    return _mm_cvtsi128_si32(max4);
}

What would be the equivalent function that returns the maximum float of an `__m128`?

(I can use any version of SSE and AVX)

I would appreciate any help.

CheckersGuy
  • 117
  • 10

2 Answers

5

You can translate your algorithm directly into the single-precision floating-point versions of the intrinsics. I'm not saying this is the optimal solution, but something like this works:

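float horizontal_max_Vec4(__m128 x) {
    // Bring the upper two elements (2 and 3) down to positions 0 and 1.
    __m128 max1 = _mm_shuffle_ps(x, x, _MM_SHUFFLE(0,0,3,2));
    // Element 0 = max(x0, x2), element 1 = max(x1, x3).
    __m128 max2 = _mm_max_ps(x, max1);
    // Bring element 1 down to position 0.
    __m128 max3 = _mm_shuffle_ps(max2, max2, _MM_SHUFFLE(0,0,0,1));
    // Element 0 now holds the maximum of all four inputs.
    __m128 max4 = _mm_max_ps(max2, max3);
    // Extract element 0 as a scalar float, without a store to memory.
    return _mm_cvtss_f32(max4);
}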
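As a quick sanity check, the function can be compared against a plain scalar loop. This is only a minimal sketch: `scalar_max4` and `horizontal_max_agrees` are hypothetical helper names that are not part of the answer, and the function body is repeated so the snippet compiles on its own.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_shuffle_ps, _mm_max_ps, _mm_cvtss_f32 */

/* Same algorithm as in the answer, repeated so this sketch is self-contained. */
float horizontal_max_Vec4(__m128 x) {
    __m128 max1 = _mm_shuffle_ps(x, x, _MM_SHUFFLE(0,0,3,2));
    __m128 max2 = _mm_max_ps(x, max1);
    __m128 max3 = _mm_shuffle_ps(max2, max2, _MM_SHUFFLE(0,0,0,1));
    __m128 max4 = _mm_max_ps(max2, max3);
    return _mm_cvtss_f32(max4);
}

/* Hypothetical scalar reference: maximum of four floats with a plain loop. */
float scalar_max4(const float v[4]) {
    float m = v[0];
    for (int i = 1; i < 4; ++i) {
        if (v[i] > m) m = v[i];
    }
    return m;
}

/* Hypothetical helper: returns 1 if the SIMD and scalar results match.
   _mm_loadu_ps is an unaligned load, so v needs no special alignment. */
int horizontal_max_agrees(const float v[4]) {
    return horizontal_max_Vec4(_mm_loadu_ps(v)) == scalar_max4(v);
}
```

Calling `horizontal_max_agrees` with the largest value placed in each of the four positions in turn exercises both shuffles.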
  • This makes me look stupid xD Just got started with those intrinsics and completely missed that there is the `_mm_store1_ps`. Thanks :P – CheckersGuy Sep 09 '17 at 00:32
  • 3
    Larsson and @CheckersGuy: you don't want to store to memory. Use `return _mm_cvtss_f32(max4)`. (Tip for searching [Intel's intrinsics guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=float%20_mm): `float _mm` will find `_mm*` intrinsics that return `float`). If you're unlucky, your compiler will compile that to an *actual* store/reload. (And you don't need `result` to be aligned if you're only using `movss` (store1), not a 16-byte store.) – Peter Cordes Sep 09 '17 at 01:59
  • 1
    Also, if you don't have AVX, you can save some MOVAPS instructions with careful choice of shuffles if compilers don't optimize this for you. Especially if you have SSE3. Horizontal MAX needs the same shuffles as horizontal ADD, so see https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86. – Peter Cordes Sep 09 '17 at 02:00
  • Agreed, `_mm_cvtss_f32` is probably less likely to confuse a compiler. –  Sep 09 '17 at 06:47
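
The shuffle choice mentioned in the comments can be sketched roughly as follows. This is an adaptation of the SSE1 horizontal-add pattern from the linked question with the adds swapped for max operations, not code taken from this thread, and `horizontal_max_sse1` is a hypothetical name. With SSE3, `_mm_movehdup_ps` (from `<pmmintrin.h>`) could replace the first shuffle.

```c
#include <xmmintrin.h>  /* SSE1: _mm_shuffle_ps, _mm_max_ps, _mm_movehl_ps, _mm_max_ss, _mm_cvtss_f32 */

/* Hypothetical name; a sketch of a horizontal max using only SSE1. */
float horizontal_max_sse1(__m128 v) {
    /* Swap neighbours within each pair: elements become [v1, v0, v3, v2]. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Element 0 = max(v0, v1), element 2 = max(v2, v3). */
    __m128 maxs = _mm_max_ps(v, shuf);
    /* Move the high half of maxs into the low half. Writing the result back
       into shuf gives the compiler a chance to avoid an extra register copy. */
    shuf = _mm_movehl_ps(shuf, maxs);
    /* Scalar max on element 0 only: max(max(v0, v1), max(v2, v3)). */
    maxs = _mm_max_ss(maxs, shuf);
    return _mm_cvtss_f32(maxs);
}
```

Whether this actually saves instructions depends on the compiler and target flags, so it is worth checking the generated assembly before preferring it.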
1

You can use DirectXMath. MS has done everything for you on `__m128`.

F.Eazism
  • 43
  • 10