16

I have just started using SSE and I am confused how to get the maximum integer value (max) of a __m128i. For instance:

__m128i t = _mm_setr_ps(0,1,2,3);
// max(t) = 3;

Searching around led me to MAXPS instruction but I can't seem to find how to use that with "xmmintrin.h".

Also, is there any documentation for "xmmintrin.h" that you would recommend, rather than looking into the header file itself?

romeric
Shane
  • The shuffles you need are the same as for a horizontal sum, or pretty much any other horizontal reduction. See https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86 for some optimized versions for float, integer, and double, with SSE2, SSE3, and AVX. Also discussion of what shuffles are optimal on which CPUs. – Peter Cordes Sep 09 '17 at 02:36
  • 1
    This question seems to be confused about float vs. integer. `__m128i` is an integer vector. `*_ps` and `MAXPS` are packed-single float. For documentation, see [the SSE tag wiki](https://stackoverflow.com/tags/sse/info) for links, and many more links at https://stackoverflow.com/tags/x86/info. One very good resource is [**Intel's intrinsics search/finder**](https://software.intel.com/sites/landingpage/IntrinsicsGuide/) which has details on what each one does, but not as much detail as in the asm instruction reference manual. – Peter Cordes Sep 09 '17 at 02:39

4 Answers

19

In case anyone cares, and since intrinsics seem to be the way to go these days, here is a solution in terms of intrinsics. Note that _mm_max_epi32 requires SSE4.1.

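#include <smmintrin.h>  // for _mm_max_epi32

int horizontal_max_Vec4i(__m128i x) {
    __m128i max1 = _mm_shuffle_epi32(x, _MM_SHUFFLE(0,0,3,2));     // lanes 2,3 -> lanes 0,1
    __m128i max2 = _mm_max_epi32(x,max1);                          // lane 0 = max(x0,x2), lane 1 = max(x1,x3)
    __m128i max3 = _mm_shuffle_epi32(max2, _MM_SHUFFLE(0,0,0,1));  // lane 1 -> lane 0
    __m128i max4 = _mm_max_epi32(max2,max3);                       // lane 0 = max of all four
    return _mm_cvtsi128_si32(max4);                                // extract lane 0
}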

I don't know if that's any better than this:

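#include <algorithm>    // std::max
#include <emmintrin.h>  // SSE2, for _mm_store_si128

int horizontal_max_Vec4i(__m128i x) {
    int result[4] __attribute__((aligned(16))) = {0};  // 16-byte aligned for _mm_store_si128 (GCC/Clang syntax)
    _mm_store_si128((__m128i *) result, x);            // spill all four lanes to memory
    return std::max(std::max(std::max(result[0], result[1]), result[2]), result[3]);
}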
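For targets without SSE4.1, the same shuffle pattern can be kept and _mm_max_epi32 replaced by a compare-and-select built from SSE2 instructions. The following is only a sketch: the helper names max_epi32_sse2 and horizontal_max_Vec4i_sse2 are made up for illustration and are not part of any header.

```c
#include <emmintrin.h>  /* SSE2 */

/* Lane-wise signed 32-bit max using only SSE2 (hypothetical helper). */
static inline __m128i max_epi32_sse2(__m128i a, __m128i b) {
    __m128i a_gt_b = _mm_cmpgt_epi32(a, b);            /* all-ones where a > b */
    return _mm_or_si128(_mm_and_si128(a_gt_b, a),      /* take a where a > b   */
                        _mm_andnot_si128(a_gt_b, b));  /* otherwise take b     */
}

/* Horizontal max of four signed 32-bit ints (hypothetical helper). */
static inline int horizontal_max_Vec4i_sse2(__m128i x) {
    __m128i hi   = _mm_shuffle_epi32(x, _MM_SHUFFLE(0, 0, 3, 2));    /* lanes 2,3 -> 0,1 */
    __m128i max2 = max_epi32_sse2(x, hi);
    __m128i odd  = _mm_shuffle_epi32(max2, _MM_SHUFFLE(0, 0, 0, 1)); /* lane 1 -> 0      */
    __m128i max4 = max_epi32_sse2(max2, odd);
    return _mm_cvtsi128_si32(max4);                                  /* extract lane 0   */
}
```

With the vector from the question written as integers, horizontal_max_Vec4i_sse2(_mm_setr_epi32(0, 1, 2, 3)) should give 3.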
Z boson
11

If you find yourself needing to do horizontal operations on vectors, especially if it's inside an inner loop, then it's usually a sign that you are approaching your SIMD implementation in the wrong way. SIMD likes to operate element-wise on vectors - "vertically" if you like, not horizontally.
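As a rough illustration of the vertical approach, assuming the goal is the maximum of a whole float array: keep four running maxima in one register with _mm_max_ps inside the loop, and reduce across lanes only once at the end. The function name max_of_array is hypothetical, the sketch assumes n >= 1, and it ignores NaN handling.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Maximum of n floats; assumes n >= 1 (hypothetical helper). */
float max_of_array(const float *data, size_t n) {
    float best = data[0];
    size_t i = 0;
    if (n >= 4) {
        __m128 vmax = _mm_loadu_ps(data);                     /* four running maxima */
        for (i = 4; i + 4 <= n; i += 4)
            vmax = _mm_max_ps(vmax, _mm_loadu_ps(data + i));  /* vertical step       */

        /* Single cross-lane reduction, outside the loop. */
        float lanes[4];
        _mm_storeu_ps(lanes, vmax);
        best = lanes[0];
        for (int k = 1; k < 4; ++k)
            if (lanes[k] > best) best = lanes[k];
    }
    for (; i < n; ++i)                                        /* scalar tail         */
        if (data[i] > best) best = data[i];
    return best;
}
```

Since the cross-lane step runs once per array rather than once per iteration, its cost barely matters, so a plain store and scalar compare is enough here.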

As for documentation, there is a very useful reference on intel.com, the Intel Intrinsics Guide, which contains all the opcodes and intrinsics for everything from MMX through the various flavours of SSE all the way up to AVX and AVX-512.

Paul R
  • Thank you for the link. The horizontal part is for a loop condition only but I will revise my approach – Shane Mar 26 '12 at 20:23
  • The link is currently: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ – Mark Lakata Dec 05 '14 at 00:09
  • @MarkLakata: thanks - answer updated - I miss the old off-line guide - as well as working without an internet connection it was also useful in that you could scrape the data for other uses. Never mind though - the new online version is still good. – Paul R Dec 05 '14 at 07:28
10

According to this page, there is no horizontal max, and you need to test the elements vertically:

movhlps xmm1,xmm0         ; Move top two floats to lower part of xmm1
maxps   xmm0,xmm1         ; Get the maximum of the two sets of floats
pshufd  xmm1,xmm0,0x55    ; Move second float to lower part of xmm1
maxps   xmm0,xmm1         ; Get the maximum of the two remaining floats

Likewise, for the minimum:

movhlps xmm1,xmm0
minps   xmm0,xmm1
pshufd  xmm1,xmm0,0x55
minps   xmm0,xmm1
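The same sequence can be written with intrinsics. This is a sketch, not a tuned implementation: the names horizontal_max_ps and horizontal_min_ps are invented here, _mm_shuffle_ps stands in for pshufd so that everything stays in the float domain, and the last step uses the scalar _mm_max_ss / _mm_min_ss because only lane 0 is returned.

```c
#include <xmmintrin.h>  /* SSE */

/* Horizontal max of four floats (hypothetical helper). */
static inline float horizontal_max_ps(__m128 v) {
    __m128 hi  = _mm_movehl_ps(v, v);                              /* lanes 2,3 -> 0,1    */
    __m128 m2  = _mm_max_ps(v, hi);                                /* pairwise maxima     */
    __m128 odd = _mm_shuffle_ps(m2, m2, _MM_SHUFFLE(1, 1, 1, 1));  /* lane 1 -> all lanes */
    __m128 m4  = _mm_max_ss(m2, odd);                              /* lane 0 = overall max */
    return _mm_cvtss_f32(m4);
}

/* Horizontal min of four floats, same shuffles (hypothetical helper). */
static inline float horizontal_min_ps(__m128 v) {
    __m128 hi  = _mm_movehl_ps(v, v);
    __m128 m2  = _mm_min_ps(v, hi);
    __m128 odd = _mm_shuffle_ps(m2, m2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 m4  = _mm_min_ss(m2, odd);
    return _mm_cvtss_f32(m4);
}
```

For the float vector in the question, horizontal_max_ps(_mm_setr_ps(0, 1, 2, 3)) should return 3.0f.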
Iyad Ahmed
Jens Björnhager
  • 2
    `pshufd` between `maxps` instructions has extra latency on many CPUs (including Intel). SSE3 `movshdup` will duplicate the upper float in each half of the register, so you can use it to avoid a movaps copy. – Peter Cordes Sep 09 '17 at 01:54
  • @PeterCordes, Could you write your own optimized solution? Would it be different if it was a vector of float? Thank You. – Royi Oct 10 '17 at 23:05
  • @Royi: this answer *is* for a vector of `float` (because the question is mis-titled or mixed up about float vs. integer, see my comments on the question). Optimized for which microarchitecture(s), and with which level of SSE? SSE3? Or limited to SSE2? Or AVX2? See https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86 (but replace `add` with `max`) for various optimized float and integer shuffles. – Peter Cordes Oct 10 '17 at 23:18
  • Let's say SSE4. Optimized for Haswell and above. Thank You. P. S. I meant using SSE Intrinsics, Isn't the above Assembly? – Royi Oct 10 '17 at 23:48
  • @PeterCordes, I used it as can be shown here - https://codereview.stackexchange.com/questions/177658. Is that what you meant? Any idea why is it still so slow? – Royi Oct 11 '17 at 00:58
  • @Royi: Yes, the above is assembly. Writing the same thing with intrinsics just requires some `_mm_cast` intrinsics, except that starting with `movhlps` into a different register to save a `movaps` usually requires that you have a left-over C variable, because `_mm_undefined_ps()` sometimes gets you an xor-zeroed register in some compilers, which defeats the purpose of trying to save instructions. – Peter Cordes Oct 11 '17 at 01:11
5

There is no horizontal maximum opcode in SSE (at least up until the point where I stopped keeping track of new SSE instructions).

So you are stuck doing some shuffling. What you end up with is...

movhlps %xmm0, %xmm1            # Move top two floats to lower part of %xmm1
maxps   %xmm1, %xmm0            # Get maximum of sets of two floats
pshufd  $0x55, %xmm0, %xmm1     # Move second float to lower part of %xmm1
maxps   %xmm1, %xmm0            # Get maximum of all four floats originally in %xmm0

http://locklessinc.com/articles/instruction_wishlist/

MSDN documents the intrinsic and macro function mappings:

http://msdn.microsoft.com/en-us/library/t467de55.aspx

Louis Ricci