6

I want to find the minimum/maximum value into an array of byte using SIMD operations. So far I was able to go through the array and store the minimum/maximum value into a __m128i variable, but it means that the value I am looking for is mixed among others (15 others to be exact).

I've found these discussions here and here for integer, and this page for float, but I don't understand how works _mm_shuffle*. So my questions are:

  1. What SIMD operations do I have to perform in order to extract the minimum / maximum byte (or unsigned byte) value from the __m128i variable?
  2. How does _mm_shuffle* work? I don't get it when I look to the "minimal" documentation online. I know it is related to the _MM_SHUFFLE macro, but I don't get the example.
Paul R
  • 208,748
  • 37
  • 389
  • 560
FiReTiTi
  • 5,597
  • 12
  • 30
  • 58
  • If it helps, try https://software.intel.com/sites/landingpage/IntrinsicsGuide/ For most intrinsics documented there, there's a detailed pseudo-code representation of what it does exactly – Yan Zhou Dec 06 '16 at 00:39

2 Answers2

8

Here is an example for horizontal max for uint8_t:

#include "tmmintrin.h" // requires SSSE3

__m128i _mm_hmax_epu8(const __m128i v)
{
    __m128i vmax = v;

    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 1));
    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 2));
    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 4));
    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 8));

    return vmax;
}

The max value will be returned in all elements. If you need the value as a scalar then use _mm_extract_epi8.

It should be fairly obvious how to adapt this for min, and for signed min/max.

Paul R
  • 208,748
  • 37
  • 389
  • 560
  • 1
    Thanks a lot, I will test it asap! – FiReTiTi Dec 06 '16 at 07:32
  • 2
    You can save some MOVDQA instructions by using PSHUFD (`_mm_shuffle_epi32`) for the last two shuffles, since they have granularity >= 4. If you don't need the result broadcast, then you could go in the reverse order, bringing high halves down to line up with low halves. That would allow using PSHUFLW for word shuffle, again taking advantage of the fact that it's a move+shuffle. (PALIGNR updates its destination in-place, so without AVX the compiler has to copy `vmax` so it still has the original as an input for PMAXUB). – Peter Cordes Dec 06 '16 at 14:35
  • @PeterCordes: yes, good point on the shuffles - I was assuming that this wasn't going to be performance-critical, e.g. needed only for the final step of a min/max reduction, but if it is then it might be worth refining it as you suggest. – Paul R Dec 06 '16 at 14:49
  • 1
    @PeterCordes: interestingly clang does just as you suggest with `PSHUFD`s for the last two shuffles: https://godbolt.org/g/Q7e4U4 – Paul R Dec 06 '16 at 14:53
  • 1
    Nice! It's often a good idea to look at clang's asm output, and use that to refine the C implementation to get more efficient code from other compilers. I've definitely used tricks that I hadn't thought of but which clang spotted. Anyway, even if not perf-critical, saving code-size is always potentially a win for uop-cache / I-cache reasons. Using the same instruction every time is kinda nice, but horizontal-whatever is a common enough pattern that a sequence of different integer shuffles shouldn't be too confusing for future readers. – Peter Cordes Dec 06 '16 at 15:11
2

Alternatively, convert to words and use phminposuw (not tested)

int hminu8(__m128i x)
{
  __m128i l = _mm_unpacklo_epi8(x, _mm_setzero_si128());
  __m128i h = _mm_unpackhi_epi8(x, _mm_setzero_si128());
  l = _mm_minpos_epu16(l);
  h = _mm_minpos_epu16(h);
  return _mm_extract_epi16(_mm_min_epu16(l, h), 0);
}

By my quick count, the latency is a bit worse than a min/shuffle cascade, but the throughput a bit better. The linked answer with phminposuw is probably better though. Adapted for unsigned bytes (but not tested)

uint8_t hminu8(__m128i x)
{
  x = _mm_min_epu8(x, _mm_srli_epi16(x, 8));
  x = _mm_minpos_epu16(x);
  return _mm_cvtsi128_si32(x);
}

You could use it for max too, but with a bit of overhead: complement the input and result.

harold
  • 61,398
  • 6
  • 86
  • 164