
I am trying to find the min of 8 long ints using AVX2. I am new to SIMD programming and I have no idea where to start. I did not see any post/example which explains how to carry out min and max in AVX2. I know that I cannot fit more than 4 long ints in one vector due to the 256-bit limit, but I can solve my problem in three steps. I also cannot figure out how to load the data of an already existing normal long int array into vectors for AVX2.

I know the idea behind the process. This is what I am trying to achieve:

long int nums[8] = {1, 2, 3, 4, 5, 6, 7, 8};
a = min(1,2) ; b = min(3,4) ; c = min(5,6) ; d = min(7,8)
x = min(a,b) ; y = min(c,d)
answer  = min(x,y)

Can someone help me out with how to get this to work? Also, the last min is a single operation; is it better to do that one in ordinary scalar code? Should I use something other than AVX2? (I am on an x86 system.)

g7573025
  • If your numbers fit in unsigned 16 bits, you can use the `PHMINPOSUW` instruction, which stands for **P**acked **H**orizontal **MIN**imum and **POS**ition of **U**nsigned **W**ords. I have not found a corresponding intrinsic in the [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/); [here](https://msdn.microsoft.com/en-us/library/vstudio/bb514085(v=vs.100).aspx) is one from Microsoft. There is a *VEX* version that clears the upper 128 bits of the destination register to avoid false dependencies. Maybe someone more expert can tell you a better approach. –  Jul 25 '15 at 11:00
  • 8 integers is not convenient for AVX (too few). If you add more information about higher-level code, we could help you better =) – stgatilov Jul 25 '15 at 14:25
  • The data is exactly 8 long ints and they are always large numbers. @stgatilov It's just a simple function to find the minimum, which will be called constantly. I am okay to switch to other languages if AVX is not good. – g7573025 Jul 26 '15 at 00:40

1 Answer


For x86 optimization and so on, see the links on https://stackoverflow.com/tags/x86/info. Esp. Intel's intrinsics guide, and Agner Fog's stuff.

If you always have exactly 8 elements (64 bytes), that simplifies things a lot. One of the major challenges when vectorizing small stuff is to not add too much startup/cleanup overhead handling the leftover elements that don't fill a whole vector.

AVX2 doesn't have min/max instructions for packed 64-bit ints, only for 8-, 16-, and 32-bit elements. That means you need to emulate it with a compare that generates a mask (all-0s for elements where the condition is false, all-1s where it's true), which you can AND with another vector to zero out elements. To save on actually doing the AND/ANDN and OR operations to combine things with the mask, there are blend instructions.
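
For example, a packed 64-bit signed min can be built from one compare plus one blend. Here's a minimal sketch of that idiom (the helper name mm256_min_epi64_emul is made up; the full function below uses the same pattern):

#include <immintrin.h>

// lanes where a > b become all-1s, so blendv picks b there and a elsewhere
static inline __m256i mm256_min_epi64_emul(__m256i a, __m256i b)
{
    __m256i a_gt_b = _mm256_cmpgt_epi64(a, b);
    return _mm256_blendv_epi8(a, b, a_gt_b);
}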

AVX-512 will bring a big speedup for this operation (support coming in Xeon-only Skylake parts). It adds a packed 64-bit integer min (vpminsq), exposed as _mm_min_epi64 and wider versions. There's also a library function for this operation: __int64 _mm512_reduce_min_epi64 (__m512i a). I assume this intrinsic will emit a sequence of vpminsq instructions. Intel lists it in their intrinsics finder, but it's just an Intel library function, not a machine instruction.
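
If you can target AVX-512 and your compiler provides that reduce helper, the whole reduction is a couple of lines (an untested sketch; the function name min8_avx512 is made up):

#include <stdint.h>
#include <immintrin.h>

int64_t min8_avx512(const int64_t input[8])
{
    __m512i v = _mm512_loadu_si512(input);  // all 8 elements in one 512-bit load
    return _mm512_reduce_min_epi64(v);      // compiler expands this to a short vpminsq/shuffle sequence
}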

Here's an AVX2 implementation that should work. I haven't tested it, but the compiled output looks like the right instruction sequence. I may have gotten a comparison reversed in there somewhere, so check it.

The principle of operation is: get the elementwise min of two 256b vectors. Split that into two 128b vectors and get the elementwise min of that. Then take that vector of two 64b values back to GP registers and do the final min. Max is done at the same time, interleaved with the min.

(Oops, you mentioned min/max in your question, but now I see you actually just wanted min. Removing the un-needed parts is trivial, and you can change it to return the value instead of storing results through pointers/references. A scalar version might be faster; it's better to test in the context where your app uses this operation, not in a standalone microbenchmark.)

#include <stdint.h>
#include <immintrin.h>

int64_t input[8] = { 1, 2, 3, };

#define min(a,b) \
   ({ __typeof__ (a) _a = (a); __typeof__ (b) _b = (b); \
     _a < _b ? _a : _b; })

#define max(a,b) \
   ({ __typeof__ (a) _a = (a); \
       __typeof__ (b) _b = (b); \
     _a > _b ? _a : _b; })

// put this where it can get inlined.  You don't want to actually store the results to RAM
// or have the compiler-generated VZEROUPPER at the end for every use.
void minmax64(int64_t input[8], int64_t *minret, int64_t *maxret)
{
    __m256i *in_vec = (__m256i*)input;
    __m256i v0 = in_vec[0], v1=in_vec[1];  // _mm256_loadu_si256 is optional for AVX

    __m256i gt = _mm256_cmpgt_epi64(v0, v1); // 0xff.. for elements where v0 > v1.  0 elsewhere
    __m256i minv = _mm256_blendv_epi8(v0, v1, gt);  // take bytes from v1 where gt=0xff (i.e. where v0>v1)
    __m256i maxv = _mm256_blendv_epi8(v1, v0, gt);  // input order reversed

    /* for 8, 16, or 32b:  cmp/blend isn't needed
       minv = _mm256_min_epi32(v0,v1);
       maxv = _mm256_max_epi32(v0,v1);  // one insn each instead of cmp+blend, and much faster (esp. latency)
       And at the stage of having a 128b vectors holding the min and max candidates,
       you'd shuffle and repeat to get the low 64, and optionally again for the low 32,
       before extracting to GP regs to finish the comparisons.
     */

    __m128i min0 = _mm256_castsi256_si128(minv); // stupid gcc 4.9.2 compiles this to a vmovdqa
    __m128i min1 = _mm256_extracti128_si256(minv, 1);  // extracti128(x, 0) should optimize away to nothing.

    __m128i max0 = _mm256_castsi256_si128(maxv);
    __m128i max1 = _mm256_extracti128_si256(maxv, 1);

    __m128i gtmin = _mm_cmpgt_epi64(min0, min1);
    __m128i gtmax = _mm_cmpgt_epi64(max0, max1);
    min0 = _mm_blendv_epi8(min0, min1, gtmin);
    max0 = _mm_blendv_epi8(max1, max0, gtmax);

    int64_t tmp0 = _mm_cvtsi128_si64(min0);    // tmp0 = min0.m128i_i64[0];  // MSVC only
    int64_t tmp1 = _mm_extract_epi64(min0, 1);
    *minret = min(tmp0, tmp1);  // compiles to a quick cmp / cmovg of 64bit GP registers

    tmp0 = _mm_cvtsi128_si64(max0);
    tmp1 = _mm_extract_epi64(max0, 1);
    *maxret = max(tmp0, tmp1);
}
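
Usage would just be something like this (the variable names are arbitrary):

int64_t lo, hi;
minmax64(input, &lo, &hi);   // lo = minimum of the 8 elements, hi = maximum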

This may or may not be faster than doing the whole thing in GP registers, since a 64-bit load is one uop, cmp is one uop, and cmovcc is only 2 uops (on Intel). Haswell can issue 4 uops per cycle. Until you get to the bottom of the compare tree, there's lots of independent work to do, and even then cmp is 1 cycle latency and cmov is 2. If you're interleaving the work for a min and a max at the same time, there are two separate dependency chains (or trees in this case).
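
For comparison, the scalar version is just the reduction tree from the question, using the min() macro defined above (a sketch; the function name min8_scalar is made up, and each min() can compile to a cmp/cmov pair with no loop at all):

int64_t min8_scalar(const int64_t v[8])
{
    int64_t a = min(v[0], v[1]), b = min(v[2], v[3]);
    int64_t c = min(v[4], v[5]), d = min(v[6], v[7]);
    return min(min(a, b), min(c, d));
}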

The vector version has much higher latency than throughput. If you need this operation on multiple independent sets of 8 values, the vector version is probably going to do well. Otherwise, the 5-cycle latency of pcmpgt* and the 2-cycle latency of blendv are going to hurt. If there is other independent work that can happen in parallel, then that's fine.

If you had smaller integers, pmin* (signed or unsigned; 8, 16, or 32-bit) has 1-cycle latency and 2-per-cycle throughput. For 16-bit unsigned elements only, there's even a horizontal min instruction that gives you the min element out of the 8 in one vector, as user-number-guy commented. This cuts out the whole split/min narrowing process that's needed after getting the min candidates down to fitting in one vector.
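
For illustration, if the values did fit in unsigned 16 bits, that instruction (phminposuw, SSE4.1, intrinsic _mm_minpos_epu16) does the whole horizontal reduction at once (an untested sketch; the function name min8_u16 is made up):

#include <stdint.h>
#include <immintrin.h>

uint16_t min8_u16(const uint16_t input[8])
{
    __m128i v = _mm_loadu_si128((const __m128i*)input);
    __m128i r = _mm_minpos_epu16(v);        // min value in bits [15:0], its index in bits [18:16]
    return (uint16_t)_mm_cvtsi128_si32(r);  // low 16 bits = the minimum
}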

Peter Cordes
  • There is a thing that I cannot understand. Since extracting the lower half can be done without any operation, shouldn't compilers automatically replace the extract intrinsic with a cast? The argument is an immediate, i.e. a compile-time constant. – stgatilov Jul 25 '15 at 14:59
  • Oh, it looks like gcc 4.9.2 does compile it the same as a cast. (and the arg always has to be a compile-time constant, since it has to go in an `imm8` in the instruction.) IDK if compilers have always been smart like that, or if there's a risk you could generate worse code on other (esp. older) compilers. I tested for `extracti128` (still a useless `vmovdqa` instead of referencing the old value) and `extract_epi64` (`vmovq`). – Peter Cordes Jul 25 '15 at 15:05
  • clang 3.5 does the same (optimizes `extract(..., 0)`), and doesn't have gcc's problem of emitting useless `vmovdqa %xmm4,%xmm3` for `extracti128(0)`. Even at `-O0`. It still feels weird to write the intrinsic for an instruction I don't want emitted. In asm, you can write `vpextrq $0, %xmm0, %rcx`, and it will work like `vmovq` but slower. – Peter Cordes Jul 25 '15 at 15:14
  • As for me it is weird to use entirely different functions to access 0-th and 1-th halves of the pair. I imagine it may cause issues in template code, when you do not want to write if-s to check if index is zero or not. – stgatilov Jul 25 '15 at 18:06
  • Yeah, that's a good argument. Until you brought it up, I just assumed the intrinsics mapped directly to asm instructions (other than load/store). Intel's intrinsics guide doesn't say that `extract_epi64` can sometimes be `movq`. `pextrq` is the only instruction listed. Obviously this is a good optimization that makes better code, and lets you be more consistent when writing your code. – Peter Cordes Jul 25 '15 at 18:14
  • So is it better to do something other than AVX2, such as SSE? Sorry for asking too many questions, and thanks for all your help. I always have 8 long ints and they are large numbers, so I cannot process them with fewer bits. – g7573025 Jul 26 '15 at 00:39
  • @g7573025: I'm not sure which will be more efficient, and it might depend on the surrounding code. Your obvious version (4 mins, then 2 mins, then 1) in the question should produce good code for a non-vector version. So I suggest trying my AVX2 version and your scalar version. AVX2 *is* SSE, with some extra stuff, and double the vector width. http://stackoverflow.com/tags/avx/info. So no, an SSE-only version won't be better. (It would mean an extra 2 `pcmpgt / pblendv` steps instead of a `vextracti128`.) – Peter Cordes Jul 26 '15 at 00:59
  • Is it possible to also get the minimum position in the array? I just noticed that I need the position of the min after the call, and this will only give the min. I believe it will be fine if we don't have the min and get the position instead, as we can easily get the min from that. Also, will this be faster than a linear for loop? – g7573025 Jul 26 '15 at 01:17
  • It'll be faster to use scalar code to get the min position. The only thing I can think of for vectors is to get the min with vectors, and then use a vector `_mm_cmpeq_epi64` -> `movmsk` to find an element matching the min (see the sketch after these comments). Or `blendv` vectors of indices as well as vectors of values using the same mask from `cmpgt_epi64`. Either way is kinda silly. Since you have a small fixed-length set of elements to search, the code won't need any loops, just an optimal sequence of `cmp / cmov`, which is quite fast. – Peter Cordes Jul 26 '15 at 01:24
  • btw, I deleted my AVX-512 comment after noticing that the `reduce` to a GP reg function is *not* a single-instruction, but rather an Intel library function. AVX-512 will still bring a big speedup for min-value (not min-pos), because of adding 64bit int packed-min. – Peter Cordes Jul 26 '15 at 01:29
  • The main reason scalar is not a bad choice here is that a vector can't hold very many 64bit integers. If you were talking about 8bit, or even 32bit, then vector code would be replacing twice as many scalar ops. And I think I remember seeing a technique for tracking indices when sorting, based on using a compare-equal between a vector `min` and one of the inputs, and using that to select indices (http://stackoverflow.com/questions/31486942/sorting-64-bit-structs-using-avx/31487698). That could work, but again, for 64bit ints with AVX2 available, I think scalar code will probably be best. – Peter Cordes Jul 26 '15 at 01:35
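
For completeness, here is roughly what the cmpeq/movemask idea from the comments above could look like (a rough, untested sketch; min_position is a made-up name, and minval is assumed to be the minimum found earlier):

#include <stdint.h>
#include <immintrin.h>

int min_position(const int64_t input[8], int64_t minval)
{
    __m256i vmin = _mm256_set1_epi64x(minval);
    __m256i v0 = _mm256_loadu_si256((const __m256i*)input);
    __m256i v1 = _mm256_loadu_si256((const __m256i*)(input + 4));
    // one mask bit per 64-bit lane: set where the lane equals the minimum
    unsigned m0 = _mm256_movemask_pd(_mm256_castsi256_pd(_mm256_cmpeq_epi64(v0, vmin)));
    unsigned m1 = _mm256_movemask_pd(_mm256_castsi256_pd(_mm256_cmpeq_epi64(v1, vmin)));
    unsigned mask = m0 | (m1 << 4);   // 8 bits, one per element
    return __builtin_ctz(mask);       // index of the first matching element (GNU builtin)
}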