14

I'm thinking of implementing 8-ary heapsort for uint32_t's. To do this I need a function that selects the index of maximum element in a 8-element vector so that I can compare it with parent element and conditionally perform swap and further siftDown steps.

(8 uint32_ts can be changed eg to 16 uint32_ts or 8 uint64_t or whatever x86 SIMD could support efficiently).

I have some ideas on how to do that but I'm looking for something faster than non-vectorized code, especially I'm looking for something that would enable me to do fast heapsort.

I have clang++ 3.3 and Core i7-4670 so probably I should be able to use even the newest x86 SIMD thingies.

(BTW: that's a part of a bigger project: https://github.com/tarsa/SortingAlgorithmsBenchmark and there's for example quaternary heapsort so after implementing SIMD heapsort I could instantly compare them)

To repeat - question is: what's the most efficient way to compute index of maximum element in x86 SIMD vector?

PS: It's not a duplicate of linked questions - notice that I'm asking for an index of a maximum element, not just the element value.

Wibowit
  • 371
  • 2
  • 11
  • 1
    possible duplicate of [Getting max value in a \_\_m128i vector with SSE?](http://stackoverflow.com/questions/9877700/getting-max-value-in-a-m128i-vector-with-sse) or [How to find the horizontal maximum in a 256-bit AVX vector](http://stackoverflow.com/questions/9795529/how-to-find-the-horizontal-maximum-in-a-256-bit-avx-vector) – hivert May 11 '14 at 08:42
  • No, it's not a duplicate. Notice that I'm looking for an index of maximum element, not just for it's value. – Wibowit May 11 '14 at 08:52
  • I guess you can just use __m128 or __m256. – Damian May 11 '14 at 09:08
  • What means "just use"? I'm looking for index of maximum element in an SIMD vector. – Wibowit May 11 '14 at 09:24
  • 1
    I don't think the fact you want the index makes much difference to the answer - SSE instructions sets are built around vertical, not horizontal computations. – Flexo May 11 '14 at 10:44
  • 2
    I've just discovered a horizontal search instruction in SSE4.1: PHMINPOSUW. Probably that could do the trick, but now the problem is eg doing 32-bit max selection using two 16-bit max selections + some masking. – Wibowit May 11 '14 at 10:55
  • Although similar to the linked question(s), I don't think this counts as a duplicate, as determining the index is a sightly different problem - voting to reopen. – Paul R May 25 '14 at 06:53

3 Answers3

13

Horizontal operations are bad news with SIMD, and particularly so with AVX, where most 256 bit instructions are actually broken into two separate 128 bit operations. Having said that, if you really must do a horizontal 32 bit max across 8 elements then I think the general approach would have to be:

  • find the max value (typically several shift/permute and max operations)
  • splat the max value across all 8 elements of a second vector (can be combined with previous operation)
  • do a compare between the original vector and the max vector (_mm256_cmpeq_epi32)
  • extract scalar mask (_mm256_movemask_epi8)
  • convert scalar mask to index

Here is a first pass at an AVX2 implementation I just put together - I tested it and benchmarked it on a 2.6 GHz Haswell and it runs at around 1.7 ns / vector (including loading the vector and storing the resulting index):

uint8_t _mm256_hmax_index(const __m256i v)
{
    __m256i vmax = v;

    vmax = _mm256_max_epu32(vmax, _mm256_alignr_epi8(vmax, vmax, 4));
    vmax = _mm256_max_epu32(vmax, _mm256_alignr_epi8(vmax, vmax, 8));
    vmax = _mm256_max_epu32(vmax, _mm256_permute2x128_si256(vmax, vmax, 0x01));

    __m256i vcmp = _mm256_cmpeq_epi32(v, vmax);

    uint32_t mask = _mm256_movemask_epi8(vcmp);

    return __builtin_ctz(mask) >> 2;
}
Paul R
  • 208,748
  • 37
  • 389
  • 560
  • I have used the code in https://github.com/tarsa/SortingAlgorithmsBenchmark/blob/f6c8bf1fc0463b9f4c5b81b4da9d9fa4190e5801/sortalgo/sortheapsimddwordcascadingvariantb.hpp and https://github.com/tarsa/SortingAlgorithmsBenchmark/blob/f6c8bf1fc0463b9f4c5b81b4da9d9fa4190e5801/sortalgo/sortheapsimddwordvariantb.hpp and it seems to work nicely. I hope I can use your code without worrying, can I? – Wibowit May 24 '14 at 22:05
  • You mean in terms of copyright ? Sure - anything I post on StackOverflow is effectively public domain (although attribution is always nice of course). – Paul R May 25 '14 at 06:51
  • So what should I add to the source code? Link to this post? – Wibowit May 25 '14 at 16:51
  • Sure - if it's convenient - no obligation though. – Paul R May 25 '14 at 16:55
  • 2
    Link added: https://github.com/tarsa/SortingAlgorithmsBenchmark/commit/e57f98a0ba1cbaaa45a4dcf0acd73be45f6923d8 – Wibowit May 25 '14 at 17:40
  • Good solution. One advice is that change the divide operation from the last to bit right shift operation for better efficiency. – rookiepig May 31 '16 at 02:33
  • @rookiepig: thanks - most compilers are smart enough to convert the divide to a shift in this context, but it doesn't hurt to make it explicit I guess - answer updated. – Paul R May 31 '16 at 05:37
6

The most efficient way to do a horizontal operation (dot product, sum, max-index, whatever) on an n-way SIMD vector is to do n of them at once, by transposing them and using vertical ops instead. Certain SIMD architectures have better support for horizontal ops, but in general the blockwise transposed approach will be more flexible and efficient.

Sneftel
  • 40,271
  • 12
  • 71
  • 104
2

Using the Vc library I'd write:

size_t maximumIndex(Vc::uint_v vec) {
  const unsigned int max = vec.max();
  return (max == vec).firstOne();
}

With intrinsics it should be something along these lines (this is AVX without AVX2 - with AVX2 it becomes slightly easier):

size_t maximumIndex(_mm256i vec) {
  __m128i lo = _mm256_castsi256_si128(vec);
  __m128i hi = _mm256_extractf128_si256(vec, 1);
  __m128i tmp = _mm_max_epu32(lo, hi);
  tmp = _mm_max_epu32(tmp, _mm_shuffle_epi32(tmp, _MM_SHUFFLE(1, 0, 3, 2)));
  tmp = _mm_max_epu32(tmp, _mm_shufflelo_epi16(tmp, _MM_SHUFFLE(1, 0, 3, 2))); // using lo_epi16 for speed here
  const int max = _mm_cvtsi128_si32(tmp);
  tmp = _mm_packs_epi16(_mm_packs_epi32(_mm_cmpeq_epi32(_mm_set1_epi32(max), lo),
                                        _mm_cmpeq_epi32(_mm_set1_epi32(max), hi)),
                        _mm_setzero_si128());
  return _bit_scan_forward(_mm_movemask_epi8(tmp));
}

BTW, if you want some inspiration from SIMDized merge-sort look here: http://code.compeng.uni-frankfurt.de/projects/vc/repository/revisions/master/entry/src/avx_sorthelper.cpp

Vir
  • 449
  • 5
  • 5
  • That sorthelper is only supposed to sort a single vector? It seems it is not a general sort algorithm for arbitrary sized arrays. – Wibowit May 11 '14 at 16:02
  • Also, can I steal the you've posted here? :) – Wibowit May 11 '14 at 16:39
  • With merge-sort that's just the first step. You can continue from there to implement merge-sort on larger data-sets. – Vir May 11 '14 at 19:56
  • And yes, feel free to take/use my code. Vc code, BTW, is LGPL3 (will be BSD in the future). – Vir May 11 '14 at 19:59
  • @Vir: I fixed a couple of typos in your code and tested it but it doesn't seem to give correct results - maybe my fixes are not correct - could you take look and see if I messed something up ? I'd like to benchmark this against a scalar implementation and my AVX2 implementation if I can get it working. – Paul R May 11 '14 at 22:35
  • @Vir: it's OK - I found the problem - I was testing with unsigned data (OP is working with `uint32_t`) - I changed `_mm_max_epi32` to `_mm_max_epu32` in your code and that made it work OK in my test harness. FYI it runs at around 2.2 ns / vector on my 2.6 GHz Haswell. – Paul R May 11 '14 at 22:41
  • 1
    Thanks for the fixes. I overlooked the “unsigned” detail... :-) – Vir May 12 '14 at 06:21
  • I've added a link to this post in the source code: https://github.com/tarsa/SortingAlgorithmsBenchmark/commit/51bbcd89953b57bfec24c31d80d9f9bd3655ad6d Would that be OK? – Wibowit May 25 '14 at 17:54