5

Is there an instruction or efficient branchless sequence of instructions to figure out the INDEX of (not the value of) the largest (or smallest) element of an unordered (unsorted) ZMM?

Data type doesn't matter- i'm more interested to know if there's a usage pattern for this established.


A related problem with a known solutions is, with a strictly ordered ZMM, one may use CMPPS, MOVMSKPS, and TZCNT to get the index of where an outside element WOULD fit into this list (i.e. BSEARCH)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Veldaeven
  • 126
  • 5
  • 1
    Other than https://www.felixcloutier.com/x86/phminposuw there are no horizontal min/max. phminposuw with some work to transform the input can give you the position of the max, or signed min or max 16-bit element, but only ever 16-bit elements. And only in the low 128-bit lane; there is no AVX2 or AVX-512 YMM / ZMM version. – Peter Cordes Mar 13 '21 at 12:55
  • 4
    IIRC, AArch64 has some nice horizontal min/max stuff for various element sizes, but x86 doesn't. AFAIK the best way is to shuffle / vertical max log(n) times, same reduction pattern as horizontal sum: [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026). Then compare for equal / movmsk / bit-scan for the position. (For byte elements you could widen to words for `phminposuw`.) – Peter Cordes Mar 13 '21 at 12:57
  • Thank you again, Peter. I am doing the same horizontal shuffling for the max, but i don't care about the index of the max. the min- i need both. i've decided to do the check in a "looser" (opposite of "tighter") loop where the branch will be the least of my worries (crossing interop bounds) – Veldaeven Mar 13 '21 at 13:13

1 Answers1

1

Broadcast the minimum (or maximum) element over the complete vector, compare vectors for equality, use movemask instruction to convert to bitmap, then count trailing zeroes in the bitmap.

Example for FP32 lanes in SSE vector:

uint32_t minpos_ps( __m128 vec )
{
    // Broadcast minimum value in the vector with a few shuffles
    __m128 i = _mm_min_ps( vec, _mm_permute_ps( vec, _MM_SHUFFLE( 1, 0, 3, 2 ) ) );
    i = _mm_min_ps( i, _mm_permute_ps( i, _MM_SHUFFLE( 2, 3, 0, 1 ) ) );

    // Compare lanes for equality with the minimum
    uint32_t mask = (uint32_t)_mm_movemask_ps( _mm_cmpeq_ps( vec, i ) );

    // Return index of the smallest set bit in the mask
    return std::countr_zero( mask );
}

More complicated example, for unsigned bytes in 32-byte AVX vector:

uint32_t minpos_epu8( __m256i vec )
{
    __m256i i = _mm256_min_epu8( vec, _mm256_permute2x128_si256( vec, vec, 1 ) );
    i = _mm256_min_epu8( i, _mm256_shuffle_epi32( i, _MM_SHUFFLE( 1, 0, 3, 2 ) ) );
    i = _mm256_min_epu8( i, _mm256_shuffle_epi32( i, _MM_SHUFFLE( 2, 3, 0, 1 ) ) );
    // If you calling this in a loop where compiler can preload constant vectors,
    // replace shuffles and shifts below with _mm256_shuffle_epi8
    __m256i tmp = _mm256_shufflehi_epi16( i, _MM_SHUFFLE( 2, 3, 0, 1 ) );
    tmp = _mm256_shufflelo_epi16( tmp, _MM_SHUFFLE( 2, 3, 0, 1 ) );
    i = _mm256_min_epu8( i, tmp );
    tmp = _mm256_or_si256( _mm256_slli_epi16( i, 8 ), _mm256_srli_epi16( i, 8 ) );
    i = _mm256_min_epu8( i, tmp );

    uint32_t mask = (uint32_t)_mm256_movemask_epi8( _mm256_cmpeq_epi8( vec, i ) );

    return std::countr_zero( mask );
}

The std::countr_zero standard library function requires C++/20.

If you don't yet have that version, replace with _tzcnt_u32, _BitScanForward, or __builtin_ctz intrinsics depending on compiler and target platform.

Soonts
  • 20,079
  • 9
  • 57
  • 130