Is there a faster way with AVX to find the horizontal minimum or maximum of a vector of 32-bit floats? Currently I use the code below, which I adapted from this answer that worked on double-precision values:

#include <immintrin.h>

static inline float fast_hMax_ps(__m256 a){
    const __m256 permHalves = _mm256_permute2f128_ps(a, a, 1); // swap the 128-bit halves to compare floats from different halves.
    const __m256 m0 = _mm256_max_ps(permHalves, a); // compares 4 values with 4 other values ("old half against the new half")

    // now we need to find the largest of the 4 values in each half:
    const __m256 perm0 = _mm256_permute_ps(m0, 0b01001110); // swap the two 64-bit pairs within each half
    const __m256 m1 = _mm256_max_ps(m0, perm0);

    const __m256 perm1 = _mm256_permute_ps(m1, 0b10110001); // swap adjacent floats within each pair
    const __m256 m2 = _mm256_max_ps(perm1, m1);
    // largest float32 from the entire vector. All entries are the same, so just grab element 0.
    // Extract it with intrinsics instead of casting to float*, which would violate strict aliasing.
    return _mm_cvtss_f32(_mm256_castps256_ps128(m2));
}
Kari
  • Personally I'd prefer a more *readable* version. Never mind the performance. – Jesper Juhl Feb 16 '20 at 16:24
  • If you want a scalar float result, narrow to 128-bit vectors as the first step. As explained in [Fastest way to do horizontal SSE vector sum on x86](//stackoverflow.com/q/6996764), use [How to sum \_\_m256 horizontally?](//stackoverflow.com/q/13219146) with max instead of add. – Peter Cordes Feb 16 '20 at 16:28
  • Also, aliasing a `float*` onto a `__m256` object is strict-aliasing UB. `float*` isn't a "may_alias" type the way `char*` and `__m256*` are. – Peter Cordes Feb 16 '20 at 16:34
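
The two suggestions in the comments above (narrow to 128 bits first, and avoid the `float*` cast) could be combined roughly as follows. This is only a minimal sketch, not a benchmarked answer: the names `hmax_ps_narrow`, `hmax8_load` and the `HMAX_TARGET` macro are made up for illustration, and the `target` attribute assumes GCC or Clang (with other compilers, enable AVX through compiler flags instead).

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. The target attribute lets these functions use AVX
   even when the file is not built with -mavx. HMAX_TARGET is a hypothetical
   helper macro, not part of any library. */
#if defined(__GNUC__) && !defined(__AVX__)
#define HMAX_TARGET __attribute__((target("avx")))
#else
#define HMAX_TARGET
#endif

/* Horizontal max of 8 floats: reduce 256 -> 128 bits first, then finish
   with cheaper 128-bit shuffles. */
HMAX_TARGET
static inline float hmax_ps_narrow(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);    /* elements 0..3 (no instruction needed) */
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* elements 4..7 */
    __m128 m  = _mm_max_ps(lo, hi);           /* 4 candidates left */

    __m128 shuf = _mm_movehl_ps(m, m);        /* bring elements 2,3 down to 0,1 */
    m = _mm_max_ps(m, shuf);                  /* 2 candidates left in elements 0,1 */

    shuf = _mm_movehdup_ps(m);                /* copy element 1 into element 0 */
    m = _mm_max_ss(m, shuf);                  /* result in element 0 */

    return _mm_cvtss_f32(m);                  /* extract without pointer aliasing */
}

/* Hypothetical convenience wrapper: load 8 floats from memory (unaligned is
   fine) and return their maximum. */
HMAX_TARGET
static inline float hmax8_load(const float *p)
{
    return hmax_ps_narrow(_mm256_loadu_ps(p));
}
```

For example, calling `hmax8_load` on an array holding `{1, 9, 3, 4, -2, 7, 0, 5}` returns `9.0f`. Swapping `_mm_max_ps`/`_mm_max_ss` for `_mm_min_ps`/`_mm_min_ss` gives the horizontal minimum. Whether this is actually faster than the version in the question depends on the CPU and on how the result is used, so it is worth measuring.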

0 Answers