Is there a faster way with AVX to find the horizontal minimum or maximum of a vector of 32-bit floats? Currently I use the code below, which I adapted from this answer (it was written for double precision):
#include <immintrin.h>

static inline float fast_hMax_ps(__m256 a){
    // Swap the 128-bit halves so floats from different halves can be compared.
    const __m256 permHalves = _mm256_permute2f128_ps(a, a, 1);
    const __m256 m0 = _mm256_max_ps(permHalves, a); // 4 values against the other 4; both halves of m0 are now identical
    // Now find the largest of the 4 values within each half:
    const __m256 perm0 = _mm256_permute_ps(m0, _MM_SHUFFLE(1, 0, 3, 2)); // 0b01001110: swap the two pairs
    const __m256 m1 = _mm256_max_ps(m0, perm0);
    const __m256 perm1 = _mm256_permute_ps(m1, _MM_SHUFFLE(2, 3, 0, 1)); // 0b10110001: swap neighbours
    const __m256 m2 = _mm256_max_ps(perm1, m1);
    // Every element now holds the maximum, so return element 0 without pointer casting.
    return _mm_cvtss_f32(_mm256_castps256_ps128(m2));
}