1

I need to find the highest and lowest value in an array of floats. Optionally, I want to be able to skip members of the array and evaluate only every 2nd, 4th, 8th, etc. element:

float maxValue = 0.0;
float minValue = 0.0;

int index = 0;
int stepwith = 8;
    
for( int i = 0; i < pixelCount; i++ )
{
    float value = data[index];
        
    if( value > maxValue )
            maxValue = value;
        
    if( value < minValue )
            minValue = value;
        
    index += stepwidth;
    if( index >= dataLength )
        break;
}

That seems to be the fastest way without using other magic.

So I tried other magic, namely the vIsmax() and vIsmin() functions from vecLib (included in OSX' Accelerate framework) which apparently uses processor-level acceleration of vector operations:

int maxIndex = vIsmax( pixelCount, data );
int minIndex = vIsmin( pixelCount, data );

float maxValue = data[maxIndex];
float minValue = data[minIndex];

It is very fast but doesn't allow skipping values (the functions don't offer a 'stride' argument). This makes it actually slower than my first code because I can easily skip every 8th value.

I even found a third way with BLAS which implements a similar function:

cblas_isamax( const int __N, const float *__X, const int __incX )

with __incX = stride to skip values, but it isn't fast at all and only finds absolute maxima which doesn't work for me.

So my question: can anyone think of another way to accelerate this?

Jason Aller
  • 3,541
  • 28
  • 38
  • 38
Sebastian
  • 904
  • 1
  • 8
  • 16
  • You could write SIMD code (using intrinsics) which gets the min and max in one pass. The stride would be tricky though, but if you have specific values of stride that you care about then you could specialise for those cases. – Paul R Sep 19 '16 at 14:24
  • 1
    Interesting, I didn't know I could do that... Thanks! Led me back to this question wich seems like a good starting point: http://stackoverflow.com/questions/15238978/sse3-intrinsics-how-to-find-the-maximum-of-a-large-array-of-floats – Sebastian Sep 19 '16 at 20:02

3 Answers3

1

Following the suggestion in the comment I looked into Intel intrinsics and came up with this code. Fair warning: this is my very first approach to intrinsics and is highly experimental. It works though:

#include <emmintrin.h>

void vec_minmax( float * inData, int length, float * outMin, float * outMax )
{
    // In each iteration of the loop we will gather 8 from 64
    // values (only fetching every 8th value).

    // Build an index set that points to 8 consecutive floats.
    // These indexes will later be spread up by factor 8 so they
    // point to every 8th float.
    // NOTE: these indexes are bytes, in reverse order.
    __m256i vindex = _mm256_set_epi32( 28, 24, 20, 16, 12, 8, 4, 0 );

    // Gather the first 8 floats.
    __m256 v_min = _mm256_i32gather_ps( inData, vindex, 8 );
    __m256 v_max = v_min;

    for( int i = 64; i < length; i += 64 )
    {
        // gather the next set of floats.
        __m256 v_cur = _mm256_i32gather_ps(( inData + i ), vindex, 8 );

        // Compare every member and store the results in v_min and v_max. 
        v_min = _mm256_min_ps( v_min, v_cur );
        v_max = _mm256_max_ps( v_max, v_cur );
    }

    // Store the final result in two arrays.
    float * max8;
    float * min8;

    posix_memalign( (void **)&min8, 32, ( 8 * sizeof( float )));
    posix_memalign( (void **)&max8, 32, ( 8 * sizeof( float )));

    _mm256_store_ps( min8, v_min );
    _mm256_store_ps( max8, v_max );

    // Find the min/max value in the arrays.
    * outMin = min8[0];
    * outMax = max8[0];
    for( int i = 0; i < 8; i++ )
    {
        if( min8[i] < * outMin )
            * outMin = min8[i];

        if( max8[i] > * outMax )
            * outMax = max8[i];
    }
}

So this function finds the min and max value in a set of floats, examining only every 8th value which is enough precision for my needs.

Unfortunately, it is not significantly faster than the trivial scalar approach with a simple for-loop and two if-statements (like shown above). At least not with a stride of 8.

Paul R
  • 208,748
  • 37
  • 389
  • 560
Sebastian
  • 904
  • 1
  • 8
  • 16
  • The AVX2 gather instructions are pretty inefficient (see [this question](http://stackoverflow.com/questions/24756534/in-what-situation-would-the-avx2-gather-instructions-be-faster-than-individually/24757370#24757370)). I would suggest doing a simple implementation for the non-strided variant, and then perhaps one or more special implementations for the strided variants, probably using normal loads and shuffle/permute instructions to assemble each vector. SSE would probably be easier than AVX/AVX2 for this. – Paul R Sep 20 '16 at 14:45
0

Here is an implementation for the stride = 8 case, using SSE. I've tested the code but I haven't had time to benchmark it yet. I'm not entirely confident that it will be any faster than a scalar implementation, but it's worth a try...

#include <tmmintrin.h>
#include <float.h>

void vec_minmax_stride_8(const float * inData, int length, float * outMin, float * outMax)
{
    __m128i vmax = _mm_set1_ps(-FLT_MAX);
    __m128i vmin = _mm_set1_ps(FLT_MAX);

    for (int i = 0; i < length; i += 32)
    {
        __m128i v0 = _mm_loadu_ps(&inData[i]);
        __m128i v1 = _mm_loadu_ps(&inData[i + 8]);
        __m128i v2 = _mm_loadu_ps(&inData[i + 16]);
        __m128i v3 = _mm_loadu_ps(&inData[i + 24]);

        v0 = _mm_shuffle_ps(v0, v1, _MM_SHUFFLE(0, 0, 0, 0));
        v2 = _mm_shuffle_ps(v2, v3, _MM_SHUFFLE(0, 0, 0, 0));
        v0 = _mm_shuffle_ps(v0, v2, _MM_SHUFFLE(2, 0, 2, 0));

        vmax = _mm_max_ps(vmax, v0);
        vmin = _mm_min_ps(vmin, v0);
    }

    vmax = _mm_max_ps(vmax, _mm_alignr_epi8(vmax, vmax, 4));
    vmin = _mm_min_ps(vmin, _mm_alignr_epi8(vmin, vmin, 4));

    vmax = _mm_max_ps(vmax, _mm_alignr_epi8(vmax, vmax, 8));
    vmin = _mm_min_ps(vmin, _mm_alignr_epi8(vmin, vmin, 8));

    _mm_store_ss(outMax, vmax);
    _mm_store_ss(outMin, vmin);
}
Paul R
  • 208,748
  • 37
  • 389
  • 560
  • Benchmark (processing 100,000,000 random floats): vec_minmax_stride_8: 0.04149 sec (100.0%), vec_minmax: 0.032326 sec (77.9%), scalar: 0.038621 sec (93.1%). All three functions use a stride of 8. Still, your vec_minmax_stride_8 function looks very elegant. Maybe it could be faster if upgraded to 256 bit vectors? – Sebastian Sep 20 '16 at 19:23
  • Thanks for feedback and sorry the SSE version didn't help. AVX is a curate's egg unfortunately, and double width operations do not necessarily result in any throughput improvement. I suspect we're mainly memory bandwidth limited given the size of the data set. I would expect a non-strided SIMD routine and maybe also similar with stride=2 could be worthwhile, but when you get to stride>=4 then it's purely down to how fast you can load the cache lines. – Paul R Sep 20 '16 at 19:39
  • One possible idea - do you always process the same data at more than one different stride ? If so we could generate max/min for two or more different strides in one pass, which would amortise the cost of the memory loads. – Paul R Sep 20 '16 at 19:40
  • 1
    No, I only go thru it once. And the stride has to be variable and might have to be even larger than 8. I suspect that a trivial scalar approach combined with a clever scheme to pick the highest possible stride is the fastest way to do it. But cool stuff anyway! Looking forward to learn more about SIMD. – Sebastian Sep 20 '16 at 20:47
  • Oh well - we tried - I suggest SIMD for non-strided and stride=2 then, and drop back to scalar for everything else. Good luck! – Paul R Sep 20 '16 at 20:53
0

Except where the amount of computation is high -- this example is not such a case -- strides are typically death for vector architectures. Vector loads and stores do not work that way, and it is a lot, lot of work to load the lanes individually. You are usually better off with scalar in such cases, though some cleverness might let you beat scalar in some cases by small margins.

The way you go fast here with vector intrinsics is to find the min/max for several positions at once. For example, if we had a RGBA floating point image, then find the min/max for r,g,b,a all at the same time, and return four mins and four maxes at the end. It's not much faster than the code you have, but you get more work out of it -- assuming the work is needed.

Another method is to keep a decimated copy of your data set around and run the filter over the reduced sized variants as needed. This will use more memory, but with factor of two decimation, it is less than twice as much (1/3 for 2D, c.f. mipmaps). Here again, it only useful if you intend to do this a lot.

Ian Ollmann
  • 1,592
  • 9
  • 16