
The code below extracts the red, green, blue, and alpha channels of a pixel value and does some arithmetic against another set of RGBA values. It seems the code is slow around the logic where it performs the squaring and addition.

Would it be possible to replace it with a faster version? This logic doesn't seem to be using SIMD capabilities at all.

typedef struct {
    unsigned char b, g, r, a;
} pixel;

register pixel *pPixel;
register int i, red1, green1, blue1, alpha1;
register int red2, green2, blue2, alpha2;
register long oldD, newD;

red1 = GetRed( *pPixel );
green1 = GetGreen( *pPixel );
blue1 = GetBlue( *pPixel );
alpha1 = GetAlpha( *pPixel );
oldD = 2000000000;
for ( i = 0; i < newcolors; ++i ) {
    red2 = GetRed( mycolormap[i].acolor );
    green2 = GetGreen( mycolormap[i].acolor );
    blue2 = GetBlue( mycolormap[i].acolor );
    alpha2 = GetAlpha( mycolormap[i].acolor );

    newD = ( red1 - red2 ) * ( red1 - red2 ) +
           ( green1 - green2 ) * ( green1 - green2 ) +
           ( blue1 - blue2 ) * ( blue1 - blue2 ) +
           ( alpha1 - alpha2 ) * ( alpha1 - alpha2 );
    if ( newD < oldD ) {
        oldD = newD;
    }
}

The section of code below seems to require improvement:

newD = ( red1 - red2 ) * ( red1 - red2 ) +
       ( green1 - green2 ) * ( green1 - green2 ) +
       ( blue1 - blue2 ) * ( blue1 - blue2 ) +
       ( alpha1 - alpha2 ) * ( alpha1 - alpha2 );
Helena
  • Which CPU / architecture? – Sebastian May 06 '22 at 06:24
  • @Sebastian : Windows 10 64-bit – Helena May 06 '22 at 06:30
  • ARM, Intel, AMD? Which CPU? – Sebastian May 06 '22 at 06:36
  • @Sebastian : AMD Ryzen but the code will run on Intel PCs as well. – Helena May 06 '22 at 06:40
  • SIMD registers could have 256 bits (typically 128..512). You could store 32 unsigned char or 16 short values in one register. You would not combine r, g, b, a in one register (called horizontal), but keep different pixels of the same channel per register (called vertical). To take the difference you could either: a) use unsigned char and calculate max(a, b) - min(a, b); b) use unsigned char and calculate the saturated subtractions subs(a, b) and subs(b, a) and combine them (as one of the two is 0) with one of +, or, xor; c) zero-extend to unsigned short, interpret as signed short and just calculate a - b – Sebastian May 06 '22 at 06:47
  • @Sebastian : Could you assist with a sample code example so that i could profile? – Helena May 06 '22 at 06:48
  • In a second step multiply the differences by themselves to get signed or unsigned short values. Then add up the r, g, b, and a differences to an int. Take the maximum (still with SIMD) of this sum and an oldD difference. After the loop you calculate a horizontal maximum to get one final value. – Sebastian May 06 '22 at 06:50
  • @Sebastian : Can you share a code example? You could modify the code i shared. – Helena May 06 '22 at 06:52
  • Not enough time at the moment. Perhaps you can start to give it a try. – Sebastian May 06 '22 at 06:57
  • @Sebastian - Sorry but your explanation isn't clear. No worries you can focus on your work. – Helena May 06 '22 at 07:00
  • You're probably going to want `pmovzx`, `psubw`, and `pmaddwd` as a first step in horizontally summing squared differences within a pixel. Then another shuffle/paddd, then pminud to update a vector of smallest-seen-difference, which you horizontally reduce at the end. ([Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026)) Unfortunately can't use `psadbw` because you need squared differences, not sum of absolute differences. – Peter Cordes May 06 '22 at 07:09
  • Did you use a profiler to see where exactly in the generated assembly code it's slow? I'm pretty sure it's not the arithmetic but rather the various calls to the ```GetXyz(...)``` functions. Is the compiler able to inline them? Are you compiling with optimizations? Look at the disassembly. – Jonathan S. May 06 '22 at 09:23
  • We will have difficulty giving specific advice without a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). Also, this question is posted with a C++ tag but the code looks more like C. There is a C-style typedef of a structure. Also, the loop has separate size (`newcolor`) and array of structures (`mycolormap`) instead of iterating over a `std::vector`. I suggest you create a benchmark, possibly with [Google Benchmark](https://github.com/google/benchmark), then post your updated code into the question. We can then give specific advice. – Daniel Dearlove May 06 '22 at 12:41
  • `register int` why on earth do you use it? [The `register` keyword has been deprecated long ago](https://stackoverflow.com/q/20618008/995714) and compilers have ignored it for decades. Do you think x86 even has enough registers to store that huge number of variables? The compiler does better register allocation than you – phuclv Jun 05 '22 at 03:32

1 Answer


It’s harder than it seems. Unfortunately for you, automatic vectorizers in C++ compilers very rarely do a good job on integer arithmetic like you have there.

The following implementation only needs SSE 4.1. If you have AVX2, it's possible to improve it substantially by upgrading all these vectors to 32-byte ones; however, that will complicate a couple of things, namely the remainder handling and the final reduction.

I assumed you want not only the minimum dot product but also the index of the pixel. If you only want the minimum dot product, remove the bestIndices field and the code which handles that field.

#include <climits>
#include <cstdint>
#include <smmintrin.h> // SSE2 + SSE4.1 intrinsics

struct alignas( 4 ) Pixel
{
    uint8_t b, g, r, a;
};

// Define __SSE4_1__ macro when building with MSVC for AVX1 or newer ISA
#if defined( _MSC_VER ) && defined( __AVX__ ) && !defined( __SSE4_1__ )
#define __SSE4_1__ 1
#endif

size_t findClosestPixel( const Pixel& ref, const Pixel* rsi, size_t length, int& bestValue )
{
    if( 0 == length )
    {
        bestValue = INT_MAX;
        return ~(size_t)0;
    }

    class Acc
    {
        // The reference pixel we're after, broadcasted and split into low/high pieces in 16-bit lanes
        __m128i lowRef, highRef;
        // The best dot product so far
        __m128i bestSquares = _mm_set1_epi32( INT_MAX );
        // Index of the pixels currently in bestSquares
        __m128i bestIndices = _mm_set1_epi32( -1 );

        const __m128i lowMask = _mm_set1_epi16( 0xFF );

        // For lanes where dp < bestSquares, update bestSquares and bestIndices vectors
        void updateFields( __m128i dp, __m128i indices )
        {
            const __m128i lt = _mm_cmplt_epi32( dp, bestSquares );
#ifndef __SSE4_1__
            bestSquares = _mm_or_si128( _mm_and_si128( lt, dp ), _mm_andnot_si128( lt, bestSquares ) );
            bestIndices = _mm_or_si128( _mm_and_si128( lt, indices ), _mm_andnot_si128( lt, bestIndices ) );
#else
            bestSquares = _mm_min_epi32( dp, bestSquares );
            bestIndices = _mm_blendv_epi8( bestIndices, indices, lt );
#endif
        }

    public:

        Acc( const Pixel& ref )
        {
            __m128i tmp = _mm_set1_epi32( *(const int*)( &ref ) );
            lowRef = _mm_and_si128( tmp, lowMask );
            highRef = _mm_srli_epi16( tmp, 8 );
        }

        // Update the accumulator with another 4 pixels
        void update( __m128i pixels, __m128i indices )
        {
            // Split into two vectors with 16-bit lanes:
            // low contains blue and red channels, high contains green and alpha
            __m128i low = _mm_and_si128( pixels, lowMask );
            __m128i high = _mm_srli_epi16( pixels, 8 );

            // Compute difference with the reference value we're after
            low = _mm_sub_epi16( low, lowRef );
            high = _mm_sub_epi16( high, highRef );

            // Compute squares as 32-bit numbers, add adjacent pairs
            low = _mm_madd_epi16( low, low );
            high = _mm_madd_epi16( high, high );
            // Adding them results in the dot product (sum of squares) for all 4 channels
            __m128i dp = _mm_add_epi32( low, high );

            // Update the state
            updateFields( dp, indices );
        }

        // Compute horizontal minimum across lanes in these accumulators
        uint32_t reduce( int& bestDp )
        {
            // Swap low/high halves
            __m128i s2 = _mm_shuffle_epi32( bestSquares, _MM_SHUFFLE( 1, 0, 3, 2 ) );
            __m128i i2 = _mm_shuffle_epi32( bestIndices, _MM_SHUFFLE( 1, 0, 3, 2 ) );
            updateFields( s2, i2 );

            // Swap even/odd lanes
            s2 = _mm_shuffle_epi32( bestSquares, _MM_SHUFFLE( 2, 3, 0, 1 ) );
            i2 = _mm_shuffle_epi32( bestIndices, _MM_SHUFFLE( 2, 3, 0, 1 ) );
            updateFields( s2, i2 );

            // Return lowest lanes from both vectors
            bestDp = _mm_cvtsi128_si32( bestSquares );
            return (uint32_t)_mm_cvtsi128_si32( bestIndices );
        }
    };

    Acc impl{ ref };

    const size_t lengthAligned = ( length / 4 ) * 4;
    size_t i;
    __m128i currentIndices = _mm_setr_epi32( 0, 1, 2, 3 );
    for( i = 0; i < lengthAligned; i += 4 )
    {
        // Load 4 source pixels
        __m128i src = _mm_loadu_si128( ( const __m128i* )( rsi + i ) );
        // Update things
        impl.update( src, currentIndices );
        // Increment index vector by 4 pixels
        currentIndices = _mm_add_epi32( currentIndices, _mm_set1_epi32( 4 ) );
    }

    const size_t remainder = length % 4;
    if( remainder == 0 )
    {
        // The input was a multiple of 4 pixels
        return impl.reduce( bestValue );
    }

    const int* const pi = (const int*)( rsi + i );
    __m128i rv;
    if( lengthAligned > 0 )
    {
        // We had at least 4 elements on input, can do unaligned load with negative offset
        size_t offset = 4 - remainder;
        currentIndices = _mm_sub_epi32( currentIndices, _mm_set1_epi32( (int)offset ) );
        rv = _mm_loadu_si128( ( const __m128i* )( pi - offset ) );
    }
    else
    {
        // Less than 4 elements on input, doing a partial load and broadcasting the last element.
        // The duplicated lanes need matching indices, otherwise the function may return
        // an out-of-range index when the minimum is found in a duplicated lane.
        switch( remainder )
        {
        case 1:
            rv = _mm_set1_epi32( pi[ 0 ] );
            currentIndices = _mm_setzero_si128();
            break;
        case 2:
            rv = _mm_loadl_epi64( ( const __m128i* )pi );
            rv = _mm_shuffle_epi32( rv, _MM_SHUFFLE( 1, 1, 1, 0 ) );
            currentIndices = _mm_setr_epi32( 0, 1, 1, 1 );
            break;
        case 3:
            rv = _mm_loadl_epi64( ( const __m128i* )pi );
#ifndef __SSE4_1__
            rv = _mm_unpacklo_epi64( rv, _mm_set1_epi32( pi[ 2 ] ) );
#else
            rv = _mm_insert_epi32( rv, pi[ 2 ], 2 );
            rv = _mm_shuffle_epi32( rv, _MM_SHUFFLE( 2, 2, 1, 0 ) );
#endif
            currentIndices = _mm_setr_epi32( 0, 1, 2, 2 );
            break;
        }
    }

    impl.update( rv, currentIndices );
    return impl.reduce( bestValue );
}
Soonts
  • Note that the code in the question does *not* record which `i` had the min squared difference, just what it was. You probably assumed that was an oversight or obvious missing feature, and leaving it out is a pretty localized optimization of 2 instructions in `updateFields`, and simplifying `reduce`. – Peter Cordes May 06 '22 at 19:26
  • @PeterCordes Yeah, I assumed the OP is doing color quantization or something similar, and needs the best index not just the smallest dot product. I included that part regardless because it’s trivial to remove, but relatively hard to implement, at least for people unfamiliar with SIMD. – Soonts May 06 '22 at 19:32
  • These aren't dot products, you're subtracting two different things and then squaring that. These are sums of squared differences across the components of one pixel. But yeah I agree it's likely most uses for it will need to know where that pixel was; I was surprised the code in the question *didn't* record `i`. – Peter Cordes May 06 '22 at 19:35
  • Your class has `const __m128i lowMask = _mm_set1_epi16( 0xFF );` as a per-instance member? I guess it's ok since the class is only defined and used inside the body of another function, so its members will basically become local vars. (And `static` would actually be worse because compilers are dumb and will make a constructor to copy a constant into the BSS.) – Peter Cordes May 06 '22 at 19:40
  • Another strategy for the remainder: a final unaligned vector that ends at the last pixel, i.e. a load from `rsi + length - 4`; overlapping work already done is fine here. But that only works when `length >= 4`, so you'd need a fallback strategy to support tiny lengths (perhaps scalar, or doing a vector load after checking alignment, and byte-shifting, but that takes a lot of code to do efficiently.) – Peter Cordes May 06 '22 at 19:47
  • @PeterCordes Yeah, that’s an instance member. In my experience, it’s safe to assume modern compilers gonna remove the complete classes like that, instead placing everything in vector registers. Take a look: https://godbolt.org/z/oM6Tdaqq6 – Soonts May 06 '22 at 20:08
  • @PeterCordes Also, good point about the unaligned load tactic for the remainder, updated. – Soonts May 06 '22 at 20:34
  • @Soonts - Thanks for the code. However, is that a replacement for the code in the original problem? I am finding it quite hard to understand due to my limited SIMD knowledge, so I will go line by line through the solution. Are there any extra points/assumptions I should be aware of while understanding your logic? Also, will this code work as-is on Windows 64-bit and Linux 64-bit, and on Intel and ARM processors? – Helena May 07 '22 at 09:30
  • 1
    @Helena Q1 - yes, call the function, passing `*pPixel`, `&mycolormap`, `newcolors`, `oldD` arguments. Q2 - The OS does not matter, but the CPU must support SSE 4.1: only works on Intel or AMD, won't compile for ARM. – Soonts May 07 '22 at 11:55
  • @Soonts : And will it ignore/work on machines which don't support SSE4, or will I run into failures? Is there any way to determine if SSE4 is supported on a machine? – Helena May 07 '22 at 15:11
  • 1
    @Helena The code won’t compile for ARM. The code will crash with “illegal instruction” runtime exception if the CPU doesn’t support SSE 4.1. Ideally, use `__cpuidex` intrinsic to test the support before calling the function. BTW, SSE 4.1 is from 2007. According to Steam hardware survey, it’s supported by 99.04% computers in the world https://store.steampowered.com/hwsurvey expand “Other Settings” panel. – Soonts May 07 '22 at 15:30
  • @Soonts : That's good to know. So what would be the best way to skip this function of yours where it isn't supported? I think I have to use the original code after checking that SSE4.1 isn't supported. Is this check possible via C++ code? – Helena May 07 '22 at 15:35
  • @Helena If you want this to run on ancient processors like the first generation of Intel Core 2, modify the code to only require SSE2. That thing is a part of AMD64 instruction set, guaranteed to be present in all 64-bit PC processors. The only two SSE 4.1 instructions are `_mm_min_epi32` and `_mm_blendv_epi8`, in that case they can be replaced with 3 bitwise instructions each, `_mm_and_si128`, `_mm_andnot_si128`, and `_mm_or_si128`. – Soonts May 07 '22 at 15:50
  • @Helena See the update – Soonts May 07 '22 at 15:59
  • 1
    @Soonts: `#if 1` is a weird choice, unless maybe you're catering for MSVC. `#ifdef __SSE4_1__` would choose based on what was enabled on the compiler command line for gcc/clang. ([How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?](https://stackoverflow.com/q/28939652)) (Although it wouldn't change for an `__attribute__((target("sse4.1")))` or function multiversioning setup, but neither would `#if 1`). – Peter Cordes May 07 '22 at 16:06
  • 1
    @PeterCordes Updated. And yes, I’m using VS2022 even when the target platform is Linux, due to usability and the debugger. – Soonts May 07 '22 at 17:11