
I need to compare a large number of similar images of small size (up to 200x200). So I am trying to implement the SSIM (structural similarity, see https://en.wikipedia.org/wiki/Structural_similarity) algorithm. SSIM requires calculating the covariance of two 8-bit gray images. A trivial implementation looks like:

float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
    float sum = 0;
    for(size_t i = 0; i < size; ++i)
        sum += (x[i] - averageX) * (y[i] - averageY);
    return sum / size;
}

But its performance is poor, so I hope to improve it using SIMD or CUDA (I have heard that this is possible). Unfortunately, I have no experience with either. What would that look like, and where should I start?

ErmIg

  • Did you allow your compiler to aggressively optimize (e.g. `-funsafe-math-optimizations` or `-Ofast` on gcc)? If you don't, the compiler can't do much about the code, as it can't vectorize due to floating-point math not being associative. – EOF Feb 05 '16 at 11:04
  • Great point: with `-O3 -ffast-math -march=haswell`, [clang auto-vectorizes the OP's scalar code a lot like Ermlg's earlier answer](http://goo.gl/Db39jB). It uses `pmovzx` loads, and FMA, and takes advantage of AVX2 to use 256b vectors. That godbolt link has some improvements I'm working on, like a more efficient horizontal sum. – Peter Cordes Feb 05 '16 at 12:25

2 Answers


I have another nice solution!

First, I want to mention some mathematical identities:

averageX = Sum(x[i])/size;
averageY = Sum(y[i])/size;

And therefore:

Sum((x[i] - averageX)*(y[i] - averageY))/size = 

Sum(x[i]*y[i])/size - Sum(x[i]*averageY)/size - 
Sum(averageX*y[i])/size + Sum(averageX*averageY)/size = 

Sum(x[i]*y[i])/size - averageY*Sum(x[i])/size - 
averageX*Sum(y[i])/size + averageX*averageY*Sum(1)/size =   

Sum(x[i]*y[i])/size - averageY*averageX - 
averageX*averageY + averageX*averageY = 

Sum(x[i]*y[i])/size - averageY*averageX;     

This allows us to modify our algorithm:

float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
    uint32_t sum = 0; // For images larger than 256x256 you have to use uint64_t, otherwise the sum can overflow.
    for(size_t i = 0; i < size; ++i)
        sum += x[i]*y[i];
    return (float)sum / size - averageY*averageX; // cast before dividing to avoid integer truncation
} 

And only after that can we use SIMD (I used SSE2):

#include <emmintrin.h>

inline __m128i SigmaXY(__m128i x, __m128i y)
{
    // Widen the bytes to 16-bit; _mm_madd_epi16 multiplies pairwise and adds
    // adjacent products into 32-bit lanes.
    __m128i lo = _mm_madd_epi16(_mm_unpacklo_epi8(x, _mm_setzero_si128()), _mm_unpacklo_epi8(y, _mm_setzero_si128()));
    __m128i hi = _mm_madd_epi16(_mm_unpackhi_epi8(x, _mm_setzero_si128()), _mm_unpackhi_epi8(y, _mm_setzero_si128()));
    return _mm_add_epi32(lo, hi);
}

float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
    uint32_t sum = 0;
    size_t i = 0, alignedSize = size/16*16;
    if(size >= 16)
    {
        __m128i sums = _mm_setzero_si128();
        for(; i < alignedSize; i += 16)
        {
            __m128i _x = _mm_loadu_si128((__m128i*)(x + i));
            __m128i _y = _mm_loadu_si128((__m128i*)(y + i));
            sums = _mm_add_epi32(sums, SigmaXY(_x, _y));
        }
        uint32_t _sums[4];
        _mm_storeu_si128((__m128i*)_sums, sums);
        sum = _sums[0] + _sums[1] + _sums[2] + _sums[3];
    } 
    for(; i < size; ++i)
        sum += x[i]*y[i];
    return (float)sum / size - averageY*averageX; // cast before dividing to avoid integer truncation
}
ErmIg

  • Wow, that's funky that you don't need the average until the end of the loop. That means you could calculate the averages on the fly, too. Either with `psadbw`, or with `_mm_add_epi16` while you have it unpacked for `madd`. – Peter Cordes Feb 05 '16 at 10:39
  • Yes, of course; it is possible to calculate the covariance and the first- and second-order moments in the same loop. – ErmIg Feb 05 '16 at 12:24
Here is a SIMD implementation of the algorithm (I used SSE4.1):

#include <smmintrin.h>

template <int shift> inline __m128 SigmaXY(const __m128i & x, const __m128i & y, __m128 & averageX, __m128 & averageY)
{
    __m128 _x = _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(x, shift)));
    __m128 _y = _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(y, shift)));
    return _mm_mul_ps(_mm_sub_ps(_x, averageX), _mm_sub_ps(_y, averageY));
}

float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
    float sum = 0;
    size_t i = 0, alignedSize = size/16*16;
    if(size >= 16)
    {
        __m128 sums = _mm_setzero_ps();
        __m128 avgX = _mm_set1_ps(averageX);
        __m128 avgY = _mm_set1_ps(averageY);
        for(; i < alignedSize; i += 16)
        {
            __m128i _x = _mm_loadu_si128((__m128i*)(x + i));
            __m128i _y = _mm_loadu_si128((__m128i*)(y + i));
            sums = _mm_add_ps(sums, SigmaXY<0>(_x, _y, avgX, avgY));
            sums = _mm_add_ps(sums, SigmaXY<4>(_x, _y, avgX, avgY));
            sums = _mm_add_ps(sums, SigmaXY<8>(_x, _y, avgX, avgY));
            sums = _mm_add_ps(sums, SigmaXY<12>(_x, _y, avgX, avgY));
        }
        float _sums[4];
        _mm_storeu_ps(_sums, sums);
        sum = _sums[0] + _sums[1] + _sums[2] + _sums[3];
    }
    for(; i < size; ++i)
        sum += (x[i] - averageX) * (y[i] - averageY);
    return sum / size;
}

I hope it will be useful to you.

ErmIg
  • (I started this comment ages ago, before you posted the other answer). It might be nicer to use `_mm_cvtepu8_epi32` (`PMOVZX`) as a load, instead of shifting the result of a full vector load, but that's difficult to accomplish with intrinsics. It shouldn't make much difference when the input pointers are 16B-aligned. [gcc and clang make some fairly nice code out of this](http://goo.gl/VQiPLn). gcc will even use FMA when available, combining the `mul_ps` in `SigmaXY` with the `add_ps` into the accumulator. – Peter Cordes Feb 15 '16 at 18:47
  • Using multiple accumulators might help allow more overlap of independent work: because the loop-carried dependency on `sums` is not that short compared to the throughput for computing independent `SigmaXY` results. Also, the horizontal sum can be done more efficiently: http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026 – Peter Cordes Feb 15 '16 at 18:47
  • @Peter It doesn't matter, because there is a better solution that avoids floating-point numbers. – ErmIg Feb 16 '16 at 05:44
  • @ErmIg: agreed. I already had most of that typed before you posted the integer version, but got side-tracked with optimal horizontal-add. I eventually decided to post it anyway as comments, instead of just abandoning, since anyone that learns anything from this answer might benefit from seeing those comments. – Peter Cordes Feb 16 '16 at 05:54