2

I have a mask (an 8-bit gray image) and I need to calculate the center of the region with a given index in the mask. To do this I need to calculate the first-order moments along the X and Y axes for this mask. Currently I'm using the following code:

void GetCenter(const uint8_t * mask, size_t stride, size_t width, size_t height, 
               uint8_t index, double * centerX, double * centerY)
{
    uint64_t sum = 0, sumX = 0, sumY = 0;
    for(size_t y = 0; y < height; ++y)
    {
        for(size_t x = 0; x < width; ++x)
        {
            if(mask[x] == index)
            {
                sum++;
                sumX += x;
                sumY += y;
            }               
        }
        mask += stride;
    }
    *centerX = sum ? (double)sumX/sum : 0;
    *centerY = sum ? (double)sumY/sum : 0;
}

And I have a question: is there any way to improve the performance of this algorithm?

  • 5
    I would say that this question should fit better on http://codereview.stackexchange.com/. – Some programmer dude Jul 09 '15 at 07:50
  • 4
    If you've profiled this and established that it's a significant bottleneck (and that you have all appropriate compiler optimisations enabled) then you might want to consider SIMD optimisation, if you can assume a particular CPU architecture. – Paul R Jul 09 '15 at 08:00
  • Even if a cross-platform solution is needed, there are still many [cross-platform SIMD libraries](http://stackoverflow.com/q/2122573/995714) available, although the result will not be as efficient as optimizing for a specific architecture. [Fast Cross-Platform C/C++ Image Processing Libraries](http://stackoverflow.com/q/796364/995714) – phuclv Jul 09 '15 at 08:15
  • 2
    This isn't the code from my project. I simplified it significantly before posting here. –  Jul 09 '15 at 09:04

2 Answers

5

There is a way to greatly (more than ten times) improve the performance of this algorithm. To do it you need to use the CPU's SIMD instructions, such as SSE2, AVX2, Altivec, or NEON. I wrote an example using SSE2 instructions (the AVX2 code would be similar):

#include <emmintrin.h> // SSE2 intrinsics
#include <stdint.h>

const __m128i K_0 = _mm_setzero_si128();
const __m128i K8_1 = _mm_set1_epi8(1);
const __m128i K16_1 = _mm_set1_epi16(1);
const __m128i K16_8 = _mm_set1_epi16(8);
const __m128i K16_I = _mm_setr_epi16(0, 1, 2, 3, 4, 5, 6, 7); // x offsets 0..7

// For every active 16-bit lane of the mask, accumulate its x and y coordinates;
// _mm_madd_epi16 with K16_1 sums adjacent 16-bit lanes into 32-bit lanes.
inline void AddMoments(const __m128i & mask, const __m128i & x, const __m128i & y, 
    __m128i & sumX, __m128i & sumY)
{
    sumX = _mm_add_epi32(sumX, _mm_madd_epi16(_mm_and_si128(mask, x), K16_1));
    sumY = _mm_add_epi32(sumY, _mm_madd_epi16(_mm_and_si128(mask, y), K16_1));
}

// Horizontal sum of the four 32-bit lanes of an XMM register.
inline int ExtractSum(__m128i a)
{
    return _mm_cvtsi128_si32(a) + _mm_cvtsi128_si32(_mm_srli_si128(a, 4)) +
        _mm_cvtsi128_si32(_mm_srli_si128(a, 8)) + _mm_cvtsi128_si32(_mm_srli_si128(a, 12));
}

void GetCenter(const uint8_t * mask, size_t stride, size_t width, size_t height, 
    uint8_t index, double * centerX, double * centerY)
{
    // Process the bulk of each row in blocks of 16 pixels (one XMM register).
    size_t alignedWidth = width & ~(sizeof(__m128i) - 1);
    const __m128i _index = _mm_set1_epi8(index);

    uint64_t sum = 0, sumX = 0, sumY = 0;
    for(size_t y = 0; y < height; ++y)
    {
        size_t x = 0;

        __m128i _x = K16_I;                    // x coordinates of the current block
        __m128i _y = _mm_set1_epi16((short)y); // note: assumes height < 32768
        __m128i _sum = K_0;
        __m128i _sumX = K_0;
        __m128i _sumY = K_0;
        for(; x < alignedWidth; x += sizeof(__m128i))
        {
            // 0x01 in every byte where mask[x] == index, 0x00 elsewhere:
            __m128i _mask = _mm_and_si128(_mm_cmpeq_epi8(_mm_loadu_si128((__m128i*)(mask + x)), _index), K8_1);
            // Count active pixels: byte-wise sum of absolute differences against zero.
            _sum = _mm_add_epi64(_sum, _mm_sad_epu8(_mask, K_0));
            // Widen the low and high 8 bytes to 16-bit lane masks and accumulate moments:
            AddMoments(_mm_cmpeq_epi16(_mm_unpacklo_epi8(_mask, K_0), K16_1), _x, _y, _sumX, _sumY);
            _x = _mm_add_epi16(_x, K16_8);
            AddMoments(_mm_cmpeq_epi16(_mm_unpackhi_epi8(_mask, K_0), K16_1), _x, _y, _sumX, _sumY);
            _x = _mm_add_epi16(_x, K16_8);
        }
        sum += ExtractSum(_sum);
        sumX += ExtractSum(_sumX);
        sumY += ExtractSum(_sumY);

        // Scalar tail for the remaining width - alignedWidth pixels:
        for(; x < width; ++x)
        {
            if(mask[x] == index)
            {
                sum++;
                sumX += x;
                sumY += y;
            }               
        }
        mask += stride;
    }
    *centerX = sum ? (double)sumX/sum : 0;
    *centerY = sum ? (double)sumY/sum : 0;
}

P.S. There is a simpler, cross-platform way to improve performance: use an external library (http://simd.sourceforge.net/):

void GetCenter(const uint8_t * mask, size_t stride, size_t width, size_t height, 
    uint8_t index, double * centerX, double * centerY)
{
    uint64_t sum, sumX, sumY, sumXX, sumXY, sumYY;
    ::SimdGetMoments(mask, stride, width, height, index, 
        &sum, &sumX, &sumY, &sumXX, &sumXY, &sumYY);
    *centerX = sum ? (double)sumX/sum : 0;
    *centerY = sum ? (double)sumY/sum : 0;
}

An implementation using _mm_movemask_epi8 and 8-bit lookup tables:

// For every possible 8-bit mask value, g_sum[i] holds the number of set bits
// and g_sumX[i] the sum of the positions of the set bits (at most 0+1+...+7 = 28).
uint8_t g_sum[1 << 8], g_sumX[1 << 8];
bool Init()
{
    for(int i = 0, n = 1 << 8; i < n; ++i)
    {
        g_sum[i] = 0;
        g_sumX[i] = 0;
        for(int j = 0; j < 8; ++j)
        {
            g_sum[i] += (i >> j) & 1;
            g_sumX[i] += ((i >> j) & 1)*j;
        }
    }
    return true;
}
bool g_inited = Init();

// mask holds one comparison bit per pixel; x is the image column of its bit 0.
inline void AddMoments(uint8_t mask, size_t x, size_t y, 
                       uint64_t & sum, uint64_t & sumX, uint64_t & sumY)
{
    int value = g_sum[mask];
    sum += value;
    sumX += x * value + g_sumX[mask];
    sumY += y * value;   
}

void GetCenter(const uint8_t * mask, size_t stride, size_t width, size_t height, 
    uint8_t index, double * centerX, double * centerY)
{
    size_t alignedWidth = width & ~(sizeof(__m128i) - 1);
    const __m128i _index = _mm_set1_epi8(index);

    union PackedValue
    {
        uint8_t u8[4];
        uint16_t u16[2];
        uint32_t u32;
    } _mask;

    uint64_t sum = 0, sumX = 0, sumY = 0;
    for(size_t y = 0; y < height; ++y)
    {
        size_t x = 0;

        for(; x < alignedWidth; x += sizeof(__m128i))
        {
            // One bit per pixel: bit j of the movemask result is set if mask[x + j] == index.
            _mask.u32 = _mm_movemask_epi8(_mm_cmpeq_epi8(
                        _mm_loadu_si128((__m128i*)(mask + x)), _index));
            AddMoments(_mask.u8[0], x, y, sum, sumX, sumY);     // pixels x .. x+7
            AddMoments(_mask.u8[1], x + 8, y, sum, sumX, sumY); // pixels x+8 .. x+15
        }

        for(; x < width; ++x)
        {
            if(mask[x] == index)
            {
                sum++;
                sumX += x;
                sumY += y;
            }               
        }
        mask += stride;
    }
    *centerX = sum ? (double)sumX/sum : 0;
    *centerY = sum ? (double)sumY/sum : 0;
}

An implementation using _mm_movemask_epi8 and 16-bit lookup tables:

// Same precomputation, but indexed by all 16 bits of the movemask result
// (counts are at most 16, moments at most 0+1+...+15 = 120).
uint16_t g_sum[1 << 16], g_sumX[1 << 16];
bool Init()
{
    for(int i = 0, n = 1 << 16; i < n; ++i)
    {
        g_sum[i] = 0;
        g_sumX[i] = 0;
        for(int j = 0; j < 16; ++j)
        {
            g_sum[i] += (i >> j) & 1;
            g_sumX[i] += ((i >> j) & 1)*j;
        }
    }
    return true;
}
bool g_inited = Init();

inline void AddMoments(uint16_t mask, size_t x, size_t y, 
                       uint64_t & sum, uint64_t & sumX, uint64_t & sumY)
{
    int value = g_sum[mask];
    sum += value;
    sumX += x * value + g_sumX[mask];
    sumY += y * value;   
}

void GetCenter(const uint8_t * mask, size_t stride, size_t width, size_t height, 
    uint8_t index, double * centerX, double * centerY)
{
    size_t alignedWidth = width & ~(sizeof(__m128i) - 1);
    const __m128i _index = _mm_set1_epi8(index);

    union PackedValue
    {
        uint8_t u8[4];
        uint16_t u16[2];
        uint32_t u32;
    } _mask;

    uint64_t sum = 0, sumX = 0, sumY = 0;
    for(size_t y = 0; y < height; ++y)
    {
        size_t x = 0;

        for(; x < alignedWidth; x += sizeof(__m128i))
        {
            // One bit per pixel; a single 16-bit lookup covers the whole 16-pixel block.
            _mask.u32 = _mm_movemask_epi8(_mm_cmpeq_epi8(
                        _mm_loadu_si128((__m128i*)(mask + x)), _index));
            AddMoments(_mask.u16[0], x, y, sum, sumX, sumY);
        }

        for(; x < width; ++x)
        {
            if(mask[x] == index)
            {
                sum++;
                sumX += x;
                sumY += y;
            }               
        }
        mask += stride;
    }
    *centerX = sum ? (double)sumX/sum : 0;
    *centerY = sum ? (double)sumY/sum : 0;
}

Performance comparison for a 1920x1080 image:

Base version:     8.261 ms
1st optimization: 0.363 ms (22 times faster)
2nd optimization: 0.280 ms (29 times faster)
3rd optimization: 0.299 ms (27 times faster)
4th optimization: 0.325 ms (25 times faster)

As you can see above, the code using 8-bit lookup tables performs better than the code using 16-bit lookup tables. The external library is better still, even though it additionally calculates the second-order moments.

ErmIg
  • Wow! Your first implementation is more than 10 times faster than my code. –  Jul 09 '15 at 08:56
  • The second implementation is better: it is more than 20 times faster than my code. Possibly this is the best solution to my problem. –  Jul 09 '15 at 09:26
1

Another acceleration technique is run-length coding.

You can decompose the rows into horizontal runs where the mask is active. You can detect the runs on the fly, or precompute them and store the image in that form, if that makes sense.

Then a run can be accumulated as a whole. Let a run start at (X, Y) and have length L. Its pixels have abscissas X, X+1, ..., X+L-1, which sum to (2*X + L - 1)*L/2, so to keep the arithmetic integral, accumulate twice the X moment:

Sum += L;
SumX += (2 * X + L - 1) * L;
SumY += Y * L;

In the end, divide SumX by 2.

The longer the runs, the more effective the trick is.
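
A minimal sketch of the on-the-fly variant, assuming the same arguments as the question's GetCenter (the name GetCenterRuns and the run-scanning loop are illustrative, not from the answer):

#include <stdint.h>
#include <stddef.h>

void GetCenterRuns(const uint8_t * mask, size_t stride, size_t width, size_t height,
                   uint8_t index, double * centerX, double * centerY)
{
    uint64_t sum = 0, sumX2 = 0, sumY = 0; // sumX2 accumulates twice the X moment
    for(size_t y = 0; y < height; ++y)
    {
        size_t x = 0;
        while(x < width)
        {
            if(mask[x] != index) { ++x; continue; }
            size_t start = x; // a run of active pixels begins here
            while(x < width && mask[x] == index)
                ++x;
            uint64_t L = x - start; // run length
            sum += L;
            sumX2 += (2 * (uint64_t)start + L - 1) * L; // twice the sum of start..start+L-1
            sumY += (uint64_t)y * L;
        }
        mask += stride;
    }
    *centerX = sum ? (double)sumX2 / (2.0 * sum) : 0;
    *centerY = sum ? (double)sumY / sum : 0;
}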


Using SSE2 or later, you can try the instruction PMOVMSKB (Move Byte Mask).

First compare 16 mask pixels to the (replicated) index value to get 16 comparison results. Then pack these into a 16-bit number, using the magical instruction.

Then, using two precomputed lookup tables, perform the accumulations in scalar mode.

One lookup table gives you the count of active mask pixels, and the other gives you the sum of the abscissas of the active pixels, i.e. the X moment.

Something like

int PackedValue = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_loadu_si128((__m128i*)&Mask[X]), ReplicatedIndex));
Sum += Count[PackedValue];                             // number of active pixels in the block
SumX += X * Count[PackedValue] + MomentX[PackedValue]; // block offset plus in-block moment
SumY += Y * Count[PackedValue];

Depending on the amount of memory you are willing to spend, the lookup tables can have byte indexes (256 entries, use the table twice) or word indexes (65536 entries). In both cases, the count and moment values fit in a single byte (0 to 8/16 and 0 to 28/120 respectively).
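
For reference, a minimal sketch of how such tables could be precomputed for byte indexes; Count and MomentX are the names used above, and bit j of the index is taken to stand for the pixel at offset j within its 8-pixel group:

#include <stdint.h>

uint8_t Count[1 << 8];   // number of active pixels in an 8-bit mask
uint8_t MomentX[1 << 8]; // sum of the offsets of the active pixels (at most 28)

void InitTables()
{
    for(int i = 0; i < (1 << 8); ++i)
    {
        Count[i] = 0;
        MomentX[i] = 0;
        for(int j = 0; j < 8; ++j)
        {
            if((i >> j) & 1) // pixel at offset j is active in mask i
            {
                Count[i] += 1;
                MomentX[i] += j;
            }
        }
    }
}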

An AVX2 implementation is also possible, packing 32 pixels at a time. Lookup tables with doubleword indexes seem unreasonable, though. :-)
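
A minimal sketch of the AVX2 inner step, assuming the byte-indexed Count/MomentX tables above; whether the four scalar lookups eat the SIMD gain is exactly what the comments below debate:

#include <immintrin.h> // AVX2 intrinsics
#include <stdint.h>

extern uint8_t Count[1 << 8], MomentX[1 << 8]; // the byte-indexed tables sketched above

inline void AddMoments32(const uint8_t * mask, size_t x, size_t y, __m256i replicatedIndex,
                         uint64_t & sum, uint64_t & sumX, uint64_t & sumY)
{
    // One comparison bit per pixel, 32 pixels at a time:
    unsigned packed = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(
        _mm256_loadu_si256((const __m256i*)(mask + x)), replicatedIndex));
    for(int k = 0; k < 4; ++k) // four 8-pixel groups, accumulated in scalar mode
    {
        unsigned byteMask = (packed >> (8 * k)) & 0xFF;
        sum += Count[byteMask];
        sumX += (x + 8 * k) * Count[byteMask] + MomentX[byteMask];
        sumY += y * Count[byteMask];
    }
}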

  • 2
    Can you explain how these tables (Count and MomentX) were initialized? –  Jul 09 '15 at 09:31
  • 1
    Example with an 8-bit index: 00110010 => Count = 1+1+1 = 3, MomentX = 2+3+6 = 11. –  Jul 09 '15 at 10:22
  • As you can see above, the code using 8-bit lookup tables performs better than the code using 16-bit lookup tables. But the external library is better anyway, even though it additionally calculates the second-order moments. – ErmIg Jul 09 '15 at 11:59
  • Is the library using SSE or AVX? –  Jul 09 '15 at 12:21
  • As follows from the description of the library (http://simd.sourceforge.net/), it supports SSE, SSE2, SSSE3, SSE4.1, SSE4.2, AVX and AVX2 for x86/x64, and VMX (Altivec) and VSX (Power7) for PowerPC. – ErmIg Jul 09 '15 at 12:33
  • I know, but that doesn't tell which version was used for the benchmarking. –  Jul 09 '15 at 13:03
  • I checked - there was an AVX2 version of the algorithm. – ErmIg Jul 09 '15 at 13:19
  • For a consistent comparison, the same instruction sets should be used. –  Jul 09 '15 at 13:23
  • Maybe your solution using the packed value would be better, but I have some doubts. _mm256_movemask_epi8 can process 32 bytes of the mask, but after that we have to perform 4 scalar lookups on the result value. I suspect this can kill all the advantages of the first step. – ErmIg Jul 09 '15 at 13:31
  • Your solution is the best for SSE2 code, but it can't be extended to AVX2 code with the same efficiency. – ErmIg Jul 09 '15 at 13:36
  • Why not? There is a VPMOVMSKB instruction. –  Jul 09 '15 at 13:50