How to count average of 3 integers via mmx?

Question

I have a problem, hope that you will help. I have a task to perform grayscaling of image (sent from Java) using mmx, xmm or sse commands. I've already done this in C and asm (taking R, G and b using logics and then finding avg), now I need to use mmx/xmm/sse AND to increase performance (otherwise, professor refuses to take it, and tomorrow is an exam day).

Grayscaling is taking R,G and B of one pixel and replacing them with average of R, G and B. It's easy to do this by simply combining three and doing idiv, but there is no division in mmx, so I need to improvise, and I have no ideas.

The problem with xmm is that simple "movaps xmm0, [rel v1]" gives me crash, and I have kinda no time to explore it, so it would be nice to do this via mmx only.

Yesterday I've written something that uses mmx, but it worked 30x slower than C code :( Well, I do not need epic performance neither-just something that works okay.

Any ideas? Maybe division can be done via shifting or something like this? Would really appreciate help, and thanks.

Does it have to be (R+G+B)/3? Also, `movaps` probably crashed because the address was misaligned. — harold, Jul 01 '14 at 15:35
That formula is wrong, it should use weights http://stackoverflow.com/questions/687261/converting-rgb-to-grayscale-intensity — stark, Jul 01 '14 at 15:36
@harold Maybe formula is not fine, but it works and works good (at least in C). I could try something different if you hint on what to try. — JonathanX64, Jul 01 '14 at 15:43
@stark not sure about using floating point in mmx... How to place data and how to perform calculations? — JonathanX64, Jul 01 '14 at 15:45
If you're allowed to weigh G more than R and B and you cheat a little, you can pavgb(pavgb(r, b), g), otherwise you can try the old "division by multiplication" trick (pmulhw by some well-chosen constant (I think), note that this requires a conversion to shorts somewhere). There's no floating point in mmx. — harold, Jul 01 '14 at 15:46
@harold I think I can cheat a little. Yes, that pavgb thing looks nice, will give it a shot. About multiplying-wel, if I'm dividing by 3, I can multiply on 0.33, but I don't know how to store floats and how to calculate with them in asm/mmx :( — JonathanX64, Jul 01 '14 at 15:52
@user3794486 well it works something like this, you multiply by 0x10000/3 (so 0x5555) and then you shift right by 16. But you can leave out the shift by 16 by asking for the high half in the first place. — harold, Jul 01 '14 at 15:54
@harold will it work with 0x100/3 (0x55 then) and shift right on 8 bits? — JonathanX64, Jul 01 '14 at 16:32
In principle it would, but that's a bit inconvenient with mmx, with 0x10000/3 you can get the right shift for free and using 0x100/3 doesn't make the cast to short unnecessary anyway — harold, Jul 01 '14 at 16:33
@harold I'm just not sure if 765(theoretical maximum in this case)*5555h will fit into 16b word. Well, anyway, your solution fits perfectly in this task, I think I will finish this task with multiplying your way :) thank you. — JonathanX64, Jul 01 '14 at 16:40
@harold I'm taking four pixels and fitting them into mmX as 16b words, like mm0=r1,r2,r3,r4, mm1=g1,g2,g3,g4, and so on. Or you think that it would be faster to take two of them and multiply by 0x5555h and not to use right shift? Can you describe this more, please?) — JonathanX64, Jul 01 '14 at 16:43
It's not supposed to fit - the whole point is that you get the upper half of the 32bit product, and then it's already "shifted right by 16" because it's the high half. Maybe you should just do the pavgb thing though, that's a lot simpler and arguably better anyway since it weighs the green component higher. — harold, Jul 01 '14 at 16:43
@harold I've tried pavgw ting and it gives significant errors, depends on source integers. I will use it if I won't finish with multiplying by 3:00 am, I guess, but I've decided to go with multiplying. So, what's the best command to do mul in mmx in this case? I have, like, mm0=av1,av2,av3,av4 and mm1=5555555555555555h (or not?) what command I should use? — JonathanX64, Jul 01 '14 at 16:51
Well that way wouldn't average them, of course. It's approximately as if you weighed them (R/4,G/2,B/4) (but with funny rounding). If it still gives errors when that's your expected value, then something is wrong. For the multiplication, see `pmulhw` — harold, Jul 01 '14 at 17:07
@harold yup, seems like I finished multiplication and all other mmx stuff in this lab and everything works. Thank you again) — JonathanX64, Jul 01 '14 at 17:19
@user3794486 Can you please revise your question to something like: "How to average 3 RGB integers via SSE?". MMX is kind of obsolete... Can you also modify tags: remove nasm, mmx, xmm, and add C and RGB? — Rotem, Jun 25 '16 at 08:04

score 2 · Answer 1 · edited Jun 25 '16 at 07:50

The attached code uses SSE optimization.
The implementation use C intrinsic - no Assembly...

For simplicity I assumed R, G and B are three different planes,
stored in memory as R matrix, G matrix and B matrix, and not in data-ordered r,g,b,r,g,b,r,g,b...
The code uses fixed point implementation for better performance.
Important notices:

Multiply by (1/3) is much for efficient than dividing by 3.
Adding 0.5 before integer cast can be used for rounding a positive value.
Fixed point implementation of (1/3) scaling performed by expanding, scaling and shifting. Example: avg = (sum*scale + rounding) >> 15; [When scale = (1/3)*2^15].
_mm_mulhrs_epi16 intrinsic is doing the above operation: (x*scl + 2^14) >> 15.

Implementation comments include more explanations:

//Calculate elements average of 3 vectors R, G and B, and store result into J.
//Implementation uses SSE intrinsics for performance optimization.
//Use fixed point computations for better performance.
//R - Plain of red pixels:        RRRRRRRRRRRRRRRR
//G - Plain of green pixels:      GGGGGGGGGGGGGGGG
//B - Plain of blue pixels:       BBBBBBBBBBBBBBBB
//image_size: Total number of pixels (width * height).
//J - Destination Grayscale plane: JJJJJJJJJJJJJJJJ

//Limitations:
//1. image_size must be a multiple of 16.
void RgbAverage(const unsigned char R[],
                const unsigned char G[],
                const unsigned char B[],
                int image_size,
                unsigned char J[])
{
    int x;

/*
    //1. Plain C code:
    //--------------------
    for (x = 0; x < image_size; x++)
    {  
        //Add 0.5 for rounding (round instead of floor).
        //Multiply by (1.0/3.0) - much more efficient than dividing by 3.
        J[x] = (unsigned char)(((double)R[x] + (double)G[x] + (double)B[x])*(1.0/3.0) + 0.5);
    }
*/


/*
    //2. Plain C fixed point implementation:
    //   Read this code first, for better understanding the SSE implementation.
    const unsigned int scale = (unsigned int)((1.0/3.0)*(1 << 15) + 0.5); //scale equals 1/3 expanded by 2^15.
    const unsigned int roundig_ofs = (unsigned int)(1 << 14);   //Offset of 2^14 for rounding (equals 0.5 expanded by 2^15).

    for (x = 0; x < image_size; x++)
    {  
        unsigned int r0 = (unsigned int)R[x];
        unsigned int g0 = (unsigned int)G[x];
        unsigned int b0 = (unsigned int)B[x];
        unsigned int sum = r0 + g0 + b0;

        //Multiply by (1/3) with rounding:
        //avg = (sum*(1/3)*2^15 + 2^14) / 2^15   = floor(sum*(1/3) + 0.5)
        unsigned int avg = (sum * scale + roundig_ofs) >> 15;
        J[x] = (unsigned char)avg;
    }
*/


    //3. SSE optimized implementation:
    const unsigned int scale = (unsigned int)((1.0/3.0)*(1 << 15) + 0.5); //scale equals 1/3 expanded by 2^15.
    const __m128i vscl = _mm_set1_epi16((short)scale);  //8 packed int16 scale elements - scl_scl_scl_scl_scl_scl_scl_scl

    //Process 16 pixels per iteration.
    for (x = 0; x < image_size; x += 16)
    {  
        __m128i r15_to_r0 = _mm_loadu_si128((__m128i*)&R[x]);    //Load 16 uint8 R elements.
        __m128i g15_to_g0 = _mm_loadu_si128((__m128i*)&G[x]);    //Load 16 uint8 G elements.
        __m128i b15_to_b0 = _mm_loadu_si128((__m128i*)&B[x]);    //Load 16 uint8 B elements.

        //Unpack uint8 elements to uint16 elements:
        __m128i r7_to_r0    = _mm_unpacklo_epi8(r15_to_r0, _mm_setzero_si128());    //Lower 8 R elements
        __m128i r15_to_r8   = _mm_unpackhi_epi8(r15_to_r0, _mm_setzero_si128());    //Upper 8 R elements

        __m128i g7_to_g0    = _mm_unpacklo_epi8(g15_to_g0, _mm_setzero_si128());     //Lower 8 G elements
        __m128i g15_to_g8   = _mm_unpackhi_epi8(g15_to_g0, _mm_setzero_si128());    //Upper 8 G elements

        __m128i b7_to_b0    = _mm_unpacklo_epi8(b15_to_b0, _mm_setzero_si128());    //Lower 8 B elements
        __m128i b15_to_b8   = _mm_unpackhi_epi8(b15_to_b0, _mm_setzero_si128());    //Upper 8 B elements

        __m128i sum7_to_sum0    = _mm_add_epi16(_mm_add_epi16(r7_to_r0, g7_to_g0), b7_to_b0);       //Lower 8 sum elements.
        __m128i sum15_to_sum8   = _mm_add_epi16(_mm_add_epi16(r15_to_r8, g15_to_g8), b15_to_b8);    //Upper 8 sum elements.

        //Most important trick:
        //_mm_mulhrs_epi16 intrinsic does exactly what we wanted i.e: avg = (sum*scl + (1<<14)) >> 15.
        //Each SSE instruction perform the described operation on 8 elements.
        __m128i avg7_to_avg0    = _mm_mulhrs_epi16(sum7_to_sum0, vscl);     //Lower 8 average elements.
        __m128i avg15_to_avg8   = _mm_mulhrs_epi16(sum15_to_sum8, vscl);    //Upper 8 average elements.

        //Pack the result to 16 uint8 elements.
        __m128i j15_to_j0 = _mm_packus_epi16(avg7_to_avg0, avg15_to_avg8);

        _mm_storeu_si128((__m128i*)(&J[x]), j15_to_j0); //Store 16 elements of J.
    }
}

Good suggestion to operate on planar data. As I understand it, it's worth the overhead to convert if you're going to do multiple things with the same image. Otherwise, you could grayscale directly from packed RGB data with more shuffling. Or more easily, from RGBA data, so you can easily use [`pmaddubsw`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,AVX_512,Other&expand=138,140,3965,3962,3965,3962,3244&text=pmadd) to apply weights and do the first step of a horizontal sum. (component weights are missing from this code). — Peter Cordes, Jun 25 '16 at 08:15

How to count average of 3 integers via mmx?

1 Answers1