6

As a follow-up to some previous questions on converting RGB to RGBA and ARGB to BGR, I would like to speed up an RGB to BGRA conversion with SSE. Assume a 32-bit machine; I would like to use intrinsics. I'm having difficulty aligning both the source and destination buffers to work with 128-bit registers, and am looking for other savvy vectorization solutions.

The routine to be vectorized is as follows...

    void RGB8ToBGRX8(int w, const void *in, void *out)
    {
        int i;
        int width = w;
        const unsigned char *src= (const unsigned char*) in;
        unsigned int *dst= (unsigned int*) out;
        unsigned int invalue, outvalue;

        for (i=0; i<width; i++, src+=3, dst++)
        {
                invalue = src[0];
                outvalue = (invalue<<16);
                invalue = src[1];
                outvalue |= (invalue<<8);
                invalue = src[2];
                outvalue |= (invalue);
                *dst = outvalue | 0xff000000;
        }
    }

This routine is used primarily for large textures (512 KB), so if I can parallelize some of the operations, it may be beneficial to process more pixels in one go. Of course, I'll need to profile. :)
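
For reference, here is a rough sketch of the timing harness I have in mind (512-pixel width and 10K iterations, matching the benchmark described in the comments below; the buffers just hold dummy data):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        enum { W = 512, ITER = 10000 };
        unsigned char *in = calloc(W, 3);            /* zero-filled dummy input */
        unsigned int *out = malloc(W * sizeof *out);
        clock_t t0;
        int i;

        t0 = clock();
        for (i = 0; i < ITER; i++)
            RGB8ToBGRX8(W, in, out);
        printf("%.3f us per call\n",
               1e6 * (double)(clock() - t0) / CLOCKS_PER_SEC / ITER);

        free(in);
        free(out);
        return 0;
    }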

Edit:

My compilation arguments...

    gcc -O2 main.c
Paul R
Rev316
  • Are you using the optimization flag for your compiler (which one?)? The compiler will often do a better job of optimizing code, _without_ introducing incorrectness. Which benchmark data have you collected? – Dana the Sane Aug 25 '11 at 17:17
  • Not an SSE answer, but have you tried unrolling your loop 4 times such that the input always starts on an aligned address? Then you can read the input a machine word at a time rather than bytewise, with specialized shifting-and-masking for each relative position of the source pixel. As Dana mentions, it is worth seeing how well the compiler performs on high optimization levels (inspect the generated assembler code, in addition to benchmarking), but I doubt it will be aggressive enough to unroll the loop _and_ split the entry point according to the alignment of `in` all by itself. – hmakholm left over Monica Aug 25 '11 at 17:29
  • Great questions. It's simply "O2" (NOT O3) with GCC4.6. My benchmark case is a 10K iteration run with 512 as the "width" span. Thanks for the great replies! – Rev316 Aug 25 '11 at 17:54
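
For illustration, here is a rough sketch of the word-at-a-time unrolling suggested in the comment above. It is untested and makes several assumptions: `w` is a multiple of 4, `in` is 4-byte aligned, the target is little-endian, and `unsigned int` is 32 bits.

    /* Sketch of the suggested 4x unrolling with word-at-a-time loads.
     * (A production version would use memcpy loads to avoid aliasing issues.) */
    void RGB8ToBGRX8_words(int w, const void *in, void *out)
    {
        const unsigned int *src = (const unsigned int *) in;
        unsigned int *dst = (unsigned int *) out;
        int i;

        for (i = 0; i < w; i += 4, src += 3, dst += 4) {
            unsigned int w0 = src[0];   /* bytes: Ra Ga Ba Rb (low byte first) */
            unsigned int w1 = src[1];   /* bytes: Gb Bb Rc Gc */
            unsigned int w2 = src[2];   /* bytes: Bc Rd Gd Bd */

            dst[0] = ((w0 & 0xff) << 16) | (w0 & 0xff00)
                   | ((w0 >> 16) & 0xff) | 0xff000000;
            dst[1] = ((w0 >> 8) & 0xff0000) | ((w1 & 0xff) << 8)
                   | ((w1 >> 8) & 0xff) | 0xff000000;
            dst[2] = (w1 & 0xff0000) | ((w1 >> 16) & 0xff00)
                   | (w2 & 0xff) | 0xff000000;
            dst[3] = ((w2 << 8) & 0xff0000) | ((w2 >> 8) & 0xff00)
                   | (w2 >> 24) | 0xff000000;
        }
    }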

4 Answers

11

This is an example of using SSSE3 intrinsics to perform the requested operation. The input and output pointers must be 16-byte aligned, and it operates on a block of 16 pixels at a time.

#include <tmmintrin.h>

/* in and out must be 16-byte aligned */
void rgb_to_bgrx_sse(unsigned w, const void *in, void *out)
{
    const __m128i *in_vec = in;
    __m128i *out_vec = out;

    w /= 16;

    while (w-- > 0) {
        /*             0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
         * in_vec[0]   Ra Ga Ba Rb Gb Bb Rc Gc Bc Rd Gd Bd Re Ge Be Rf
         * in_vec[1]   Gf Bf Rg Gg Bg Rh Gh Bh Ri Gi Bi Rj Gj Bj Rk Gk
         * in_vec[2]   Bk Rl Gl Bl Rm Gm Bm Rn Gn Bn Ro Go Bo Rp Gp Bp
         */
        __m128i in1, in2, in3;
        __m128i out;

        in1 = in_vec[0];

        out = _mm_shuffle_epi8(in1,
            _mm_set_epi8(0xff, 9, 10, 11, 0xff, 6, 7, 8, 0xff, 3, 4, 5, 0xff, 0, 1, 2));
        out = _mm_or_si128(out,
            _mm_set_epi8(0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0));
        out_vec[0] = out;

        in2 = in_vec[1];

        in1 = _mm_and_si128(in1,
            _mm_set_epi8(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0, 0, 0, 0, 0, 0, 0, 0));
        out = _mm_and_si128(in2,
            _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff));
        out = _mm_or_si128(out, in1);
        out = _mm_shuffle_epi8(out,
            _mm_set_epi8(0xff, 5, 6, 7, 0xff, 2, 3, 4, 0xff, 15, 0, 1, 0xff, 12, 13, 14));
        out = _mm_or_si128(out,
            _mm_set_epi8(0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0));
        out_vec[1] = out;

        in3 = in_vec[2];
        in_vec += 3;

        in2 = _mm_and_si128(in2,
            _mm_set_epi8(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0, 0, 0, 0, 0, 0, 0, 0));
        out = _mm_and_si128(in3,
            _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff));
        out = _mm_or_si128(out, in2);
        out = _mm_shuffle_epi8(out,
            _mm_set_epi8(0xff, 1, 2, 3, 0xff, 14, 15, 0, 0xff, 11, 12, 13, 0xff, 8, 9, 10));
        out = _mm_or_si128(out,
            _mm_set_epi8(0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0));
        out_vec[2] = out;

        out = _mm_shuffle_epi8(in3,
            _mm_set_epi8(0xff, 13, 14, 15, 0xff, 10, 11, 12, 0xff, 7, 8, 9, 0xff, 4, 5, 6));
        out = _mm_or_si128(out,
            _mm_set_epi8(0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0));
        out_vec[3] = out;

        out_vec += 4;
    }
}
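
Not part of the original answer, but a possible driver sketch: build with `-mssse3` (or `-march=native`), keep both buffers 16-byte aligned, and make sure the pixel count is a multiple of 16 (handle any tail with the scalar routine). Buffer names and sizes below are only examples.

#include <stdlib.h>

int main(void)
{
    unsigned w = 512 * 512;                 /* pixel count, multiple of 16 */
    unsigned char *rgb, *bgrx;

    /* 16-byte aligned allocations for source (3 B/px) and destination (4 B/px) */
    if (posix_memalign((void **)&rgb,  16, w * 3) ||
        posix_memalign((void **)&bgrx, 16, w * 4))
        return 1;

    /* ... fill rgb with 24-bit pixel data ... */
    rgb_to_bgrx_sse(w, rgb, bgrx);

    free(rgb);
    free(bgrx);
    return 0;
}
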
caf
  • Even gcc8.2 `-O3` doesn't optimize the OP's version into a 4-byte load. ICC and clang `-O3` unroll but still don't do anything better than byte-loads + OR https://godbolt.org/z/Ei9C_d. On a Sandybridge-family CPU, gcc's version will run at best 4 bytes stored per 3 clock cycles, or less if competing for a hyperthread, bottlenecked on the front-end at 4 uops per clock. That's garbage. Hard to imagine a case where this `pshufb` version wouldn't be at least 3 times faster, and easily more depending on memory bandwidth. – Peter Cordes Sep 04 '18 at 01:45
  • Hmm, looks like some missed optimizations, though. Use `palignr` / `_mm_alignr_epi8` to get four 9-byte windows from 3 aligned loads, instead of AND/AND/OR to merge. Or use `movsd` or `punpcklqdq` to merge high/low halves, or combine low halves. Or especially on Haswell and later (1 shuffle per clock), just do four unaligned loads. Nehalem / K10 and later have efficient unaligned loads. (But page splits still suck until Skylake.) – Peter Cordes Sep 04 '18 at 01:48
  • @PeterCordes: Yes, you're right - it's possible of course to [tweak the scalar code to get 4 byte loads](https://godbolt.org/z/E3f7dV) but it still does not look fast. I'm not sure what memory bandwidth I was comparing to, 7 years is a long time. The `palignr` optimisation does look good, I might give that a try. – caf Sep 04 '18 at 06:33
  • Oh, I forgot this was reversing the byte order to BGRA, too, rather than just [SSE2 convert packed RGB to RGBA pixels (add a 4th 0xFF byte after every 3 bytes)](https://stackoverflow.com/q/52157498). Use an endian-reversal function like `__builtin_bswap32(in) | 0xFF000000` to get `mov` + `bswap` + `OR` + `mov`. (But that's still 4 uops total not counting any loop overhead for `pointers +=3*unroll` and `+=4 * unroll`, so we can only approach 1 DWORD store per clock with a huge unroll) Or on Atom/Silvermont (but not Haswell), `movbe` can save a uop. – Peter Cordes Sep 04 '18 at 06:46
  • @PeterCordes: The `palignr` change actually ends up being a slight pessimisation, I am not sure exactly why. https://godbolt.org/z/Y3-Dbh – caf Sep 05 '18 at 07:15
  • Probably 1-per-clock shuffle throughput if you're on Haswell/Skylake. Unaligned loads should be better there. I didn't look too carefully at how gcc was compiling those and/and/or blends in your original, but maybe it was doing better than 3 uops. – Peter Cordes Sep 05 '18 at 07:18
  • @PeterCordes: Yes, I'm testing on SKL. It's using three ops for the AND/AND/OR but is scheduling the first AND very early (before the previous shuffle), and interestingly is reordering the last two shuffles and writes to the output array ( https://godbolt.org/z/LyXvSg ). – caf Sep 05 '18 at 13:48
  • @PeterCordes: It looks like the `pand / pand / por` version performs slightly better than the `palignr` because the latter competes with the `pshufb` for use of execution port 5, whereas the former can distribute across ports 0, 1 and 5. – caf Sep 06 '18 at 04:36
  • Yeah, Haswell and later only has one shuffle unit, on port 5. It's surprising because you didn't have a p5 bottleneck before, so I was expecting 3 uops that could pick any of p015 to be about as bad as 1 uop for p5, either way basically costing an extra cycle of throughput. But apparently that's not how it works out, at least the way gcc compiles it. But anyway, this is exactly why I said that Haswell and later would do better with 4 (potentially) unaligned loads instead of `palignr`. – Peter Cordes Sep 06 '18 at 04:47
  • You can save on masks by using `and` / `andn` with the same mask, instead of `and` with an inverse mask. Or there's `movq xmm,xmm`, but that only helps to zero-extend the low 8 bytes, not for extracting the high 8. (There is an intrinsic for it, though.). Merging the high 8 of one vector with the low 8 of another takes one `movsd xmm,xmm` instruction, but that's a shuffle. AVX2 `vpblendd` is very efficient, and so is SSE4.1 `blendps/pd` (1 uop for any port). But `pblendw` only runs on port 5, so you're stuck with bypass delays or the shuffle port for efficient integer blends until AVX2. – Peter Cordes Sep 06 '18 at 06:27
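
For reference, here is a rough sketch (not from the thread itself) of the "four unaligned loads" variant discussed in these comments: one `pshufb` per output vector, at the cost of reading up to 4 bytes past the final pixel, so pad the input or peel the last block.

#include <tmmintrin.h>

/* Sketch only: w must be a multiple of 16, `out` 16-byte aligned, and the
 * input padded so that reading a few bytes past the last pixel is harmless. */
void rgb_to_bgrx_ssse3_loadu(unsigned w, const void *in, void *out)
{
    const unsigned char *src = in;
    __m128i *dst = out;
    const __m128i shuf = _mm_set_epi8(0xff, 9, 10, 11, 0xff, 6, 7, 8,
                                      0xff, 3, 4, 5, 0xff, 0, 1, 2);
    const __m128i alpha = _mm_set1_epi32((int)0xff000000);
    unsigned i;

    for (w /= 16; w-- > 0; src += 48, dst += 4) {
        for (i = 0; i < 4; i++) {
            /* each load starts 12 bytes later, so the same shuffle works */
            __m128i v = _mm_loadu_si128((const __m128i *)(src + 12 * i));
            dst[i] = _mm_or_si128(_mm_shuffle_epi8(v, shuf), alpha);
        }
    }
}
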
3

I personally found that implementing the following gave me the best result for converting BGR-24 to ARGB-32.

This code runs at about 8.8 ms per image, whereas the 128-bit vectorization code presented above came in at 14.5 ms per image.

void PixelFix(u_int32_t *buff,unsigned char *diskmem)
{
    int i, j;
    int picptr = 0, srcptr = 0;   /* both indices must start at zero */
    int w = 1920;
    int h = 1080;

    for (j=0; j<h; j++) {
        for (i=0; i<w; i++) {
            buff[picptr++]=(diskmem[srcptr]<<24) | (diskmem[srcptr+1]<<16) | diskmem[srcptr+2]<<8 | 0xff;
            srcptr+=3;
        }
    }
}

Previously, I had been using this routine (about 13.2 ms per image). Here, `buff` is an `unsigned char*`.

int i, j, picptr = 0;             /* picptr must start at zero */
for (j=0; j<h; j++) {
    int srcptr = (h-j-1)*w*3;  // remove if you don't want vertical flipping
    for (i=0; i<w; i++) {
        buff[picptr+3]=diskmem[srcptr++]; // b
        buff[picptr+2]=diskmem[srcptr++]; // g
        buff[picptr+1]=diskmem[srcptr++]; // r
        buff[picptr+0]=255;               // a
        picptr+=4;
    }
}

Running on a 2012 Mac Mini, 2.6 GHz Core i7.

zzyzy
  • Further to this, one may wish to look into Apple's recent vImage conversion API..., specifically routines like "vImageConvert_RGB888toARGB8888" for converting from 24-bit RGB to 32-bit ARGB (or BGRA). https://developer.apple.com/library/mac/documentation/Performance/Reference/vImage_conversion/Reference/reference.html#//apple_ref/c/func/vImageConvert_RGB888toARGB8888 – zzyzy Jan 14 '14 at 15:41
  • FWIW I can't replicate that result - testing on i5-6200U (Skylake) with gcc 6.3.0 using `-mssse3 -O3` I get 1.57ms per (1920x1080) image for `PixelFix` and 1.07ms per image for `rgb_to_bgrx_sse`. – caf Sep 06 '18 at 05:55
3

Ummm... using vImageConvert_RGB888toARGB8888 is VERY VERY fast (15X speedup).

The PixelFix code above (≈6 ms per image, now on newer hardware):


  1. 6.373520 ms
  2. 6.383363 ms
  3. 6.413560 ms
  4. 6.278606 ms
  5. 6.293607 ms
  6. 6.368118 ms
  7. 6.338904 ms
  8. 6.389385 ms
  9. 6.365495 ms

Using vImageConvert_RGB888toARGB8888, threaded (on newer hardware):


  1. 0.563649 ms
  2. 0.400387 ms
  3. 0.375198 ms
  4. 0.360898 ms
  5. 0.391278 ms
  6. 0.396797 ms
  7. 0.405534 ms
  8. 0.386495 ms
  9. 0.367621 ms

Need I say more?
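
For anyone wanting to try it, here is roughly what the call looks like. This is a sketch written from memory of the Accelerate/vImage API, so check the declarations in vImage_Conversion.h before relying on it; the function and buffer names are examples.

#include <stdbool.h>
#include <Accelerate/Accelerate.h>

/* Wrap the raw buffers in vImage_Buffer descriptors and let vImage append
 * a constant 0xFF alpha. Row strides here assume tightly packed rows. */
int convert_frame(const unsigned char *rgb, unsigned char *argb,
                  unsigned width, unsigned height)
{
    vImage_Buffer src  = { (void *)rgb, height, width, width * 3 };
    vImage_Buffer dest = { argb,        height, width, width * 4 };

    /* NULL alpha buffer + 0xFF constant alpha, no premultiplication */
    vImage_Error err = vImageConvert_RGB888toARGB8888(&src, NULL, 0xFF,
                                                      &dest, false,
                                                      kvImageNoFlags);
    return err == kvImageNoError ? 0 : -1;
}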

zzyzy
  • One follow-up... using the single-threaded 128-bit vector code "rgb_to_bgrx_sse" above gave results in the 11 ms range for the same-sized I/O buffers. vImage is the clear winner here. – zzyzy Jun 11 '14 at 02:10
2

I don't have a complete understanding of what you're asking for, and I'm eagerly awaiting a proper response to your question. In the meantime, I've come up with an implementation that is roughly 8 to 10% faster on average. I'm running Win7 64-bit, using VS2010, compiling as C++ in a release build with the speed-optimization option.

#pragma pack(push, 1)
    struct RGB {
        unsigned char r, g, b;
    };
    
    struct BGRA {
        unsigned char b, g, r, a;
    };
#pragma pack(pop)

    void RGB8ToBGRX8(int width, const void* in, void* out)
    {
        const RGB* src = (const RGB*)in;
        BGRA* dst = (BGRA*)out; 
        do {        
            dst->r = src->r;
            dst->g = src->g;
            dst->b = src->b;
            dst->a = 0xFF;
            src++;
            dst++;
        } while (--width);
    }

My motivation for using structs is to allow the compiler to advance the src and dst pointers as efficiently as possible. Another motivation is to limit the number of arithmetic operations.

miken32
Jack
  • No worries Jack! If you could clarify which piece you may not be understanding, I can try and elaborate. :) – Rev316 Aug 25 '11 at 18:49
  • What do you mean about using SSE? I think it means instructing the compiler to use specific optimization technique(s), and if that's the case maybe it's not worth tweaking the code by hand at all. You also say you would like to use intrinsics; what do you mean by that? I do have a good grasp of parallelization, however. – Jack Aug 25 '11 at 18:56
  • Oh. I was referring to the vectorization intrinsics of SSE2/3 or SSSE3. Mostly the padding/masking ops, as I've seen elegant solutions for other image conversions. Now, I know GCC 4.x has several compilation flags that might help here, but I'm uncertain which ones, or whether that approach is better. Maybe your expertise would be helpful here. – Rev316 Aug 25 '11 at 19:06
  • Okay I'm closer to understanding. No sorry, I'm not an expert with gcc. – Jack Aug 25 '11 at 19:10