
My application processes a computationally heavy, near-realtime workload that I need to speed up as much as possible. The software is written in C++ and targets Linux only.

My program grabs a 6.4-megapixel RAW data buffer from a specialist astronomical camera that can deliver 25 fps at 3096 x 2080 px. This stream is then debayered in real time using a high-quality linear interpolation debayering algorithm. I know that HQ linear interpolation debayering is always going to be computationally heavy, but there are other areas of my program that I would like to speed up.

Once the stream has been debayered, I need to convert the RGB buffer (produced by debayering) into an RGBA buffer, because it's my understanding (supported by profiling) that GPUs operate more efficiently on RGBA pixel buffers. However, I'm happy to stand corrected on this.


Initially, I wrote a very simple for loop (below), which, of course, yielded dreadful results.

// both buffers have uint8_t elements
for(int n = 0, m = 0; n < m_Width * m_Height * 4; n+=4, m+=3)
{
     m_display_buffer[n] = in_buffer[m];
     m_display_buffer[n+1] = in_buffer[m+1];
     m_display_buffer[n+2] = in_buffer[m+2];
     m_display_buffer[n+3] = 255;
}

The above code gave me a frame rate of 13 fps. My next experiment was to initialise the display buffer with all elements set to 255 and then use the following code:

uint8_t *dsp = m_display_buffer;
uint8_t *in_8 = (uint8_t*) in_buffer;

for (int n = 0; n < m_Width * m_Height; n++)
{
    *dsp++ = *in_8++;   // R
    *dsp++ = *in_8++;   // G
    *dsp++ = *in_8++;   // B
    dsp++;              // skip the alpha byte, already initialised to 255
}

The above code significantly sped up the loop, now achieving 23.9 fps on an i7-7700 laptop. However, running this code on older machines still gives very disappointing frame rates. I know that older machines struggle with debayering, but profiling clearly shows that converting to an RGBA buffer is causing significant problems.


I have read that it might be possible to use SSE intrinsics to do this much more efficiently; however, I have zero experience with SSE intrinsics.

I've tried many SSE examples found online but cannot get them to work. I would therefore be grateful if somebody experienced with SSE could help me with this problem.

I cannot target anything higher than SSE2, or possibly SSE3, because my software might be run on much older hardware.
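
For reference, this is roughly the kind of `pshufb`-based conversion I have been experimenting with, adapted from examples I found online (a sketch only: `rgb_to_rgba_ssse3` is my own naming, it needs SSSE3 rather than plain SSE2/SSE3, it assumes the pixel count is a multiple of 4, and the final 16-byte load reads 4 bytes past the end of the RGB buffer, so I may well have some of the details wrong):

#include <immintrin.h>   // _mm_shuffle_epi8 requires SSSE3 (compile with -mssse3)
#include <cstdint>
#include <cstddef>

// Expand packed RGB (3 bytes/pixel) to RGBA (4 bytes/pixel) with alpha = 255.
// Processes 4 pixels per iteration: 12 input bytes -> 16 output bytes.
static void rgb_to_rgba_ssse3(const uint8_t* in, uint8_t* out, size_t pixels)
{
    // Shuffle control: copy input bytes 0..11 to output positions 0,1,2, 4,5,6, 8,9,10, 12,13,14.
    // An index of -1 zeroes that output byte; the alpha mask then ORs 255 into it.
    const __m128i shuf  = _mm_setr_epi8(0, 1, 2, -1,
                                        3, 4, 5, -1,
                                        6, 7, 8, -1,
                                        9, 10, 11, -1);
    const __m128i alpha = _mm_set1_epi32(static_cast<int>(0xFF000000u)); // byte 3 of each pixel = 255

    for (size_t i = 0; i < pixels; i += 4)
    {
        __m128i rgb  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in)); // loads 16 bytes, uses 12
        __m128i rgba = _mm_or_si128(_mm_shuffle_epi8(rgb, shuf), alpha);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out), rgba);
        in  += 12;
        out += 16;
    }
}

Because this needs SSSE3, the oldest machines would still require either runtime dispatching or an SSE2-only fallback such as my scalar loop above.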

I would be grateful if somebody could point me in the right direction.

  • Can you use [SSSE3 for `pshufb`](https://en.wikipedia.org/wiki/SSSE3)? Available on Intel Core2 and newer, but AMD only with Bulldozer and newer, not old PhenomII CPUs (a couple years beyond Core2, and few of which might still be in service). You could of course use runtime dispatching. – Peter Cordes Sep 04 '18 at 01:07
  • Is it possible for you to produce output from the debayering process in this format, to avoid another pass over the data? Memory bandwidth has come a *long* way in the past 10 years or so, so an extra pass over the data is not too costly on a Kaby Lake with dual-channel DDR4-2400, but a lot more costly on a Core2 with DDR2-533 or something, or only single-channel memory. (Also, on a modern many-core CPU, single-threaded memory bandwidth is lower than DRAM bandwidth. [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020)) – Peter Cordes Sep 04 '18 at 01:14
  • Speaking of threads, this problem should multi-thread trivially. (You'd probably want to multi-thread the debayering, too.) Or if you're using a GPU, do the RGB -> RGBA *on the GPU* using its memory bandwidth. – Peter Cordes Sep 04 '18 at 01:16
  • You might be able to do something with only SSE2, if you can't use the existing SSSE3 answer (just change the order of `_mm_set_epi8` shuffle control vectors to get RGBA instead of BGRA). I can reopen this if that question isn't close enough of a duplicate, or if anyone wants to post an answer that combines that with debayering, because cache-blocking (or even better, folding more work into one pass) is an important optimization. – Peter Cordes Sep 04 '18 at 01:26
  • @PeterCordes Thanks for your comments and sorry for the delay in replying. I have implemented the SSE example from the "duplicate" post but it does not speed up my execution notably. – Amanda Sep 08 '18 at 22:08
  • @PeterCordes I cannot add the alpha channel during the debayering routines because I need to save an RGB image file, not RGBA, or various post-processing astronomical software will not be compatible with my software. – Amanda Sep 08 '18 at 22:09
  • I did try running these routines in a separate thread but as I need to synchronise with the data coming off the camera I don't get any benefit. If I don't mutex lock the buffers then the new frame from the camera is replaced by the one currently being debayered which, of course, causes awful problems. If I mutex lock, then there is no speed gain because one is waiting for the other. – Amanda Sep 08 '18 at 22:10
  • With careful locking I might be able to add the alpha channel in a separate thread, but definitely not the debayering unless I add much more complexity by using multiple buffers (I've sketched the kind of double-buffered scheme I have in mind below these comments), and multiple buffers of at least 6 megapixels each increase memory usage.
  • More experimentation is required. Once again, thanks. – Amanda Sep 08 '18 at 22:11
  • As I commented on the linked duplicate, there are some potential optimizations, and unaligned loads would probably be better on newer hardware (Nehalem and later) vs. doing more ALU merging between aligned loads. Maybe search for other manually-vectorized implementations. An AVX2 implementation on Haswell / Skylake *should* come close to keeping up with L1d cache, if implemented with 32-byte unaligned loads that get 9 bytes you want in the low lane and 9 bytes you want in the high lane, setting up for a `_mm256_shuffle_epi8()` in-lane shuffle. – Peter Cordes Sep 08 '18 at 22:51
  • Also consider writing *two* outputs while debayering: RGB and RGBA. This could overlap the memory-bound RGB -> RGBA with the (presumably) ALU-bound debayering. If debayering needs to access "nearby" data non-sequentially, maybe use NT stores for the RGBA output stream so it doesn't pollute cache. (If you can write it directly to video RAM, that's ideal.) Or if this is just to speed up GPU processing, like I said, your best bet might be to use the GPU for the format conversion. – Peter Cordes Sep 08 '18 at 22:58
  • Depending on where your bottlenecks are, processing the last frame in one thread while reading the new frame from the camera into another buffer in another thread sounds like a good idea. (Each thread has its own buffer for cache/NUMA locality, and they alternate between processing or receiving. Depending on the camera library, that might work well.) If you don't do a one-pass debayer => RGB & RGBA, cache-blocking will help a lot so you're re-reading RGB data that's still hot in L1d or L2, instead of not re-reading any RGB data until the whole image is debayered and it's mostly evicted. – Peter Cordes Sep 08 '18 at 23:01
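
Edit: as suggested in the comments (and mentioned in my comment above), this is roughly the double-buffered capture/process scheme I am considering (a sketch only; `grab_frame_into` and `process_frame` are hypothetical stand-ins for my real camera grab and debayer/convert calls):

#include <array>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Two frame slots: the capture thread fills one while the processing thread
// works on the other, so neither thread waits on a single shared buffer.
struct FrameSlot
{
    std::vector<uint8_t> raw;   // RAW frame from the camera
    bool ready = false;         // true = waiting to be processed, false = free for capture
};

static std::array<FrameSlot, 2> g_slots;
static std::mutex               g_mtx;
static std::condition_variable  g_cv;
static bool                     g_stop = false;

// Hypothetical stand-ins for the real camera SDK call and the debayer/RGBA pipeline.
static void grab_frame_into(std::vector<uint8_t>& raw)      { (void)raw; /* camera grab */ }
static void process_frame(const std::vector<uint8_t>& raw)  { (void)raw; /* debayer + RGB -> RGBA */ }

static void capture_thread()
{
    for (int slot = 0; ; slot ^= 1)                      // alternate between the two slots
    {
        {
            std::unique_lock<std::mutex> lk(g_mtx);      // wait until this slot has been consumed
            g_cv.wait(lk, [&] { return !g_slots[slot].ready || g_stop; });
            if (g_stop) return;
        }
        grab_frame_into(g_slots[slot].raw);              // fill it while the other slot is processed
        {
            std::lock_guard<std::mutex> lk(g_mtx);
            g_slots[slot].ready = true;                  // only the flag is set under the lock
        }
        g_cv.notify_all();
    }
}

static void process_thread()
{
    for (int slot = 0; ; slot ^= 1)
    {
        {
            std::unique_lock<std::mutex> lk(g_mtx);      // wait for a filled slot
            g_cv.wait(lk, [&] { return g_slots[slot].ready || g_stop; });
            if (g_stop) return;
        }
        process_frame(g_slots[slot].raw);                // heavy work happens outside the lock
        {
            std::lock_guard<std::mutex> lk(g_mtx);
            g_slots[slot].ready = false;                 // hand the slot back to the capture thread
        }
        g_cv.notify_all();
    }
}

int main()
{
    std::thread cap(capture_thread), proc(process_thread);
    // ... run until shutdown; then set g_stop under the lock and notify:
    // { std::lock_guard<std::mutex> lk(g_mtx); g_stop = true; } g_cv.notify_all();
    cap.join();
    proc.join();
}

The intent is that only the per-slot `ready` flag is ever touched under the mutex; the camera copy and the heavy processing both happen outside the lock, on buffers the two threads never use at the same time.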

0 Answers