Fast method to copy memory with translation - ARGB to BGR

Question

Overview

I have an image buffer that I need to convert to another format. The origin image buffer is four channels, 8 bits per channel, Alpha, Red, Green, and Blue. The destination buffer is three channels, 8 bits per channel, Blue, Green, and Red.

So the brute force method is:

// Assume a 32 x 32 pixel image
#define IMAGESIZE (32*32)

typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

ARGB orig[IMAGESIZE];
BGR  dest[IMAGESIZE];

for(x = 0; x < IMAGESIZE; x++)
{
     dest[x].Red = orig[x].Red;
     dest[x].Green = orig[x].Green;
     dest[x].Blue = orig[x].Blue;
}

However, I need more speed than is provided by a loop and three byte copies. I'm hoping there might be a few tricks I can use to reduce the number of memory reads and writes, given that I'm running on a 32 bit machine.

Additional info

Every image is a multiple of at least 4 pixels. So we could address 16 ARGB bytes and move them into 12 RGB bytes per loop. Perhaps this fact can be used to speed things up, especially as it falls nicely into 32 bit boundaries.

I have access to OpenCL - and while that requires moving the entire buffer into the GPU memory, then moving the result back out, the fact that OpenCL can work on many portions of the image simultaneously, and the fact that large memory block moves are actually quite efficient may make this a worthwhile exploration.

While I've given the example of small buffers above, I really am moving HD video (1920x1080) and sometimes larger, mostly smaller, buffers around, so while a 32x32 situation may be trivial, copying 8.3MB of image data byte by byte is really, really bad.

I'm running on Intel processors (Core 2 and above), so I know there are streaming and data-processing instructions available, but I don't know them. Pointers on where to look for specialized data-handling instructions would be good.

This is going into an OS X application, and I'm using Xcode 4. If assembly is painless and the obvious way to go, I'm fine traveling down that path, but not having done it on this setup before makes me wary of sinking too much time into it.

Pseudo-code is fine - I'm not looking for a complete solution, just the algorithm and an explanation of any trickery that might not be immediately clear.

It should not make sense to use the GPU for this, unless the data is entering the system from there. You should be able to saturate the memory bus with the CPU. — Stephan Eggermont, Aug 16 '11 at 15:46
I haven't played with anyone's code, but AFAICT no one mentioned the possibility of the equivalent of `for(x = 0; x < IMAGESIZE; x++) { dest[x].Red = orig[x].Red; } for(x = 0; x < IMAGESIZE; x++) { dest[x].Green = orig[x].Green; } for(x = 0; x < IMAGESIZE; x++) { dest[x].Blue = orig[x].Blue; }`. In this case, do the simpler loops overtake the bit twiddling? — Mark Hurd, Jan 24 '15 at 13:01

score 56 · Accepted Answer · edited Apr 19 '20 at 02:00

I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32MB of random data and found the averages.

Editor's note: the original inline asm used unsafe constraints, e.g. modifying input-only operands, and not telling the compiler about the side effect on memory pointed-to by pointer inputs in registers. Apparently this worked ok for the benchmark. I fixed the constraints to be properly safe for all callers. This should not affect benchmark numbers, only make sure the surrounding code is safe for all callers. Modern CPUs with higher memory bandwidth should see a bigger speedup for SIMD over 4-byte-at-a-time scalar, but the biggest benefits are when data is hot in cache (work in smaller blocks, or on smaller total sizes).

In 2020, your best bet is to use the portable _mm_loadu_si128 intrinsics version that will compile to an equivalent asm loop: https://gcc.gnu.org/wiki/DontUseInlineAsm.

Also note that all of these over-write 1 (scalar) or 4 (SIMD) bytes past the end of the output, so do the last 3 bytes separately if that's a problem.

--- @PeterCordes

The first version uses a C loop to convert each pixel separately, using the OSSwapInt32 function (which compiles to a bswap instruction with -O3).

void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
    unsigned x;
    for(x = 0; x < imageSize; x++) {
        *((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
        // warning: strict-aliasing UB.  Use memcpy for unaligned loads/stores
    }
}
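
The strict-aliasing warning in swap1 and the note about writing one byte past the end of the output can both be handled in plain C. The following is only a sketch, not the code that was benchmarked: swap1_safe is a hypothetical name, and __builtin_bswap32 (a GCC/clang builtin) stands in for OSSwapInt32.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical variant of swap1 (not the benchmarked code). */
void swap1_safe(const uint8_t *orig, uint8_t *dest, size_t pixels) {
    size_t x = 0;
    if (pixels == 0) return;
    /* All pixels except the last: a 4-byte store is fine here because the
       extra byte lands on the next pixel's slot and gets overwritten. */
    for (; x + 1 < pixels; x++) {
        uint32_t v;
        memcpy(&v, orig + 4 * x, 4);   /* memory order A,R,G,B */
        v = __builtin_bswap32(v);      /* memory order B,G,R,A */
        memcpy(dest + 3 * x, &v, 4);
    }
    /* Last pixel: store exactly 3 bytes so nothing past dest is touched. */
    dest[3 * x + 0] = orig[4 * x + 3];
    dest[3 * x + 1] = orig[4 * x + 2];
    dest[3 * x + 2] = orig[4 * x + 1];
}
```

Compilers generally turn the fixed-size memcpy calls into single loads and stores, so this should cost about the same as the pointer-cast version.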

The second method performs the same operation, but uses an inline assembly loop instead of a C loop.

void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
    asm volatile ( // has to be volatile because the output is a side effect on pointed-to memory
        "0:\n\t"                   // do {
        "movl   (%1),%%eax\n\t"
        "bswapl %%eax\n\t"
        "movl   %%eax,(%0)\n\t"    // copy a dword byte-reversed
        "add    $4,%1\n\t"         // orig += 4 bytes
        "add    $3,%0\n\t"         // dest += 3 bytes
        "dec    %2\n\t"
        "jnz    0b"                // }while(--imageSize)
        : "+r" (dest), "+r" (orig), "+r" (imageSize)
        : // no pure inputs; the asm modifies and dereferences the inputs to use them as read/write outputs.
        : "flags", "eax", "memory"
    );
}

The third version is a modified version of just a poseur's answer. I converted the built-in functions to the GCC equivalents and used the lddqu built-in function so that the input argument doesn't need to be aligned. (Editor's note: only P4 ever benefited from lddqu; it's fine to use movdqu but there's no downside.)

typedef char v16qi __attribute__ ((vector_size (16)));
void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    v16qi mask = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        __builtin_ia32_storedqu(dest,__builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig),mask));
    }
}

Finally, the fourth version is the inline assembly equivalent of the third.

void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    static const int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
    asm volatile (
        "lddqu  %3,%%xmm1\n\t"
        "0:\n\t"
        "lddqu  (%1),%%xmm0\n\t"
        "pshufb %%xmm1,%%xmm0\n\t"
        "movdqu %%xmm0,(%0)\n\t"
        "add    $16,%1\n\t"
        "add    $12,%0\n\t"
        "sub    $4,%2\n\t"
        "jnz    0b"
        : "+r" (dest), "+r" (orig), "+r" (imagesize)
        : "m" (mask)  // whole array as a memory operand.  "x" would get the compiler to load it
        : "flags", "xmm0", "xmm1", "memory"
    );
}

(These all compile fine with GCC9.3, but clang10 doesn't know __builtin_ia32_pshufb128; use _mm_shuffle_epi8.)
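
For reference, here is a sketch of what the intrinsics version recommended in the editor's note might look like. It was not part of the benchmark. swap3_intrin is a hypothetical name; the target attribute is a GCC/clang extension that lets the function build without -mssse3 (the CPU still has to support SSSE3), and the scalar tail is one assumed way of avoiding the 4-byte overwrite mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Hypothetical intrinsics variant of swap3 (not the benchmarked code). */
__attribute__((target("ssse3")))
void swap3_intrin(const uint8_t *orig, uint8_t *dest, size_t pixels) {
    /* Same byte selection as the mask in swap3, listed from byte 0 upward.
       -1 has the high bit set, so pshufb writes zero to those bytes. */
    const __m128i mask = _mm_setr_epi8(3, 2, 1,  7, 6, 5,  11, 10, 9,
                                       15, 14, 13,  -1, -1, -1, -1);
    size_t i = 0;
    /* Each 16-byte store covers 12 useful bytes plus 4 extra, so keep at
       least one more block of room after the current one. */
    for (; i + 8 <= pixels; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(orig + 4 * i));
        _mm_storeu_si128((__m128i *)(dest + 3 * i), _mm_shuffle_epi8(v, mask));
    }
    /* Scalar tail: exact 3-byte stores, nothing written past the end. */
    for (; i < pixels; i++) {
        dest[3 * i + 0] = orig[4 * i + 3];
        dest[3 * i + 1] = orig[4 * i + 2];
        dest[3 * i + 2] = orig[4 * i + 1];
    }
}
```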

On my 2010 MacBook Pro, 2.4 GHz i5 (Westmere/Arrandale), 4GB RAM, these were the average times for each:

Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3:  9.3163 milliseconds
Version 4:  9.3584 milliseconds

As you can see, the compiler is good enough at optimization that you don't need to write assembly. Also, the vector functions were only 1.5 milliseconds faster on 32MB of data, so it won't cause much harm if you want to support the earliest Intel macs, which didn't support SSSE3.

Edit: liori asked for standard deviation information. Unfortunately, I hadn't saved the data points, so I ran another test with 25 iterations.

              Average    | Standard Deviation
Brute force: 18.01956 ms | 1.22980 ms (6.8%)
Version 1:   11.13120 ms | 0.81076 ms (7.3%)
Version 2:   11.27092 ms | 0.66209 ms (5.9%)
Version 3:    9.29184 ms | 0.27851 ms (3.0%)
Version 4:    9.40948 ms | 0.32702 ms (3.5%)

Also, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32MB data set was randomly generated and run through the four functions. The runtime of each function in microseconds is listed below.

Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1:   10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2:   10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3:    9036  9619  9341  8970  9453  9758  9043 10114  9243  9027  9163  9176  9168  9122  9514  9049  9161  9086  9064  9604  9178  9233  9301  9717  9156
Version 4:    9339 10119  9846  9217  9526  9182  9145 10286  9051  9614  9249  9653  9799  9270  9173  9103  9132  9550  9147  9157  9199  9113  9699  9354  9314

I'd be interested in how newer versions of GCC perform. Some of the optimizations introduced between 4.2 and 4.6 are impressive. Also, could you calculate standard deviation for these times? — liori, Jul 24 '11 at 12:22
@liori I added some more data, including standard deviation. Sorry, but I don't have a version newer than 4.2.1 right now. I'll update in the future after I get a newer version. — ughoavgfhw, Jul 24 '11 at 17:11
Possibly a dumb question: for comparison, how fast/slow is the code supplied in the question on your machine? — YXD, Jul 28 '11 at 16:19
@MrE Not a dumb question. I have added a data set for the supplied code, and it took about twice as long as `pshufb`. — ughoavgfhw, Jul 28 '11 at 19:45
@ughoavgfhw has demonstrated that it's mostly memory bound. It would be interesting to experiment with explicit prefetch in one of those loops to see if that helps. — Ben Jackson, Jul 29 '11 at 20:30
Why the third version is faster, if the fourth one is an assembly equivalent? — Camilo Martin, Jan 17 '12 at 17:30
@Camilo The compiler can optimize the C code, but not the inline assembly, so the assembly version could have unnecessary padding, such as saving unused registers. The difference between the two is less than the standard deviation, so it could also be caused by outside influences. — ughoavgfhw, Jan 17 '12 at 18:30
I fixed the unsafe constraints on the inline asm. This kind of thing would have broken code that inlined these helper functions. Just letting you know that it was a somewhat large edit, and I ended up writing a new section near the top to point out that the original was unsafe. IDK if that's useful to get anyone who copy/pasted this to go back and fix their copy, or as a disclaimer that the code in the answer isn't *precisely* what you benchmarked. But maybe as a general warning that inline asm is usually a bad idea and compilers can generate the same asm for you with intrinsics. — Peter Cordes, Apr 19 '20 at 02:04

just a poseur · Answer 2 · 2011-07-24T01:43:56.363

25

The obvious, using pshufb.

#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>

// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert(imagesize % 4 == 0);
    __m128i mask = _mm_set_epi8(-128, -128, -128, -128, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3);
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), mask));
    }
}
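
For readers puzzled by the mask values: pshufb builds each output byte by looking up the input byte whose index is stored in the mask, and writes zero when the mask byte has its high bit set (-128 is 0x80). The scalar model below is only an illustration with hypothetical names; it is not how the instruction is implemented.

```c
#include <stdint.h>

/* Scalar model of pshufb / _mm_shuffle_epi8 on one 16-byte vector:
   out[i] = 0 if mask[i] has its high bit set, else in[mask[i] & 15]. */
void shuffle_bytes_model(const uint8_t in[16], const uint8_t mask[16],
                         uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (mask[i] & 0x80) ? 0 : in[mask[i] & 0x0F];
}

/* The mask from convert(), written in memory order (byte 0 first).
   _mm_set_epi8 takes its arguments from byte 15 down to byte 0, which is
   why the call above lists the same numbers reversed. */
static const uint8_t argb_to_bgr_mask[16] = {
    3, 2, 1,   7, 6, 5,   11, 10, 9,   15, 14, 13,   0x80, 0x80, 0x80, 0x80
};
```

Reading the mask in memory order: output byte 0 takes input byte 3 (Blue of pixel 0), byte 1 takes input byte 2 (Green), byte 2 takes input byte 1 (Red), and input byte 0 (Alpha) is never selected. The same pattern repeats with offsets 4, 8 and 12 for the other three pixels.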

edited Jul 24 '11 at 01:43

answered Jul 24 '11 at 01:22

+1 this is almost sure to be optimal. It may be possible to get the compiler to generate the same or similar code without using non-portable intrinsics though... – R.. GitHub STOP HELPING ICE Jul 24 '11 at 02:55
An explanation of the magic numbers for _mm_set_epi8 would be appreciated. – hiddensunset4 Jul 28 '11 at 03:30
@Daniel, take a look at my answer. – MSN Jul 28 '11 at 03:52

MSN · Answer 3 · 2011-07-26T03:46:38.687

Combining just a poseur's and Jitamaro's answers, if you assume that the inputs and outputs are 16-byte aligned and if you process pixels 4 at a time, you can use a combination of shuffles, masks, ands, and ors to store out using aligned stores. The main idea is to generate four intermediate data sets, then or them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile this or try to run it at all.

EDIT2: More detail about the underlying code structure:

With SSE2, you get better performance with 16-byte aligned reads and writes of 16 bytes. Since a stream of 3-byte pixels only lines up with a 16-byte boundary again every 16 pixels (48 bytes), we batch up 16 input pixels at a time and build the three output vectors from them with a combination of shuffles, masks, and ors.

From LSB to MSB, the inputs look like this, ignoring the specific components:

s[0]: 0000 0000 0000 0000
s[1]: 1111 1111 1111 1111
s[2]: 2222 2222 2222 2222
s[3]: 3333 3333 3333 3333

and the outputs look like this:

d[0]: 000 000 000 000 111 1
d[1]:  11 111 111 222 222 22
d[2]:   2 222 333 333 333 333

So to generate those outputs, you need to do the following (I will specify the actual transformations later):

d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
d[2]= combine_2(f_2_low(s[2]), f_2_high(s[3]))

Now, what should combine_<x> look like? If we assume that d is merely s compacted together, we can concatenate two s's with a mask and an or:

combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))

where (1 means select the left pixel, 0 means select the right pixel):

mask(0)= 111 111 111 111 000 0
mask(1)=  11 111 111 000 000 00
mask(2)=   1 111 000 000 000 000

But the actual transformations (f_<x>_low, f_<x>_high) are actually not that simple. Since we are reversing and removing bytes from the source pixel, the actual transformation is (for the first destination for brevity):

d[0]= 
    s[0][0].Blue s[0][0].Green s[0][0].Red 
    s[0][1].Blue s[0][1].Green s[0][1].Red 
    s[0][2].Blue s[0][2].Green s[0][2].Red 
    s[0][3].Blue s[0][3].Green s[0][3].Red
    s[1][0].Blue s[1][0].Green s[1][0].Red
    s[1][1].Blue

If you translate the above into byte offsets from source to dest, you get:

d[0]=
    &s[0]+3  &s[0]+2  &s[0]+1
    &s[0]+7  &s[0]+6  &s[0]+5
    &s[0]+11 &s[0]+10 &s[0]+9
    &s[0]+15 &s[0]+14 &s[0]+13
    &s[1]+3  &s[1]+2  &s[1]+1
    &s[1]+7

(If you take a look at all the s[0] offsets, they match just a poseur's shuffle mask in reverse order.)

Now, we can generate a shuffle mask to map each source byte to a destination byte (X means we don't care what that value is):

f_0_low=  3 2 1  7 6 5  11 10 9  15 14 13  X X X  X
f_0_high= X X X  X X X   X  X X   X  X  X  3 2 1  7

f_1_low=    6 5  11 10 9  15 14 13  X X X   X X X  X  X
f_1_high=   X X   X  X X   X  X  X  3 2 1   7 6 5  11 10

f_2_low=      9  15 14 13  X  X  X  X X X   X  X  X  X  X  X
f_2_high=     X   X  X  X  3  2  1  7 6 5   11 10 9  15 14 13
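
All the entries in these tables follow from one index formula, which may be easier to check than the tables themselves. The sketch below uses a hypothetical helper, source_of, that is not part of the answer: for byte i of output vector d[k] it reports which input vector s[vec] and which byte offset inside it supplies the value.

```c
#include <stddef.h>

/* Hypothetical helper: where does byte i (0..15) of d[k] (k = 0..2)
   come from? Result: input vector index *vec (0..3), byte offset *off. */
void source_of(size_t k, size_t i, size_t *vec, size_t *off) {
    size_t j = 16 * k + i;            /* position in the 48-byte BGR output */
    size_t pixel = j / 3;             /* pixel this output byte belongs to */
    size_t chan = j % 3;              /* 0 = Blue, 1 = Green, 2 = Red */
    size_t s = 4 * pixel + 3 - chan;  /* position in the 64-byte ARGB input */
    *vec = s / 16;
    *off = s % 16;
}
```

For example, byte 12 of d[0] comes from s[1] at offset 3, which is the first entry of f_0_high, and byte 0 of d[1] comes from s[1] at offset 6, the first entry of f_1_low.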

We can further optimize this by looking at the masks we use for each source pixel. If you take a look at the shuffle masks that we use for s[1]:

f_0_high=  X  X  X  X  X  X  X  X  X  X  X  X  3  2  1  7
f_1_low=   6  5 11 10  9 15 14 13  X  X  X  X  X  X  X  X

Since the two shuffle masks don't overlap, we can combine them and simply mask off the irrelevant pixels in combine_, which we already did! The following code performs all these optimizations (plus it assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.

EDIT: changed the store to _mm_stream_si128 since you are likely doing a lot of writes and we don't want to necessarily flush the cache. Plus it should be aligned anyway so you get free perf!

#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>

// needs:
// orig and dest are 16-byte aligned
// imagesize is a multiple of 16
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert((uintptr_t)dest % 16 == 0); // _mm_stream_si128 needs an aligned address
    assert(imagesize % 16 == 0);

    __m128i shuf0 = _mm_set_epi8(
        -128, -128, -128, -128, // top 4 bytes are not used
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first output vector

    __m128i shuf1 = _mm_set_epi8(
        7, 1, 2, 3, // top 4 bytes go to the first output vector
        -128, -128, -128, -128, // unused
        13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to the second output vector

    __m128i shuf2 = _mm_set_epi8(
        10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to the second output vector
        -128, -128, -128, -128, // unused
        13, 14, 15, 9); // bottom 4 go to the third output vector

    __m128i shuf3 = _mm_set_epi8(
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to the third output vector
        -128, -128, -128, -128); // unused

    __m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
    __m128i mask1 = _mm_set_epi32(0,  0, -1, -1);
    __m128i mask2 = _mm_set_epi32(0,  0,  0, -1);

    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 64, dest += 48) {
        __m128i a= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
        __m128i b= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
        __m128i c= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
        __m128i d= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);

        // _mm_andnot_si128(m, x) computes (~m) & x, so the mask goes first
        _mm_stream_si128((__m128i *)dest, _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(mask0, b)));
        _mm_stream_si128((__m128i *)dest + 1, _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(mask1, c)));
        _mm_stream_si128((__m128i *)dest + 2, _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(mask2, d)));
    }
}

Any chance you can provide the shuffles for BGRA to RGB? I can't wrap my head around how all this works. — Geoffrey, Nov 16 '17 at 10:54

score 11 · Answer 4 · edited Jul 28 '11 at 07:40

I am coming a little late to the party, seeing that the community has already decided for poseur's pshufb answer, but distributing 2000 reputation is so extremely generous that I have to give it a try.

Here's my version without platform-specific intrinsics or machine-specific asm. I have included some cross-platform timing code showing a 4x speedup if you do both the bit-twiddling like me AND activate compiler optimization (register optimization, loop unrolling):

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define UInt8 unsigned char

#define IMAGESIZE (1920*1080) 
int main() {
    time_t  t0, t1;
    int frames;
    int frame; 
    typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
    typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

    ARGB* orig = malloc(IMAGESIZE*sizeof(ARGB));
    if(!orig) {printf("nomem1");}
    BGR* dest = malloc(IMAGESIZE*sizeof(BGR));
    if(!dest) {printf("nomem2");}

    printf("to start original hit a key\n");
    getchar();
    t0 = time(0);
    frames = 1200;
    for(frame = 0; frame<frames; frame++) {
        int x; for(x = 0; x < IMAGESIZE; x++) {
            dest[x].Red = orig[x].Red;
            dest[x].Green = orig[x].Green;
            dest[x].Blue = orig[x].Blue;
        }
    }
    t1 = time(0);
    printf("finished original of %d frames in %ld seconds\n", frames, (long)(t1-t0));

    // on my core 2 subnotebook the original took 16 sec 
    // (8 sec with compiler optimization -O3) so at 60 FPS 
    // (instead of the 1200) this would be faster than realtime 
    // (if you disregard any other rendering you have to do). 
    // However if you either want to do other/more processing 
    // OR want faster than realtime processing for e.g. a video-conversion 
    // program then this would have to be a lot faster still.

    printf("to start alternative hit a key\n");
    getchar();
    t0 = time(0);
    frames = 1200;
    unsigned int* reader;
    unsigned int* end;
    unsigned int* writer;
    unsigned int p0, p1, p2, p3; // your question guarantees 32 bit cpu
    for(frame = 0; frame<frames; frame++) {
        reader = (void*)orig;
        end = reader+IMAGESIZE; // one 32 bit word per ARGB pixel
        writer = (void*)dest;
        while(reader<end) {
            // read 4 ARGB pixels; a little-endian load of the bytes A,R,G,B
            // gives the word A | R<<8 | G<<16 | B<<24
            p0 = reader[0];
            p1 = reader[1];
            p2 = reader[2];
            p3 = reader[3];
            reader += 4;
            // pack them into 3 words holding the 12 bytes
            // B0 G0 R0 B1 | G1 R1 B2 G2 | R2 B3 G3 R3
            writer[0] = (p0>>24) | ((p0>>8)&0xFF00u) | ((p0<<8)&0xFF0000u) | (p1&0xFF000000u);
            writer[1] = ((p1>>16)&0xFFu) | (p1&0xFF00u) | ((p2>>8)&0xFF0000u) | ((p2<<8)&0xFF000000u);
            writer[2] = ((p2>>8)&0xFFu) | ((p3>>16)&0xFF00u) | (p3&0xFF0000u) | ((p3<<16)&0xFF000000u);
            writer += 3;
        }
    }
    t1 = time(0);
    printf("finished alternative of %d frames in %ld seconds\n", frames, (long)(t1-t0));

    // on my core 2 subnotebook this alternative took 10 sec 
    // (4 sec with compiler optimization -O3)

}

The results are these (on my core 2 subnotebook):

F:\>gcc b.c -o b.exe

F:\>b
to start original hit a key
finished original of 1200 frames in 16 seconds
to start alternative hit a key
finished alternative of 1200 frames in 10 seconds

F:\>gcc b.c -O3 -o b.exe

F:\>b
to start original hit a key
finished original of 1200 frames in 8 seconds
to start alternative hit a key
finished alternative of 1200 frames in 4 seconds

btw, the 1200 frames are with 1920*1080 pixel images of course — Bernd Elkemann, Jul 24 '11 at 08:54

Micromega · Answer 5 · 2011-07-24T01:13:33.777

7

You want to use Duff's device: http://en.wikipedia.org/wiki/Duff%27s_device. It also works in JavaScript. This post, however, is a bit funny to read: http://lkml.indiana.edu/hypermail/linux/kernel/0008.2/0171.html. Imagine a Duff's device with 512 Kbytes of moves.

edited Jul 24 '11 at 01:13

answered Jul 24 '11 at 00:38

Duff's Device is just a strange C-specific way of unrolling a loop. It'll take more than that to get really good performance. – Jul 24 '11 at 02:41
My C and Assembly is a bit rusty but unrolling a loop is better than nothing when you must move everything with the CPU. – Micromega Jul 24 '11 at 02:49
@R: I'm not an experienced C programmer. I'm designing web applications. Could you explain, please? What's so funny? – Micromega Jul 28 '11 at 08:18

score 6 · Answer 6 · answered Jul 24 '11 at 05:15

In combination with one of the fast conversion functions here, given access to Core 2s it might be wise to split the translation into threads, each working on its own share (say, a fourth) of the data, as in this pseudocode:

void bulk_bgrFromArgb(byte[] dest, byte[] src, int n)
{
       thread threads[] = {
           create_thread(bgrFromArgb, dest, src, n/4),
           create_thread(bgrFromArgb, dest+n/4, src+n/4, n/4),
           create_thread(bgrFromArgb, dest+n/2, src+n/2, n/4),
           create_thread(bgrFromArgb, dest+3*n/4, src+3*n/4, n/4),
       }
       join_threads(threads);
}
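
One detail the pseudocode glosses over: if n counts pixels, the source pointer advances 4 bytes per pixel while the destination advances only 3, so the two offsets differ for every thread after the first. The sketch below shows just that arithmetic. chunk_t and chunk_for are hypothetical names, and it assumes n is a multiple of 4 so every chunk stays on a 4-pixel boundary.

```c
#include <stddef.h>

/* Hypothetical description of one worker's share of the image. */
typedef struct {
    size_t src_offset;   /* byte offset into the ARGB buffer */
    size_t dest_offset;  /* byte offset into the BGR buffer  */
    size_t pixels;       /* number of pixels this worker converts */
} chunk_t;

/* Split n pixels among `parts` workers in blocks of 4 pixels.
   The last worker also takes whatever does not divide evenly.
   Assumes n % 4 == 0 and parts > 0. */
chunk_t chunk_for(size_t n, size_t parts, size_t index) {
    size_t blocks = n / 4;                 /* 4-pixel blocks in the image */
    size_t per = blocks / parts;           /* blocks per worker */
    size_t first = per * index;            /* first block of this worker */
    size_t count = (index + 1 == parts) ? blocks - first : per;
    chunk_t c = { first * 16, first * 12, count * 4 };
    return c;
}
```

Each thread would then call the conversion function with src + src_offset, dest + dest_offset and its own pixel count. With the SIMD versions above, only the last chunk needs the trailing scratch bytes, since every other chunk's overwrite lands inside the next chunk's region. That is only safe if the neighbouring thread writes those bytes afterwards, so a version with an exact tail store is the simpler choice for threading.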

answered Jul 24 '11 at 05:15

Dave

Really? I would expect that it is memory access that is the bottleneck here, not cpu processing, so using additional cores wouldn't gain you anything? – Thomas Padron-McCarthy Jul 24 '11 at 06:40
Each additional core comes with an accompanying L1 cache, so although memory is the bottleneck, using more cores may buy you some extra cache to help alleviate it. – Dave Jul 24 '11 at 07:19
Only if your threads run on cores where parts of the arrays are already hot in cache. Like if you had worker threads that just wrote separate chunks of `src`, or are later going to do some more work on the parts of `dest` they just wrote, with the right thread matching up with the right chunk, and hopefully still running on the same CPU core. Otherwise this only helps if a single CPU core can't saturate memory bandwidth (which is the case on a big Xeon where per-core bandwidth is low, but not really on a typical modern quad-core desktop). – Peter Cordes Apr 19 '20 at 02:31

Sebi · Answer 7 · 2012-12-10T08:48:01.663

This assembly function should do it. Note that it converts in place and overwrites the original data, so it only suits you if you don't need to keep the old buffer.

The code is for MinGW GCC with Intel assembly syntax; you will have to modify it to suit your compiler/assembler.

extern "C" {
    int convertARGBtoBGR(uint buffer, uint size);
    __asm(
        ".globl _convertARGBtoBGR\n"
        "_convertARGBtoBGR:\n"
        "  push ebp\n"
        "  mov ebp, esp\n"
        "  sub esp, 4\n"
        "  mov esi, [ebp + 8]\n"
        "  mov edi, esi\n"
        "  mov ecx, [ebp + 12]\n"
        "  cld\n"
        "  convertARGBtoBGR_loop:\n"
        "    lodsd          ; load value from [esi] (4byte) to eax, increment esi by 4\n"
        "    bswap eax ; swap eax ( A R G B ) to ( B G R A )\n"
        "    stosd          ; store 4 bytes to [edi], increment  edi by 4\n"
        "    sub edi, 1; move edi 1 back down, next time we will write over A byte\n"
        "    loop convertARGBtoBGR_loop\n"
        "  leave\n"
        "  ret\n"
    );
}

You should call it like so:

convertARGBtoBGR( &buffer, IMAGESIZE );

This function accesses memory only twice per pixel (1 read, 1 write), compared to your brute force method, which needs (at least, assuming the values are kept in registers) 3 reads and 3 writes. The method is the same, but the implementation makes it more efficient.
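
The in-place idea itself does not depend on assembly. A minimal C sketch (argb_to_bgr_in_place is a hypothetical name, and this is the plain byte-wise form, not a tuned one) shows why overwriting the same buffer is safe: the write position 3*i never gets ahead of the read position 4*i.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-place conversion: afterwards the first 3*pixels bytes of
   buf hold BGR data; the rest of the buffer is left over. */
void argb_to_bgr_in_place(uint8_t *buf, size_t pixels) {
    for (size_t i = 0; i < pixels; i++) {
        /* Read the whole pixel before writing: for i == 0 the source and
           destination bytes overlap. */
        uint8_t r = buf[4 * i + 1];
        uint8_t g = buf[4 * i + 2];
        uint8_t b = buf[4 * i + 3];
        buf[3 * i + 0] = b;
        buf[3 * i + 1] = g;
        buf[3 * i + 2] = r;
    }
}
```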

`stosd` is slower than `mov` + `add` (https://uops.info/ and https://agner.org/optimize/) even if you didn't then have to correct the pointer with another `add`. The `loop` instruction is [quite slow on Intel CPUs](https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently), like one per 5 cycle throughput which is the major bottleneck for this loop. Also, this violates the calling convention, destroying the caller's ESI and EDI. Use EDX for one of them. Also, declare the arg as `char *buffer` like a normal person, not `uint`. — Peter Cordes, May 17 '22 at 13:21

ruhalde · Answer 8 · 2011-07-28T03:24:51.977

You can do it in chunks of 4 pixels, moving 32 bits at a time with unsigned long pointers. With 4 32-bit pixels you can construct, by shifting and OR/AND, 3 words representing 4 24-bit pixels, like this:

//col0 col1 col2 col3
//ARGB ARGB ARGB ARGB 32bits reading (4 pixels)
//BGRB GRBG RBGR  32 bits writing (4 pixels)

Shift operations take a single cycle on all modern 32/64-bit processors (barrel shifter), so this is the fastest way of constructing those 3 words for writing; bitwise AND and OR are also blazing fast.

Like this:

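//assuming we have 4 pixels ARGB1 ... ARGB4 loaded as 32-bit words and 3 32-bit words W1, W2 and W3 to write,
// and *dest is an unsigned long pointer (32 bits wide here) for the destination
// (masks shown for a little-endian load, where the word is A | R<<8 | G<<16 | B<<24)
W1 = (ARGB1 >> 24) | ((ARGB1 >> 8) & 0x0000FF00) | ((ARGB1 << 8) & 0x00FF0000) | (ARGB2 & 0xFF000000);
*dest++ = W1;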

and so on.... with next pixels in a loop.

You'll need some adjusting for images that are not a multiple of 4 pixels, but I bet this is the fastest approach of all without using assembler.
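
To make the "and so on" concrete, here is one way the whole loop could look, including the adjustment for a pixel count that is not a multiple of 4. It is a sketch under a stated assumption: a little-endian CPU such as x86, where loading the bytes A,R,G,B gives the word A | R<<8 | G<<16 | B<<24. argb_to_bgr_words is a hypothetical name, and memcpy is used for the loads and stores to stay clear of alignment and aliasing problems.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word-at-a-time conversion; assumes a little-endian CPU. */
void argb_to_bgr_words(const uint8_t *src, uint8_t *dst, size_t pixels) {
    size_t i = 0;
    for (; i + 4 <= pixels; i += 4) {
        uint32_t p[4], w[3];
        memcpy(p, src + 4 * i, 16);            /* 4 ARGB pixels */
        /* W1 = B0 G0 R0 B1 */
        w[0] = (p[0] >> 24) | ((p[0] >> 8) & 0x0000FF00u)
             | ((p[0] << 8) & 0x00FF0000u) | (p[1] & 0xFF000000u);
        /* W2 = G1 R1 B2 G2 */
        w[1] = ((p[1] >> 16) & 0x000000FFu) | (p[1] & 0x0000FF00u)
             | ((p[2] >> 8) & 0x00FF0000u) | ((p[2] << 8) & 0xFF000000u);
        /* W3 = R2 B3 G3 R3 */
        w[2] = ((p[2] >> 8) & 0x000000FFu) | ((p[3] >> 16) & 0x0000FF00u)
             | (p[3] & 0x00FF0000u) | ((p[3] << 16) & 0xFF000000u);
        memcpy(dst + 3 * i, w, 12);            /* 4 BGR pixels */
    }
    for (; i < pixels; i++) {                  /* leftover 1 to 3 pixels */
        dst[3 * i + 0] = src[4 * i + 3];
        dst[3 * i + 1] = src[4 * i + 2];
        dst[3 * i + 2] = src[4 * i + 1];
    }
}
```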

And btw, forget about using structs and indexed access; those are the slowest ways of all for moving data. Just take a look at a disassembly listing of a compiled C++ program and you'll agree with me.

score 3 · Answer 9 · answered Jul 24 '11 at 01:56

typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

Aside from assembly or compiler intrinsics, I might try doing the following, while very carefully verifying the end behavior, as some of it (where unions are concerned) is likely to be compiler implementation dependent:

union uARGB
{
   ARGB argb;
   UInt32 x;
};
union uBGRA
{
   struct 
   {
     BGR bgr;
     UInt8 Alpha;
   } bgra;
   UInt32 x;
};

and then for your code kernel, with whatever loop unrolling is appropriate:

inline void argb2bgr(BGR* pbgr, ARGB* pargb)
{
    uARGB* puargb = (uARGB*)pargb;
    uBGRA ubgra;
    ubgra.x = __byte_reverse_32(puargb->x);
    *pbgr = ubgra.bgra.bgr;
}

where __byte_reverse_32() assumes the existence of a compiler intrinsic that reverses the bytes of a 32-bit word.

To summarize the underlying approach:

view ARGB structure as a 32-bit integer
reverse the 32-bit integer
view the reversed 32-bit integer as a (BGR)A structure
let the compiler copy the (BGR) portion of the (BGR)A structure

I've implemented the alike approach in assembly using low-level string handling instructions such as `lodsd` and `stosd`. Unfortunately that trial seems to be useless -- proposed `pshufb` solution scores higher. — eugene_che, Jul 24 '11 at 02:59

score 3 · Answer 10 · answered Jul 28 '11 at 11:15

Although you can use some tricks on the CPU, this kind of operation can be done fastest on the GPU.

It seems that you use C/C++, so your alternatives for GPU programming (on the Windows platform) may be:

DirectCompute (DirectX 11)
Microsoft Research Project Accelerator
CUDA
"google" GPU programming ...

In short, use the GPU for this kind of array operation to make the calculations faster. GPUs are designed for it.

Novalis

Don't forget that accessing GPU video memory from CPU bus is slower than moving operations done in CPU memory map. To be faster than main processor, one needs to perform all transformations in video RAM. – ruhalde Aug 04 '11 at 23:58

score 3 · Answer 11 · answered Aug 03 '11 at 16:07

I haven't seen anyone showing an example of how to do it on the GPU.

A while ago I wrote something similar to your problem. I received data from a video4linux2 camera in YUV format and wanted to draw it as gray levels on the screen (just the Y component). I also wanted to draw areas that are too dark in blue and oversaturated regions in red.

I started out with the smooth_opengl3.c example from the freeglut distribution.

The data is copied as YUV into the texture and then the following GLSL shader programs are applied. I'm sure GLSL code runs on all macs nowadays and it will be significantly faster than all the CPU approaches.

Note that I have no experience on how you get the data back. In theory glReadPixels should read the data back but I never measured its performance.

OpenCL might be the easier approach, but then I will only start developing for that when I have a notebook that supports it.

(defparameter *vertex-shader*
"void main(){
    gl_Position    = gl_ModelViewProjectionMatrix * gl_Vertex;
    gl_FrontColor  = gl_Color;
    gl_TexCoord[0] = gl_MultiTexCoord0;
}
")

(progn
 (defparameter *fragment-shader*
   "uniform sampler2D textureImage;
void main()
{
  vec4 q=texture2D( textureImage, gl_TexCoord[0].st);
  float v=q.z;
  if(int(gl_FragCoord.x)%2 == 0)
     v=q.x; 
  float x=0; // 1./255.;
  v-=.278431;
  v*=1.7;
  if(v>=(1.0-x))
    gl_FragColor = vec4(255,0,0,255);
  else if (v<=x)
    gl_FragColor = vec4(0,0,255,255);
  else
    gl_FragColor = vec4(v,v,v,255); 
}
"))

*it will be significantly faster than all the CPU approaches.* Not for small buffers, especially ones already hot in L2 cache on a CPU core. But sure, for a lot of large buffers it could be good, especially if it happens while you have the CPU working on something else. If you can efficiently get the data back into the CPU, but GPU -> CPU transfers are generally not super fast. — Peter Cordes, Apr 19 '20 at 02:35
