Combining just a poseur's and Jitamaro's answers, if you assume that the inputs and outputs are 16-byte aligned and if you process pixels 4 at a time, you can use a combination of shuffles, masks, ands, and ors to store out using aligned stores. The main idea is to generate four intermediate data sets, then or them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile this or try to run it at all.
EDIT2: More detail about the underlying code structure:
With SSE2, you get better performance with 16-byte aligned reads and writes of 16 bytes. Since your 3 byte pixel is only alignable to 16-bytes for every 16 pixels, we batch up 16 pixels at a time using a combination of shuffles and masks and ors of 16 input pixels at a time.
From LSB to MSB, the inputs look like this, ignoring the specific components:
s[0]: 0000 0000 0000 0000
s[1]: 1111 1111 1111 1111
s[2]: 2222 2222 2222 2222
s[3]: 3333 3333 3333 3333
and the ouptuts look like this:
d[0]: 000 000 000 000 111 1
d[1]: 11 111 111 222 222 22
d[2]: 2 222 333 333 333 333
So to generate those outputs, you need to do the following (I will specify the actual transformations later):
d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
d[2]= combine_2(f_1_low(s[2]), f_1_high(s[3]))
Now, what should combine_<x>
look like? If we assume that d
is merely s
compacted together, we can concatenate two s
's with a mask and an or:
combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))
where (1 means select the left pixel, 0 means select the right pixel):
mask(0)= 111 111 111 111 000 0
mask(1)= 11 111 111 000 000 00
mask(2)= 1 111 000 000 000 000
But the actual transformations (f_<x>_low
, f_<x>_high
) are actually not that simple. Since we are reversing and removing bytes from the source pixel, the actual transformation is (for the first destination for brevity):
d[0]=
s[0][0].Blue s[0][0].Green s[0][0].Red
s[0][1].Blue s[0][1].Green s[0][1].Red
s[0][2].Blue s[0][2].Green s[0][2].Red
s[0][3].Blue s[0][3].Green s[0][3].Red
s[1][0].Blue s[1][0].Green s[1][0].Red
s[1][1].Blue
If you translate the above into byte offsets from source to dest, you get:
d[0]=
&s[0]+3 &s[0]+2 &s[0]+1
&s[0]+7 &s[0]+6 &s[0]+5
&s[0]+11 &s[0]+10 &s[0]+9
&s[0]+15 &s[0]+14 &s[0]+13
&s[1]+3 &s[1]+2 &s[1]+1
&s[1]+7
(If you take a look at all the s[0] offsets, they match just a poseur's shuffle mask in reverse order.)
Now, we can generate a shuffle mask to map each source byte to a destination byte (X
means we don't care what that value is):
f_0_low= 3 2 1 7 6 5 11 10 9 15 14 13 X X X X
f_0_high= X X X X X X X X X X X X 3 2 1 7
f_1_low= 6 5 11 10 9 15 14 13 X X X X X X X X
f_1_high= X X X X X X X X 3 2 1 7 6 5 11 10
f_2_low= 9 15 14 13 X X X X X X X X X X X X
f_2_high= X X X X 3 2 1 7 6 5 11 10 9 15 14 13
We can further optimize this by looking the masks we use for each source pixel. If you take a look at the shuffle masks that we use for s[1]:
f_0_high= X X X X X X X X X X X X 3 2 1 7
f_1_low= 6 5 11 10 9 15 14 13 X X X X X X X X
Since the two shuffle masks don't overlap, we can combine them and simply mask off the irrelevant pixels in combine_, which we already did! The following code performs all these optimizations (plus it assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.
EDIT: changed the store to _mm_stream_si128
since you are likely doing a lot of writes and we don't want to necessarily flush the cache. Plus it should be aligned anyway so you get free perf!
#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>
// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
assert((uintptr_t)orig % 16 == 0);
assert(imagesize % 16 == 0);
__m128i shuf0 = _mm_set_epi8(
-128, -128, -128, -128, // top 4 bytes are not used
13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first pixel
__m128i shuf1 = _mm_set_epi8(
7, 1, 2, 3, // top 4 bytes go to the first pixel
-128, -128, -128, -128, // unused
13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to second pixel
__m128i shuf2 = _mm_set_epi8(
10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to second pixel
-128, -128, -128, -128, // unused
13, 14, 15, 9); // bottom 4 go to third pixel
__m128i shuf3 = _mm_set_epi8(
13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to third pixel
-128, -128, -128, -128); // unused
__m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
__m128i mask1 = _mm_set_epi32(0, 0, -1, -1);
__m128i mask2 = _mm_set_epi32(0, 0, 0, -1);
uint8_t *end = orig + imagesize * 4;
for (; orig != end; orig += 64, dest += 48) {
__m128i a= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
__m128i b= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
__m128i c= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
__m128i d= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);
_mm_stream_si128((__m128i *)dest, _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(b, mask0));
_mm_stream_si128((__m128i *)dest + 1, _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(c, mask1));
_mm_stream_si128((__m128i *)dest + 2, _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(d, mask2));
}
}