If speed is your concern, you should avoid going through the floating-point domain and work solely in fixed point (using 8/16-bit arithmetic).
Even though the (double precision) floating point factors are short in base 10, they are not that short in base 2:
0.393 = 3.93000000000000015987211554602E-1 == 0x3FD926E978D4FDF4
0.168 = 1.68000000000000010436096431476E-1 == 0x3FC5810624DD2F1B
etc.
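These patterns are easy to inspect programmatically; a minimal sketch (not part of the original answer) that dumps the raw IEEE-754 bits of a double:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Print the decimal expansion and the raw bit pattern of a double,
// i.e. how the values listed above can be obtained.
static void dump(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    std::printf("%.29E == 0x%016llX\n", d, (unsigned long long)bits);
}

int main() {
    dump(0.393);  // 0x3FD926E978D4FDF4
    dump(0.168);  // 0x3FC5810624DD2F1B
}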
Given that the original integer r,g,b data is limited to the range 0..255, the distant right bits in the factors do not contribute. Thus we may just truncate or round the binary representation as well.
If 7 bits of coefficient precision are enough, we can come up with the coefficient matrix
50 98 24 == 0x32 0x62 0x18
45 88 22 == 0x2d 0x58 0x16
35 68 17 == 0x23 0x44 0x11
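These integers are simply the floating point matrix scaled by 2^7 = 128 and rounded. A minimal sketch of the derivation follows; the full float matrix is an assumption here (only 0.393 and 0.168 are quoted above), but it is the common sepia-tone matrix and it reproduces exactly the integers shown:

#include <cmath>
#include <cstdio>

int main() {
    // Assumed full coefficient matrix (classic sepia transform).
    const double m[3][3] = {
        {0.393, 0.769, 0.189},
        {0.349, 0.686, 0.168},
        {0.272, 0.534, 0.131},
    };
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c)
            std::printf("%3ld ", std::lround(m[r][c] * 128));  // scale to Q7 fixed point and round
        std::printf("\n");
    }
    // prints:
    // 50 98 24
    // 45 88 22
    // 35 68 17
}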
7 bits, because the fastest way to compute small dot products in SSE is _mm_maddubs_epi16, which can multiply uint8_t RGB with 8-bit signed (or 7-bit unsigned) coefficients.
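For reference (this is just the instruction's documented behaviour, not part of the kernel above), each 16-bit output lane of _mm_maddubs_epi16 behaves like this scalar model:

#include <cstdint>

// One output lane: unsigned-byte * signed-byte products of a horizontal pair,
// summed and saturated to int16_t.
int16_t maddubs_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1) {
    int s = a0 * b0 + a1 * b1;
    if (s >  32767) s =  32767;
    if (s < -32768) s = -32768;
    return (int16_t)s;
}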
Then we need to arrange the input and the coefficient matrices properly.
Option 1: interleaved
R0G0B0R1G1B1R2G2B2R3G3B3R4G4B4R5G5B5R6G6B6...
Option 2: planar
R0R1R2R3... + G0G1G2G3... + B0B1B2B3....
Either way, the target is to reshuffle the data to
xmm0 = R0G0R1G1R2G2R3G3R4G4R5G5R6G6R7G7
xmm1 = B0xxB1xxB2xxB3xxB4xxB5xxB6xxB7xx
rg0 = 32 62 32 62 32 62 ... (hex bytes 0x32, 0x62 repeated)
b0  = 18 00 18 00 18 00 ... (hex bytes 0x18, 0x00 repeated)
and analogously rg1/b1 and rg2/b2 from the second and third rows of the coefficient matrix.
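One way to get there from the interleaved layout (Option 1) is a pair of SSSE3 byte shuffles per output register. The sketch below is not from the original answer; it assumes SSSE3 is available, that src (a hypothetical const uint8_t pointer) points at the 8 interleaved pixels, and that reading 8 bytes past the 24 actually consumed is safe:

__m128i in0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));      // bytes 0..15
__m128i in1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 16)); // bytes 16..23 (+ 8 bytes slack)

// xmm0 = R0 G0 R1 G1 ... R7 G7
__m128i xmm0 = _mm_or_si128(
    _mm_shuffle_epi8(in0, _mm_setr_epi8(0,1, 3,4, 6,7, 9,10, 12,13, 15,-1, -1,-1, -1,-1)),
    _mm_shuffle_epi8(in1, _mm_setr_epi8(-1,-1, -1,-1, -1,-1, -1,-1, -1,-1, -1,0, 2,3, 5,6)));

// xmm1 = B0 0 B1 0 ... B7 0
__m128i xmm1 = _mm_or_si128(
    _mm_shuffle_epi8(in0, _mm_setr_epi8(2,-1, 5,-1, 8,-1, 11,-1, 14,-1, -1,-1, -1,-1, -1,-1)),
    _mm_shuffle_epi8(in1, _mm_setr_epi8(-1,-1, -1,-1, -1,-1, -1,-1, -1,-1, 1,-1, 4,-1, 7,-1)));

// Coefficient registers as little-endian 16-bit words: byte pairs (0x32,0x62), (0x18,0x00), etc.
__m128i rg0 = _mm_set1_epi16(0x6232), b0 = _mm_set1_epi16(0x0018);
__m128i rg1 = _mm_set1_epi16(0x582d), b1 = _mm_set1_epi16(0x0016);
__m128i rg2 = _mm_set1_epi16(0x4423), b2 = _mm_set1_epi16(0x0011);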
r_new_0 = _mm_maddubs_epi16(xmm0, rg0);   // 50*R + 98*G per pixel (16-bit)
g_new_0 = _mm_maddubs_epi16(xmm0, rg1);   // 45*R + 88*G
b_new_0 = _mm_maddubs_epi16(xmm0, rg2);   // 35*R + 68*G
r_new_1 = _mm_maddubs_epi16(xmm1, b0);    // 24*B
g_new_1 = _mm_maddubs_epi16(xmm1, b1);    // 22*B
b_new_1 = _mm_maddubs_epi16(xmm1, b2);    // 17*B
r_new_0 = _mm_add_epi16(r_new_0, r_new_1);  // complete 16-bit dot products
g_new_0 = _mm_add_epi16(g_new_0, g_new_1);
b_new_0 = _mm_add_epi16(b_new_0, b_new_1);
Then we need to shift right by 7 and convert to uint8_t. This conversion needs saturation, since the coefficients of each output channel sum to more than 128: for a white pixel the first row gives (50+98+24)*255 >> 7 = 342, which must be clamped to 255.
r_new_0 = _mm_srli_epi16(r_new_0, 7);
r_new_0 = _mm_packus_epi16(r_new_0, r_new_0);
... and same for g_new_0, b_new_0
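If the output may be stored planar (an assumption; the original does not say how the results are written out), the useful low halves can simply be stored with _mm_storel_epi64; dst_r, dst_g, dst_b are hypothetical destination pointers:

_mm_storel_epi64(reinterpret_cast<__m128i*>(dst_r), r_new_0);  // 8 output R bytes
_mm_storel_epi64(reinterpret_cast<__m128i*>(dst_g), g_new_0);  // 8 output G bytes
_mm_storel_epi64(reinterpret_cast<__m128i*>(dst_b), b_new_0);  // 8 output B bytes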
This final step shows a small inefficiency, since one half of the register capacity is lost: consuming 24 bytes of input, we produced only 8+8+8 useful output bytes.
It's probably better anyway to start working with 16+16+16 input bytes, which leads to 12 multiplications, the first of which completes well in time for the additions.
If you absolutely insist on using floats, I would use something like this for interleaved data (as in your case):
auto data_ptr = reinterpret_cast<const uint32_t *>(pixels);
__m128i rgbi = _mm_cvtsi32_si128(*data_ptr); // load 4 bytes: R,G,B of one pixel plus one stray byte
#if SSE4_ENABLED
rgbi = _mm_cvtepu8_epi32(rgbi);
#elif SSSE3_ENABLED
auto const k0123 = _mm_setr_epi8(0,-1,-1,-1, 1,-1,-1,-1, 2,-1,-1,-1, -1,-1,-1,-1);
rgbi = _mm_shuffle_epi8(rgbi, k0123); // zero-extend bytes 0..2 into 32-bit lanes
#else
rgbi = _mm_unpacklo_epi8(rgbi, _mm_setzero_si128());
rgbi = _mm_unpacklo_epi16(rgbi, _mm_setzero_si128());
#endif
// having expanded 3 x uint8_t -> 3 x int32_t + garbage
auto rgb_f = _mm_cvtepi32_ps(rgbi);
// shuffle the rgb to gbr and brg
auto gbr_f = _mm_shuffle_ps(rgb_f, rgb_f, 0b00001001);
auto brg_f = _mm_shuffle_ps(rgb_f, rgb_f, 0b00010010);
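// NOTE: coeffs0..coeffs2 are not defined in the original snippet; an illustrative
// setup (assuming the same matrix as above) puts one wrapped diagonal of the 3x3
// matrix in each register, matching the rgb/gbr/brg permutations:
auto const coeffs0 = _mm_setr_ps(0.393f, 0.686f, 0.131f, 0.0f); // multiplies rgb_f: R->R, G->G, B->B terms
auto const coeffs1 = _mm_setr_ps(0.769f, 0.168f, 0.272f, 0.0f); // multiplies gbr_f: G->R, B->G, R->B terms
auto const coeffs2 = _mm_setr_ps(0.189f, 0.349f, 0.534f, 0.0f); // multiplies brg_f: B->R, R->G, G->B terms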
// now we multiply the permuted rgb vectors with
// permuted coefficients
rgb_f = _mm_mul_ps(rgb_f, coeffs0);
gbr_f = _mm_mul_ps(gbr_f, coeffs1);
brg_f = _mm_mul_ps(brg_f, coeffs2);
// sum up vertically
// for rgb_f[0] = R, rgb_f[1] = G, rgb_f[2] = B
rgb_f = _mm_add_ps(rgb_f, gbr_f);
rgb_f = _mm_add_ps(rgb_f, brg_f);
rgbi = _mm_cvtps_epi32(rgb_f); // back to 32-bit integers (rounds to nearest)
Then convert back to uint8_t while saturating -- SSE4.1 has _mm_packus_epi32, and on SSE2 _mm_packus_epi16 can be used instead, because the expected range of rgbi is just between 0 and 345 and thus fits in int16_t. But using packus_epi16 unfortunately leaves zeros in between the consecutive output lanes, which are difficult to reshuffle without _mm_shuffle_epi8, available only starting from SSSE3.
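For completeness, a possible SSE4.1 ending for the snippet above (a sketch, not from the original answer; dst is a hypothetical output pointer, and only the three meaningful bytes are written):

__m128i packed16 = _mm_packus_epi32(rgbi, rgbi);          // 32-bit -> 16-bit, unsigned saturation
__m128i packed8  = _mm_packus_epi16(packed16, packed16);  // 16-bit -> 8-bit, clamps >255 to 255
uint32_t out = (uint32_t)_mm_cvtsi128_si32(packed8);      // low 4 bytes: R', G', B', garbage
std::memcpy(dst, &out, 3);                                // write only the three RGB bytes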
In any case we see that pre-arranging the data removes the need for horizontal accumulation, but also that we lose some 25% of the computation capacity by not using lane #3, and spend extra time shuffling. The layout of the input should be revised...