If speed is your concern, you should avoid going through the floating-point domain and work solely in fixed point (using 8/16-bit arithmetic).
Even though the (double precision) floating point factors are short in base 10, they are not that short in base 2:
0.393 = 3.93000000000000015987211554602E-1 == 0x3FD926E978D4FDF4
0.168 = 1.68000000000000010436096431476E-1 == 0x3FC5810624DD2F1B
etc.
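These patterns are easy to inspect programmatically; a minimal sketch (not part of the original answer) that dumps the raw IEEE-754 bits of a double:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Print the decimal expansion and the raw bit pattern of a double,
// i.e. how the values listed above can be obtained.
static void dump(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    std::printf("%.29E == 0x%016llX\n", d, (unsigned long long)bits);
}

int main() {
    dump(0.393);  // 0x3FD926E978D4FDF4
    dump(0.168);  // 0x3FC5810624DD2F1B
}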
Given that the original integer r,g,b data is limited to the range 0..255, the distant right bits in the factors do not contribute. Thus we may just truncate or round the binary representation as well.
If 7 bits of coefficient precision are enough, we can come up with the coefficient matrix
50 98 24 == 0x32 0x62 0x18
45 88 22 == 0x2d 0x58 0x16
35 68 17 == 0x23 0x44 0x11
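These integers are simply the floating point matrix scaled by 2^7 = 128 and rounded. A minimal sketch of the derivation follows; the full float matrix is an assumption here (only 0.393 and 0.168 are quoted above), but it is the common sepia-tone matrix and it reproduces exactly the integers shown:

#include <cmath>
#include <cstdio>

int main() {
    // Assumed full coefficient matrix (classic sepia transform).
    const double m[3][3] = {
        {0.393, 0.769, 0.189},
        {0.349, 0.686, 0.168},
        {0.272, 0.534, 0.131},
    };
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c)
            std::printf("%3ld ", std::lround(m[r][c] * 128));  // scale to Q7 fixed point and round
        std::printf("\n");
    }
    // prints:
    // 50 98 24
    // 45 88 22
    // 35 68 17
}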
7 bits, because the fastest way to compute small dot products in SSE is _mm_maddubs_epi16, which can multiply uint8_t RGB with 8-bit signed (or 7-bit unsigned) coefficients.
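For reference (this is just the instruction's documented behaviour, not part of the kernel above), each 16-bit output lane of _mm_maddubs_epi16 behaves like this scalar model:

#include <cstdint>

// One output lane: unsigned-byte * signed-byte products of a horizontal pair,
// summed and saturated to int16_t.
int16_t maddubs_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1) {
    int s = a0 * b0 + a1 * b1;
    if (s >  32767) s =  32767;
    if (s < -32768) s = -32768;
    return (int16_t)s;
}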
Then we need to arrange the input and the coefficient matrices properly.
Option 1: interleaved
R0G0B0R1G1B1R2G2B2R3G3B3R4G4B4R5G5B5R6G6B6...
Option 2: planar
R0R1R2R3... + G0G1G2G3... + B0B1B2B3....
Either way, the target is to reshuffle the data to
xmm0 = R0G0R1G1R2G2R3G3R4G4R5G5R6G6R7G7
xmm1 = B0xxB1xxB2xxB3xxB4xxB5xxB6xxB7xx
rg0 = 32 62 32 62 32 62 ... (hex bytes 0x32, 0x62 repeated)
b0  = 18 00 18 00 18 00 ... (hex bytes 0x18, 0x00 repeated)
and analogously rg1/b1 and rg2/b2 from the second and third rows of the coefficient matrix.
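One way to get there from the interleaved layout (Option 1) is a pair of SSSE3 byte shuffles per output register. The sketch below is not from the original answer; it assumes SSSE3 is available, that src (a hypothetical const uint8_t pointer) points at the 8 interleaved pixels, and that reading 8 bytes past the 24 actually consumed is safe:

__m128i in0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));      // bytes 0..15
__m128i in1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 16)); // bytes 16..23 (+ 8 bytes slack)

// xmm0 = R0 G0 R1 G1 ... R7 G7
__m128i xmm0 = _mm_or_si128(
    _mm_shuffle_epi8(in0, _mm_setr_epi8(0,1, 3,4, 6,7, 9,10, 12,13, 15,-1, -1,-1, -1,-1)),
    _mm_shuffle_epi8(in1, _mm_setr_epi8(-1,-1, -1,-1, -1,-1, -1,-1, -1,-1, -1,0, 2,3, 5,6)));

// xmm1 = B0 0 B1 0 ... B7 0
__m128i xmm1 = _mm_or_si128(
    _mm_shuffle_epi8(in0, _mm_setr_epi8(2,-1, 5,-1, 8,-1, 11,-1, 14,-1, -1,-1, -1,-1, -1,-1)),
    _mm_shuffle_epi8(in1, _mm_setr_epi8(-1,-1, -1,-1, -1,-1, -1,-1, -1,-1, 1,-1, 4,-1, 7,-1)));

// Coefficient registers as little-endian 16-bit words: byte pairs (0x32,0x62), (0x18,0x00), etc.
__m128i rg0 = _mm_set1_epi16(0x6232), b0 = _mm_set1_epi16(0x0018);
__m128i rg1 = _mm_set1_epi16(0x582d), b1 = _mm_set1_epi16(0x0016);
__m128i rg2 = _mm_set1_epi16(0x4423), b2 = _mm_set1_epi16(0x0011);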
r_new_0 = _mm_maddubs_epi16(xmm0, rg0);   // 50*R + 98*G per pixel (16-bit)
g_new_0 = _mm_maddubs_epi16(xmm0, rg1);   // 45*R + 88*G
b_new_0 = _mm_maddubs_epi16(xmm0, rg2);   // 35*R + 68*G
r_new_1 = _mm_maddubs_epi16(xmm1, b0);    // 24*B
g_new_1 = _mm_maddubs_epi16(xmm1, b1);    // 22*B
b_new_1 = _mm_maddubs_epi16(xmm1, b2);    // 17*B
r_new_0 = _mm_add_epi16(r_new_0, r_new_1);  // complete 16-bit dot products
g_new_0 = _mm_add_epi16(g_new_0, g_new_1);
b_new_0 = _mm_add_epi16(b_new_0, b_new_1);
Then we need to shift right by 7 and convert to uint8_t. This conversion needs saturation, since the coefficients of each output channel sum to more than 128: for a white pixel the first row gives (50+98+24)*255 >> 7 = 342, which must be clamped to 255.
r_new_0 = _mm_srli_epi16(r_new_0, 7);
r_new_0 = _mm_packus_epi16(r_new_0, r_new_0);
... and same for g_new_0, b_new_0
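If the output may be stored planar (an assumption; the original does not say how the results are written out), the useful low halves can simply be stored with _mm_storel_epi64; dst_r, dst_g, dst_b are hypothetical destination pointers:

_mm_storel_epi64(reinterpret_cast<__m128i*>(dst_r), r_new_0);  // 8 output R bytes
_mm_storel_epi64(reinterpret_cast<__m128i*>(dst_g), g_new_0);  // 8 output G bytes
_mm_storel_epi64(reinterpret_cast<__m128i*>(dst_b), b_new_0);  // 8 output B bytes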
This final step shows a small inefficiency, since one half of the register capacity is lost: consuming 24 bytes of input, we produced only 8+8+8 useful output bytes.
It's probably better anyway to start working with 16+16+16 input bytes, which leads to 12 multiplications, the first of which completes well in time for the additions.
If you absolutely insist on using floats, I would use something like this for interleaved data (as in your case):
auto data_ptr = reinterpret_cast<const uint32_t *>(pixels);
__m128i rgbi = _mm_cvtsi32_si128(*data_ptr); // load 4 bytes: R,G,B of one pixel plus one stray byte
#if SSE4_ENABLED
rgbi = _mm_cvtepu8_epi32(rgbi);
#elif SSSE3_ENABLED
auto const k0123 = _mm_setr_epi8(0,-1,-1,-1, 1,-1,-1,-1, 2,-1,-1,-1, -1,-1,-1,-1);
rgbi = _mm_shuffle_epi8(rgbi, k0123); // zero-extend bytes 0..2 into 32-bit lanes
#else
rgbi = _mm_unpacklo_epi8(rgbi, _mm_setzero_si128());
rgbi = _mm_unpacklo_epi16(rgbi, _mm_setzero_si128());
#endif
// having expanded 3 x uint8_t -> 3 x int32_t + garbage
auto rgb_f = _mm_cvtepi32_ps(rgbi);
// shuffle the rgb to gbr and brg
auto gbr_f = _mm_shuffle_ps(rgb_f, rgb_f, 0b00001001);
auto brg_f = _mm_shuffle_ps(rgb_f, rgb_f, 0b00010010);
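// NOTE: coeffs0..coeffs2 are not defined in the original snippet; an illustrative
// setup (assuming the same matrix as above) puts one wrapped diagonal of the 3x3
// matrix in each register, matching the rgb/gbr/brg permutations:
auto const coeffs0 = _mm_setr_ps(0.393f, 0.686f, 0.131f, 0.0f); // multiplies rgb_f: R->R, G->G, B->B terms
auto const coeffs1 = _mm_setr_ps(0.769f, 0.168f, 0.272f, 0.0f); // multiplies gbr_f: G->R, B->G, R->B terms
auto const coeffs2 = _mm_setr_ps(0.189f, 0.349f, 0.534f, 0.0f); // multiplies brg_f: B->R, R->G, G->B terms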
// now we multiply the permuted rgb vectors with
// permuted coefficients
rgb_f = _mm_mul_ps(rgb_f, coeffs0);
gbr_f = _mm_mul_ps(gbr_f, coeffs1);
brg_f = _mm_mul_ps(brg_f, coeffs2);
// sum up vertically
// for rgb_f[0] = R, rgb_f[1] = G, rgb_f[2] = B
rgb_f = _mm_add_ps(rgb_f, gbr_f);
rgb_f = _mm_add_ps(rgb_f, brg_f);
rgbi = _mm_cvtps_epi32(rgb_f); // back to 32-bit integers (rounds to nearest)
Then convert back to uint8_t while saturating -- SSE4.1 has _mm_packus_epi32, and on SSE2 _mm_packus_epi16 can be used instead, because the expected range of rgbi is just between 0 and 345 and thus fits in int16_t. But using packus_epi16 unfortunately leaves zeros in between the consecutive output lanes, which are difficult to reshuffle without _mm_shuffle_epi8, available only starting from SSSE3.
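For completeness, a possible SSE4.1 ending for the snippet above (a sketch, not from the original answer; dst is a hypothetical output pointer, and only the three meaningful bytes are written):

__m128i packed16 = _mm_packus_epi32(rgbi, rgbi);          // 32-bit -> 16-bit, unsigned saturation
__m128i packed8  = _mm_packus_epi16(packed16, packed16);  // 16-bit -> 8-bit, clamps >255 to 255
uint32_t out = (uint32_t)_mm_cvtsi128_si32(packed8);      // low 4 bytes: R', G', B', garbage
std::memcpy(dst, &out, 3);                                // write only the three RGB bytes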
In any case we see that pre-arranging the data removes the need for horizontal accumulation, but also that we lose some 25% of the computation capacity by not using lane #3, and spend extra time shuffling. The layout of the input should be revised...