2

Reading from the official Intel C++ Intrinsic Reference,

SSE 2 has the following command

__m128i _mm_madd_epi16(__m128i a, __m128i b)

Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Adds the signed 32-bit integer results pairwise and packs the 4 signed 32-bit integer results.

while SSE 3 has

__m128i _mm_maddubs_epi16 (__m128i a, __m128i b)

Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed words.

Since Im working with 8bit pixels and I must only use SSE 2(old architecture is the target) I need an 8bit madd instruction. How would I proceed with that?

adkalkan
  • 69
  • 1
  • 7

1 Answers1

2

Hope this works - I don't have a compiler here. But even if I missed something, you should get the overall idea.

EDIT: Thanks to @Peter Cordes for pointing out that _mm_setzero_si128 should better be used directly.

inline __m128i _mm_madd_epi8_SSE2(const __m128i & a, const __m128i & b)
{
    // a = 0x00 0x01 0xFE 0x04 ...
    // b = 0x00 0x02 0x80 0x84 ...

    // To extend signed 8-bit value, MSB has to be set to 0xFF
    __m128i sign_mask_a  = _mm_cmplt_epi8(a, _mm_setzero_si128());
    __m128i sign_mask_b  = _mm_cmplt_epi8(b, _mm_setzero_si128());

    // sign_mask_a = 0x00 0x00 0xFF 0x00 ...
    // sign_mask_b = 0x00 0x00 0xFF 0xFF ...

    // Unpack positives with 0x00, negatives with 0xFF
    __m128i a_epi16_l    = _mm_unpacklo_epi8(a, sign_mask_a);
    __m128i a_epi16_h    = _mm_unpackhi_epi8(a, sign_mask_a);
    __m128i b_epi16_l    = _mm_unpacklo_epi8(b, sign_mask_b);
    __m128i b_epi16_h    = _mm_unpackhi_epi8(b, sign_mask_b);

    // Here - valid 16-bit signed integers corresponding to the 8-bit input
    // a_epi16_l = 0x00 0x00 0x01 0x00 0xFE 0xFF 0x04 0x00 ... 

    // Get the a[i] * b[i] + a[i+1] * b[i+1] for both low and high parts
    __m128i madd_epi32_l = _mm_madd_epi16(a_epi16_l, b_epi16_l);
    __m128i madd_epi32_h = _mm_madd_epi16(a_epi16_h, b_epi16_h);

    // Now go back from 32-bit values to 16-bit values & signed saturate
    return _mm_packs_epi16(madd_epi32_l, madd_epi32_h);
}
kikobyte
  • 334
  • 1
  • 10
  • For integer compares, there's only `cmpeq` and `cmpgt` (until AVX512 gives us an integer compare instruction that takes an immediate byte to choose the predicate, like `cmpps`). We can work around it with `_mm_cmpgt_epi8(_mm_setzero_si128(), a)` to produce the upper halves for sign-extension. – Peter Cordes Jul 06 '16 at 15:27
  • 1
    Also, don't use `static ... = _mm_setzero_si128();` For gcc, `_mm_setzero` isn't really a compile-time constant, so you'll get code that zeros some memory at run-time. Plus, you don't want to load all-zeros from memory anyway; `pxor xmm1,xmm1` is cheaper than a load. Just use `_mm_setzero_si128()` directly, or assign it to a non-static `const __m128i` local variable. Same for `_mm_set_epi8(some data)`: just let the compiler worry about managing constants. – Peter Cordes Jul 06 '16 at 15:33
  • Thanks for making me look into the `_mm_setzero_si128` instruction. Indeed, it's a `pxor` with latency 1 and throughput 0.33. Cannot agree on `_mm_cmplt_epi8` [availability](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmplt_epi8&techs=SSE2) though: Although effectively both `_mm_cmplt_epi8` and `_mm_cmpgt_epi8` indeed resort to the same `pcmpgtb` instruction, I don't see any reason why you can't use either of those. – kikobyte Jul 07 '16 at 08:22
  • 1
    OP says he is working with "8 bit pixels", which would presumably be unsigned. Hence you should just be able to do an unsigned unpack (i.e. just interleave with 0 in the high byte). – Paul R Jul 07 '16 at 08:33
  • 2
    For RGB color model that would be true, but for Y'CbCr there are two schemes for energy mapping: Unsigned Y´, Two's Complement Cb, Cr and Unsigned Y´, Offset Binary Cb, Cr, which use signed and unsigned data types correspondingly. Since OP mentioned only `epi16` routines in his post, I assumed the `madd` he needs shall be `epi8`, not `epu8`. – kikobyte Jul 07 '16 at 09:04
  • @kikobyte: oh, I never noticed there were `cmplt` intrinsics. There's nothing wrong with it, I just thought it was an error since neither of us had tried compiling it :P. I usually think about what asm I want the compiler to generate, and work backwards from that to intrinsics, so I normally look at the asm instruction reference manual, not the intrinsics guide. – Peter Cordes Jul 07 '16 at 14:27
  • 2
    And re: xor-zeroing: `pxor same,same` is a [zeroing idiom recognized by the CPU](http://stackoverflow.com/questions/33666617/which-is-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and/33668295#33668295), so it doesn't really have a meaningful latency: it's always the start of a new dependency chain. And on Intel SnB-family CPUs, it doesn't even take an execution unit, so it has one per 0.25 cycle throughput. (It's handled in the register-rename stage, by renaming to a physical zeroed register). – Peter Cordes Jul 07 '16 at 14:30
  • @Peter Cordes Actually, complier should know that zeroing a register with pxor is much easier than loading it. It is not necessary to write it into the same line. – FERcsI Feb 28 '21 at 22:22
  • Yes, hopefully a compiler would optimize away the `static const __m128i zero = _mm_setzero_si128();`, but it's easier and better not to tempt the compiler to do it wrong. It's not bad to define a `const __m128i zero= ...`, but the problem is making it `static`. You *don't* want static, that implies static storage. – Peter Cordes Mar 01 '21 at 01:11