
Can someone recommend a fast way to add 32-bit signed integers with saturation, using Intel intrinsics (AVX, SSE4, ...)?

I looked at the intrinsics guide and found `_mm256_adds_epi16`, but that only adds 16-bit ints. I don't see anything similar for 32 bits. The other calls seem to wrap around.

bitwise
  • as mentioned in [Is there a way to subtract packed unsigned doublewords, saturated, on x86, using MMX/SSE?](https://stackoverflow.com/q/56526082/995714) use `subus(a, b) == max(a, b) - b` with SSE4.1's `pmaxud` – phuclv Jun 10 '19 at 15:02
  • @phuclv: This question is about *signed* saturation, which is a harder problem. That link is useful for unsigned saturation, which looks similar but requires a different implementation. – Peter Cordes Jun 10 '19 at 16:23
  • See [Signed saturated add of 64-bit ints?](//stackoverflow.com/a/56531252) for scalar signed saturation with a GNU C builtin to detect signed overflow efficiently. (Using integer flags; if it did auto-vectorize it would need more instructions.) – Peter Cordes Jun 11 '19 at 19:56

3 Answers

3

A signed overflow will happen if (and only if):

  • the signs of both inputs are the same, and
  • the sign of the sum (computed with wrap-around) differs from that of the inputs

Using C operators: `overflow = ~(a^b) & (a^(a+b))`.
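As a scalar sanity check, here is a minimal C sketch of that test (the helper name is made up; it uses unsigned arithmetic for the wrap-around addition so the intermediate sum is well-defined):

#include <stdint.h>

/* nonzero iff a + b overflows int32_t: sign bit of ~(a^b) & (a^(a+b)) */
static inline int adds32_would_overflow(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;                        /* wrap-around addition */
    return (int32_t)(~(ua ^ ub) & (ua ^ sum)) < 0; /* check the sign bit */
}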

Also, if an overflow happens, the saturated result will have the same sign as both inputs. Using the `int_min = int_max + 1` trick suggested by @PeterCordes, and assuming you have at least SSE4.1 (for `blendvps`), this can be implemented as:

__m128i __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    // normal result (possibly wraps around)
    __m128i res      = _mm_add_epi32( a, b );

    // If result saturates, it has the same sign as both a and b
    __m128i sign_bit = _mm_srli_epi32(a, 31); // shift sign to lowest bit
    __m128i saturated = _mm_add_epi32(int_max, sign_bit);

    // overflow happened if the inputs have the same sign,
    // but the sign of the result differs:
    __m128i sign_xor  = _mm_xor_si128( a, b );
    __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a,res));

    return _mm_castps_si128(_mm_blendv_ps( _mm_castsi128_ps( res ),
                                          _mm_castsi128_ps(saturated),
                                          _mm_castsi128_ps( overflow ) ) );
}

If `blendvps` on your target is as fast as (or faster than) a shift plus an addition (also considering port usage), you can of course just blend `int_min` and `int_max` directly, using the sign bits of `a` as the blend mask; see the sketch below. Also, if you only have SSE2 or SSE3, you can replace the last blend by an arithmetic right shift of `overflow` by 31 bits and a manual blend (using and/andnot/or).
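For illustration, the direct limit blend could look roughly like this (a sketch; the helper name is made up):

// SSE4.1 sketch: pick INT_MAX or INT_MIN per lane by blending on the sign bit of a,
// instead of computing int_max + (a >> 31)
static inline __m128i saturation_limit( __m128i a )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );
    const __m128i int_min = _mm_set1_epi32( 0x80000000 );
    return _mm_castps_si128( _mm_blendv_ps( _mm_castsi128_ps( int_max ),   // a >= 0
                                            _mm_castsi128_ps( int_min ),   // a <  0
                                            _mm_castsi128_ps( a ) ) );     // blendvps keys on the sign bit
}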

And naturally, with AVX2 this can take `__m256i` variables instead of `__m128i` (it should be very easy to rewrite).
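For example, an AVX2 version could look like this (a straightforward sketch of the same approach; the function name is made up, since no such intrinsic exists):

static inline __m256i mm256_adds_epi32( __m256i a, __m256i b )
{
    const __m256i int_max = _mm256_set1_epi32( 0x7FFFFFFF );

    // normal result (possibly wraps around)
    __m256i res       = _mm256_add_epi32( a, b );

    // saturation value with the same sign as a: INT_MAX, or INT_MAX+1 == INT_MIN
    __m256i sign_bit  = _mm256_srli_epi32( a, 31 );
    __m256i saturated = _mm256_add_epi32( int_max, sign_bit );

    // overflow if the inputs have the same sign but the result's sign differs
    __m256i sign_xor  = _mm256_xor_si256( a, b );
    __m256i overflow  = _mm256_andnot_si256( sign_xor, _mm256_xor_si256( a, res ) );

    return _mm256_castps_si256( _mm256_blendv_ps( _mm256_castsi256_ps( res ),
                                                  _mm256_castsi256_ps( saturated ),
                                                  _mm256_castsi256_ps( overflow ) ) );
}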

Addendum: If you know the sign of either `a` or `b` at compile time, you can directly set `saturated` accordingly, and you can save both `_mm_xor_si128` calculations, i.e., `overflow` would be `_mm_andnot_si128(b, res)` for positive `a` and `_mm_andnot_si128(res, b)` for negative `a` (with `res = a+b`).
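A sketch of that specialization (hypothetical helper; assumes `a` is known to be non-negative, so only positive overflow is possible):

static inline __m128i adds_epi32_nonneg_a( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );
    __m128i res      = _mm_add_epi32( a, b );
    // sign bit set iff b is non-negative and the result is negative
    __m128i overflow = _mm_andnot_si128( b, res );
    return _mm_castps_si128( _mm_blendv_ps( _mm_castsi128_ps( res ),
                                            _mm_castsi128_ps( int_max ),  // saturated == INT_MAX since a >= 0
                                            _mm_castsi128_ps( overflow ) ) );
}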

Test case / demo: https://godbolt.org/z/v1bsc85nG

chtz
  • The other trick is that 2's complement `int_min = ~int_max`, so you can XOR with a compare result or with a `_mm_srai_epi32` result (to broadcast the sign bit) to flip max to min based on the sign of one of the inputs. See [Signed saturated add of 64-bit ints?](//stackoverflow.com/a/56531252) for a non-vectorized C version using GNU C `__builtin_saddll_overflow` to get an overflow flag result from an `add` instruction, for a branch or cmov. – Peter Cordes Jun 11 '19 at 13:37
  • `blendvps` is great on Ryzen (single uop per lane even for the VEX encoding). On Intel it's pretty bad (2p5) until Skylake, when the SSE version is 1 uop for any vector ALU (p015), and the VEX version is 2 uops (2p015) regardless of 128 vs. 256-bit. But unlike FP shuffles, FP blends have extra bypass latency between integer vector operations like add, on SnB-family. Still probably worth it vs. broadcasting the sign bit for `vpblendvb` integer byte-blend though. – Peter Cordes Jun 11 '19 at 13:40
  • You might want to use `b` as the input that determines the saturation value. People are more likely to write `x = sadd(x, 123)` than `x = sadd(123, x)`, and a compile-time constant input lets the srli / add optimize away after inlining. – Peter Cordes Jun 11 '19 at 13:47
  • @PeterCordes Regarding the operand order: True, compilers (both clang and gcc) did figure out the value of `saturated`, but noticing that both xor-operations are redundant apparently is too difficult. I added a sentence about using possible compile-time knowledge (also works, if you know only the sign, but not the magnitude of one argument). – chtz Jun 11 '19 at 15:17
  • This doesn't pass my tests… `405785285 + 833772085` shouldn't saturate, but does. – nemequ Jul 15 '21 at 17:38
  • @nemequ: Indeed, https://godbolt.org/z/cTWcK6GP8 has a test case. Inputs that should saturate don't. Reversing first 2 inputs to `_mm_blendv_ps` fixes the problem: https://godbolt.org/z/v1bsc85nG. (I edited the answer to fix the bug.) – Peter Cordes Jul 15 '21 at 19:35
1

Here is a version which works on SSE2, with improvements for SSE4.1 (_mm_blendv_ps), AVX-512VL (_mm_ternarylogic_epi32), and AVX-512DQ (_mm_movepi32_mask, on Peter Cordes' suggestion).

__m128i __mm_adds_epi32( __m128i a, __m128i b) {
  const __m128i int_max = _mm_set1_epi32(INT32_MAX);

  /* normal result (possibly wraps around) */
  const __m128i res = _mm_add_epi32(a, b);

  /* If result saturates, it has the same sign as both a and b */
  const __m128i sign_bit = _mm_srli_epi32(a, 31); /* shift sign to lowest bit */

  #if defined(__AVX512VL__)
    const __m128i overflow = _mm_ternarylogic_epi32(a, b, res, 0x42);
  #else
    const __m128i sign_xor = _mm_xor_si128(a, b);
    const __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a, res));
  #endif

  #if defined(__AVX512DQ__) && defined(__AVX512VL__)
    return _mm_mask_add_epi32(res, _mm_movepi32_mask(overflow), int_max, sign_bit);
  #else
    const __m128i saturated = _mm_add_epi32(int_max, sign_bit);

    #if defined(__SSE4_1__)
      return
        _mm_castps_si128(
          _mm_blendv_ps(
            _mm_castsi128_ps(res),
            _mm_castsi128_ps(saturated),
            _mm_castsi128_ps(overflow)
          )
        );
    #else
      const __m128i overflow_mask = _mm_srai_epi32(overflow, 31);
      return
        _mm_or_si128(
          _mm_and_si128(overflow_mask, saturated),
          _mm_andnot_si128(overflow_mask, res)
        );
    #endif
  #endif
}
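A quick usage/sanity check might look like this (a sketch; it assumes the function above is in scope and adds `<stdio.h>`/`<stdint.h>`):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int32_t out[4];
    /* INT32_MAX + 1 should saturate to INT32_MAX instead of wrapping to INT32_MIN */
    __m128i r = __mm_adds_epi32(_mm_set1_epi32(INT32_MAX), _mm_set1_epi32(1));
    _mm_storeu_si128((__m128i*)out, r);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* expect 2147483647 x4 */
    return 0;
}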

I did this for SIMDe's implementation of the NEON vqaddq_s32 (and the MSA __msa_adds_s_b); if you need other versions you should be able to adapt them from simde/arm/neon/qadd.h. For 128-bit vectors, in addition to what SSE supports (8/16-bit, both signed and unsigned) there are:

  • vqaddq_s32 (think _mm_adds_epi32)
  • vqaddq_s64 (think _mm_adds_epi64)
  • vqaddq_u32 (think _mm_adds_epu32)

vqaddq_u64 (think _mm_adds_epu64) is also present, but currently relies on vector extensions. I could (and probably should) just port the generated code to intrinsics, but TBH I'm not sure how to improve on it, so I haven't bothered.

nemequ
  • With AVX-512, it might be a win to replace the blend with a merge-masked `add` of INT32_MAX + `a>>31`, merging into `r`, using a mask from ternlog -> `_mm_movepi32_mask` ([`VPMOVD2M`](https://www.felixcloutier.com/x86/vpmovb2m:vpmovw2m:vpmovd2m:vpmovq2m)). I guess that just trades a blend for a vec->mask instruction for throughput, but VEX-coded `blendvps` costs 2 uops. As far as critical-path latency, a merge-masking `vpaddd xmm{k}, xmm, xmm` is 1 cycle latency with no extra bypass latency, but `vpmovd2m` has 3c latency. (https://uops.info/). With AVX1 vblendvps, there's ILP for the add. – Peter Cordes Jul 15 '21 at 19:54
  • I'm not sure what you're thinking about with the add; the result of the vpternlogd already has the data necessary for a vpmovd2m. llvm-mca doesn't like substituting the blend with a vpmovd2m+vmovdqa32: https://godbolt.org/z/dq5YhhrGh. It passes my tests, but throughput is a bit worse (and requires AVX-512DQ instead of just AVX-512VL). Or are you thinking about something else? Thanks for fixing the other version; I'll update my code to incorporate that soon since it's a little faster than what I have now. – nemequ Jul 15 '21 at 20:52
  • I mean merge-masking `r = _mm_mask_add_epi32(r, d2m_saturated_mask, _mm_srli_epi32(a, 31), _mm_set1_epi32(INT32_MAX))` to only ever generate the saturated value in the elements that did saturate. (That might need the mask inverted, which could be accomplished by `VPTESTNMD` with a mask if necessary.) – Peter Cordes Jul 15 '21 at 21:14
  • Ah, you're talking about the original version. In what I had posted, both values were used when calculating the mask, so that wouldn't work. You're right, with that change it's slightly faster; I've updated my post with an updated version. – nemequ Jul 16 '21 at 02:43
  • Ah, I hadn't looked at the details of how you computed the overflow condition, and hadn't noticed you just updated `a` instead of defining a new var. Yeah, separating that is good for instruction-level parallelism, as well as saving uops by using the more efficient overflow detection. – Peter Cordes Jul 16 '21 at 02:47
  • Perhaps even better to use `INT_MAX ^ (a>>31)` arithmetic instead of logical right shift; some older CPUs (e.g. Intel before Skylake) can run `pxor` on more ports than `paddd`, so whatever surrounding port pressure can maybe be better balanced out. https://godbolt.org/z/Me8xdsbjj Only downside is when AVX-512 is available, compilers failing to use EVEX `vpxord` with a dword broadcast memory source the way they do for `_mm_add_epi32`. GCC uses qword, clang uses VEX-coded with a full 16-byte memory operand. – Peter Cordes Jul 16 '21 at 02:51
0

This link answers this very question:

https://software.intel.com/en-us/forums/topic/285219

Here's an example implementation:

#include <immintrin.h>
#include <stddef.h> /* size_t */
#include <stdint.h> /* int32_t */

__m128i __inline __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_min = _mm_set1_epi32( 0x80000000 );
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    __m128i res      = _mm_add_epi32( a, b );   // wrap-around sum
    __m128i sign_and = _mm_and_si128( a, b );   // sign bit set if both inputs are negative
    __m128i sign_or  = _mm_or_si128( a, b );    // sign bit clear if both inputs are non-negative

    __m128i min_sat_mask = _mm_andnot_si128( res, sign_and ); // both negative, result non-negative
    __m128i max_sat_mask = _mm_andnot_si128( sign_or, res );  // both non-negative, result negative

    __m128 res_temp = _mm_blendv_ps(_mm_castsi128_ps( res ),
                                    _mm_castsi128_ps( int_min ),
                                    _mm_castsi128_ps( min_sat_mask ) );

    return _mm_castps_si128(_mm_blendv_ps( res_temp,
                                          _mm_castsi128_ps( int_max ),
                                          _mm_castsi128_ps( max_sat_mask ) ) );
}

void addSaturate(int32_t* bufferA, int32_t* bufferB, size_t numSamples)
{
    //
    // Load, add with saturation, and store back into bufferA.
    // bufferA and bufferB are assumed to be 16-byte aligned
    // and numSamples a multiple of 4.
    //
    __m128i* pSrc1 = (__m128i*)bufferA;
    __m128i* pSrc2 = (__m128i*)bufferB;

    for(size_t i = 0; i < numSamples/4; ++i)
    {
        __m128i res = __mm_adds_epi32(*pSrc1, *pSrc2);
        _mm_store_si128(pSrc1, res);

        pSrc1++;
        pSrc2++;
    }
}
bitwise
  • You should probably put a real example here since link-only answers are not recommended (if the page disappears or the URL changes, the answer becomes useless...) – tigrou Apr 08 '15 at 18:45
  • Sure. Will do after lunch ;) – bitwise Apr 08 '15 at 19:11
  • `int_min = int_max + 1`, so this would be more efficient with `tmp = _mm_srli_epi32(a, 31)` and `_mm_add_epi32(max, tmp)` to select the saturation limit. (A non-negative `a` can only overflow at the high end; even the most-negative `b` can't overflow that way.) – Peter Cordes Jun 10 '19 at 16:39