5

I've been looking at MMX/SSE and I am wondering. There are instructions for packed, saturated subtraction of unsigned bytes and words, but not doublewords.

Is there a way of doing what I want, or if not, why is there none?

phuclv
z0rberg's
    You can use compare and mask. As to why it doesn't exist as a single instruction, it's anybody's guess. – Jester Jun 10 '19 at 12:15
  • I don't understand. How would I do that? – z0rberg's Jun 10 '19 at 13:02
  • 1
    Are your source values signed or unsigned ? It’s fairly easy if the inputs are unsigned, slightly trickier if they are signed. – Paul R Jun 10 '19 at 13:21
  • 1
    Check out how LLVM auto-vectorizes Rust `u32.saturating_sub()`: https://godbolt.org/z/huP4PX - range-shift to signed with PXOR, then PCMPGTD signed-compare, then AND/ANDN/OR to apply saturation to a PSUBD result. I'm not sure this is optimal; it should just need PAND because the only saturation case for unsigned subtraction is saturation to 0. – Peter Cordes Jun 10 '19 at 13:21
  • 7
    You can use `subus(a, b) == max(a, b) - b` - E: well that's good with SSE4.1, does MMX/SSE mean literally only MMX and SSE? – harold Jun 10 '19 at 13:22
  • 1
    @harold: oh yes, that's very good with SSE4.1 for `pmaxud`. – Peter Cordes Jun 10 '19 at 13:24
  • Wow, these blow up my budget. I was asking my question because I wanted to avoid cmp at all. @PaulR unsigned, as stated in the Q. I guess SSE4 would be fine? I'd have to check. @Peter, @harold thank you for your suggestions, I will look into the performance of these. – z0rberg's Jun 10 '19 at 15:40
  • I don't know which of your comments would be the best answer... – z0rberg's Jun 10 '19 at 15:43
  • @z0rberg's The best solution depends on some context, e.g., do you care about throughput, latency, portability, etc. – chtz Jun 11 '19 at 13:21
  • @chtz Throughput only. I haven't yet had the time to sit down with the suggestions. – z0rberg's Jun 11 '19 at 18:35

1 Answer

3

If you have SSE4.1 available, I don't think you can do better than the pmaxud+psubd approach suggested by @harold. With AVX2, you can of course also use the corresponding 256-bit variants.

__m128i subs_epu32_sse4(__m128i a, __m128i b){
    __m128i mx = _mm_max_epu32(a,b); // max(a,b) == b exactly when a <= b, i.e. when the result saturates to 0
    return _mm_sub_epi32(mx, b);     // max(a,b) - b: either a - b, or 0
}

Without SSE4.1, you need to compare both arguments in some way. Unfortunately, there is no epu32 (unsigned) comparison before AVX-512, but you can simulate one by first adding 0x80000000 to both arguments, which flips only the sign bit and is therefore equivalent to xor-ing it in:

__m128i cmpgt_epu32(__m128i a, __m128i b) {
    const __m128i highest = _mm_set1_epi32(0x80000000);
    // Range-shift both operands to signed, then use the signed compare.
    return _mm_cmpgt_epi32(_mm_xor_si128(a,highest),_mm_xor_si128(b,highest));
}

__m128i subs_epu32(__m128i a, __m128i b){
    __m128i not_saturated = cmpgt_epu32(a,b); // all-ones in lanes where a > b
    // Keep a - b where it did not wrap around; zero (the saturation value) elsewhere.
    return _mm_and_si128(not_saturated, _mm_sub_epi32(a,b));
}

In some cases, it might be better to replace the comparison by some bit-twiddling that computes the borrow of the subtraction in the highest bit and broadcasts it to every bit using an arithmetic shift. This replaces a pcmpgtd and three bit-logic operations (and having to load 0x80000000 at least once) by a psrad and five bit-logic operations:

__m128i subs_epu32_(__m128i a, __m128i b) {
    __m128i r = _mm_sub_epi32(a,b);
    // The sign bit of c is the borrow of a - b, i.e. it is set iff a < b.
    __m128i c = (~a & b) | (r & ~(a^b)); // works with gcc/clang. Replace by corresponding intrinsics, if necessary (note that `andnot` is a single instruction)
    // Broadcast the borrow to all 32 bits and clear r in the saturated lanes.
    return _mm_andnot_si128(_mm_srai_epi32(c,31), r);
}

Godbolt link, also including adds_epu32 variants: https://godbolt.org/z/n4qaW1. Strangely, clang needs more register copies than gcc for the non-SSE4.1 variants. On the other hand, clang finds the pmaxud optimization for the cmpgt_epu32 variant when compiled with SSE4.1: https://godbolt.org/z/3o5KCm
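Not part of the original answer, but when experimenting with these variants it helps to keep a scalar reference model next to the SIMD code; every variant above must agree with this per-lane definition (the function name is mine), and the SSE4.1 version is literally this, vectorized:

```c
#include <stdint.h>

// Scalar per-lane reference: unsigned saturating subtract of 32-bit values.
// subus(a, b) == max(a, b) - b: when a <= b, max(a, b) == b and the result is 0.
static uint32_t subs_u32_ref(uint32_t a, uint32_t b) {
    uint32_t mx = a > b ? a : b; // pmaxud analogue
    return mx - b;               // psubd analogue
}
```

For example, `subs_u32_ref(5, 3)` is 2, while `subs_u32_ref(3, 5)` saturates to 0 instead of wrapping to 0xFFFFFFFE.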

chtz