
I'm trying to find the most efficient way of performing 8 bit unsigned compares using SSE (up to SSE 4.2).

The most common case I'm working on is comparing for > 0U, e.g.

_mm_cmpgt_epu8(v, _mm_setzero_si128())                // #1

(which of course can also be considered to be a simple test for non-zero.)

But I'm also somewhat interested in the more general case, e.g.

_mm_cmpgt_epu8(v1, v2)                                // #2

The first case can be implemented with 2 instructions, using various methods, e.g. compare with 0 and then invert the result. The second case typically requires 3 instructions, e.g. subtract 128 from both operands and perform a signed compare. (See this question for various 3 instruction solutions.)

What I'm looking for ideally is a single instruction solution for #1, and a two instruction solution for #2. If neither of these are possible then I'm also interested in thoughts as to which of the various possible 2 or 3 instruction implementations is most efficient on modern Intel CPUs (Sandy Bridge, Ivy Bridge, Haswell).


Best implementations for case #2 so far:

    1. compare equal with unsigned max and invert result:

#define _mm_cmpgt_epu8(v0, v1) \
    _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(v0, v1), v1), \
                     _mm_set1_epi8(-1))

Two arithmetic instructions + one bitwise = 1.33 throughput.

    2. invert sign bits for both arguments (== subtract 128) and use signed compare:

#define _mm_cmpgt_epu8(v0, v1) \
    _mm_cmpgt_epi8(_mm_xor_si128(v0, _mm_set1_epi8(-128)), \
                   _mm_xor_si128(v1, _mm_set1_epi8(-128)))

One arithmetic instruction + two bitwise = 1.16 throughput.


Best implementations for case #1, derived from case #2 implementations above:

    1. compare equal with zero and invert result:

#define _mm_cmpgtz_epu8(v0) \
    _mm_andnot_si128(_mm_cmpeq_epi8(v0, _mm_set1_epi8(0)), \
                     _mm_set1_epi8(-1))

One arithmetic instruction + one bitwise = 0.83 throughput.

    2. invert sign bits (== subtract 128) and compare signed against -128:

#define _mm_cmpgtz_epu8(v0) \
    _mm_cmpgt_epi8(_mm_xor_si128(v0, _mm_set1_epi8(-128)), \
                   _mm_set1_epi8(-128))

One arithmetic instruction + one bitwise = 0.83 throughput.
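
For illustration, here's how the `_mm_cmpgtz_epu8` macro might be dropped into a loop — a minimal sketch only (the loop, the POPCNT-based counting and the function name are my own additions, and it assumes `n` is a multiple of 16):

    #include <emmintrin.h>   // SSE2
    #include <nmmintrin.h>   // POPCNT (SSE4.2)
    #include <stddef.h>
    #include <stdint.h>

    // Count the non-zero bytes in src[0..n), 16 bytes per iteration.
    static size_t count_nonzero_u8(const uint8_t *src, size_t n)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i += 16)
        {
            __m128i v    = _mm_loadu_si128((const __m128i *)(src + i));
            __m128i mask = _mm_cmpgtz_epu8(v);   // 0xFF where the byte is > 0
            count += _mm_popcnt_u32((unsigned)_mm_movemask_epi8(mask));
        }
        return count;
    }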

Paul R
  • Obviously there's "xor instead of subtract", with a slightly better throughput. Other than that I got nothing, and one instruction for #1 seems very optimistic to me – harold Nov 20 '15 at 10:48
  • @harold: yes, the bitwise instructions are 0.33 throughput whereas arithmetic tends to be 0.5, so bitwise is to be preferred where possible. I'm stumped myself, hence the question - it's just a last-ditch hope that there might be something simple that I've missed. – Paul R Nov 20 '15 at 10:52
  • A) Non-strict version of #2 (aka *ge*) can be done with [your own idea](http://stackoverflow.com/a/32945715/556899) in 2 instructions. B) One instruction `_mm_subs_epu8` is enough to reduce problem #2 to problem #1. C) The question can be solved once and for all by using [superoptimization](https://en.wikipedia.org/wiki/Superoptimization). I think it is possible to perform a brute-force search over all sequences of 2 SSE instructions with only 2 registers and a limited set of constants (e.g. the ones produced by `_mm_set1_epi8`). – stgatilov Nov 20 '15 at 12:19
  • Thanks for the comments - not sure what you mean by *"A) Non-strict version of #2 (aka ge) can be done with your own idea in 2 instructions."* though? The super-optimization idea is interesting - I've dabbled with this before for PowerPC scalar bit-twiddling hacks. – Paul R Nov 20 '15 at 13:49
  • Intel needs to stop dangling AVX512 in front of our faces and pulling it away like a toy in front of a cat. – Mysticial Nov 20 '15 at 15:20
  • @Mysticial: yes, `vpcmpub` would be handy right now. Do you know whether Knights Landing is actually shipping now ? It was due Q2 of 2015 IIRC, and there's not much 2015 left... – Paul R Nov 20 '15 at 15:24
  • @PaulR I'm hearing an unconfirmed Q1 2016. Unfortunately for me, I don't expect the host processor version to be within my personal budget. (I don't expect either version to be within my budget.) – Mysticial Nov 20 '15 at 16:08
  • @Mysticial: 2016 certainly sounds plausible, given how quiet Intel has been on the subject of KL lately. I'm hoping to get access to some KL hardware when they do finally start to ship, although it will probably only be for experimental purposes while waiting for Purley. – Paul R Nov 20 '15 at 16:54
  • AMD provided this ages ago with XOP. – Z boson Nov 20 '15 at 21:48
  • Are you targeting recent-Intel with VEX-encoded instructions? If not, then some sequences will take an extra uop for a `mov` instruction. Is this in a loop where getting any needed masks into registers is cheap? (e.g. set1(-1) compiles to `pcmpeq same,same`, if it doesn't turn into a load from memory. But set1(-128) would take multiple instructions to generate, so is more likely to turn into a load.) – Peter Cordes Nov 20 '15 at 22:33
  • @PeterCordes: yes, this is typically in a sequence of 8 or 16 loop iterations, so any masks or constants can be assumed to be "free". CPU is Intel-only, Sandy Bridge onwards. – Paul R Nov 20 '15 at 22:37
  • What do you need to do with the result of the comparison? I assume the mask can't be inverted, because your solution for case#1 takes the extra instruction to invert. And I guess throughput is more important than latency, since you're focusing on that. A 3-insn sequence could have 2 or 3 cycle latency, depending on whether the first two insns depend on each other or are independent. – Peter Cordes Nov 20 '15 at 23:20
  • @PeterCordes: yes, unfortunately the resulting mask needs to retain the same sense and for now at least it has to be 255 or 0. Latency shouldn't be too much of a problem - there are enough other instructions in the loop without dependencies that we should be able to hide 2 - 3 cycles without difficulty. – Paul R Nov 20 '15 at 23:48

3 Answers


There is an example from the Simd Library:

    const __m128i K_INV_ZERO = SIMD_MM_SET1_EPI8(0xFF);//_mm_set1_epi8(-1);

    SIMD_INLINE __m128i NotEqual8u(__m128i a, __m128i b)
    {
        return _mm_andnot_si128(_mm_cmpeq_epi8(a, b), K_INV_ZERO);
    }

    SIMD_INLINE __m128i Greater8u(__m128i a, __m128i b)
    {
        return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_min_epu8(a, b), a), K_INV_ZERO);
    }

    SIMD_INLINE __m128i GreaterOrEqual8u(__m128i a, __m128i b)
    {
        return _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
    }

    SIMD_INLINE __m128i Lesser8u(__m128i a, __m128i b)
    {
        return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(a, b), a), K_INV_ZERO);
    }

    SIMD_INLINE __m128i LesserOrEqual8u(__m128i a, __m128i b)
    {
        return _mm_cmpeq_epi8(_mm_min_epu8(a, b), a);
    }
ErmIg
  • Thanks, yes, I already have a similar implementation with three instructions for `_mm_cmpgt_epu8` (`Greater8u` in the example above). It looks like the best I can do so far is two arithmetic instructions at 0.5 throughput + one bitwise at 0.33 throughput for a total of 1.33. I'm hoping there's a better solution, even if it's still three instructions (e.g. one arithmetic and two bitwise). – Paul R Nov 20 '15 at 14:51

In the spirit of copying code from SIMD libraries, here is how Agner Fog's Vector Class Library (C++) does it:

// vector operator >= : returns true for elements for which a >= b (unsigned)
static inline Vec16cb operator >= (Vec16uc const & a, Vec16uc const & b) {
#ifdef __XOP__  // AMD XOP instruction set
    return _mm_comge_epu8(a,b);
#else  // SSE2 instruction set
    return _mm_cmpeq_epi8(_mm_max_epu8(a,b),a); // a == max(a,b)
#endif
}

// vector operator <= : returns true for elements for which a <= b (unsigned)
static inline Vec16cb operator <= (Vec16uc const & a, Vec16uc const & b) {
    return b >= a;
}

// vector operator > : returns true for elements for which a > b (unsigned)
static inline Vec16cb operator > (Vec16uc const & a, Vec16uc const & b) {
#ifdef __XOP__  // AMD XOP instruction set
    return _mm_comgt_epu8(a,b);
#else  // SSE2 instruction set
    return Vec16cb(Vec16c(~(b >= a)));
#endif
}

// vector operator < : returns true for elements for which a < b (unsigned)
static inline Vec16cb operator < (Vec16uc const & a, Vec16uc const & b) {
    return b > a;
}

// vector operator ~ : bitwise not
static inline Vec16uc operator ~ (Vec16uc const & a) {
    return Vec16uc( ~ Vec128b(a));
}

where bitwise not is defined as

// vector operator ~ : bitwise not
static inline Vec128b operator ~ (Vec128b const & a) {
    return _mm_xor_si128(a, _mm_set1_epi32(-1));
}
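
To tie this back to case #1 in the question, with the VCL types the test is just the overloaded operator — a minimal sketch (it assumes vectorclass.h is on the include path, and the function name is mine):

    #include "vectorclass.h"

    // 0xFF per element where v is non-zero, i.e. v > 0 (unsigned).
    static inline Vec16cb is_nonzero_u8(Vec16uc v) {
        return v > Vec16uc(0);   // uses the unsigned operator > quoted above
    }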
Z boson
  • Thanks - I didn't know AMD had 8 bit unsigned compares in XOP. I'm targeting Intel only at present so unfortunately it doesn't help in this instance, but it's useful to know for future reference. – Paul R Nov 20 '15 at 22:34
  • @PaulR, yeah, XOP fills in many (but not all) of the missing basic integer SIMD operations that the scalar instructions have. It's a real shame Intel did not implement these already (where's my AVX512!!!???). It's almost like they did it out of spite, to deny AMD credit and punish the consumer. It makes no sense to me that they have gotten away with [this](http://www.agner.org/optimize/blog/read.php?i=49) for so long. The irony is that maybe the reason we don't have AVX512 now is that they want to let AMD catch up. But if Zen removes XOP but does not give AVX512 it will be a step back. – Z boson Nov 21 '15 at 07:48
  • Yes, I still miss the good old days of PowerPC and AltiVec - much more orthogonal SIMD, without all the gaps. – Paul R Nov 21 '15 at 09:11
  • @PaulR, IBM is releasing a system with PowerPC and an FPGA. That's quite interesting to me. Some of my colleagues work with FPGAs and they have become quite popular e.g. at CERN and in financial trading. So for HPC it will be a battle between Xeon Phi, GPUs, and FPGAs. – Z boson Nov 21 '15 at 09:31
  • Yes there is also IBM POWER 7, 8, 9 which all have essentially AltiVec SIMD. Very expensive but good for high end server/HPC stuff. – Paul R Nov 21 '15 at 09:35
  • @PaulR, congrats on earning the gold SSE badge (for that matter being the first and only to the silver badge as well)! – Z boson Nov 23 '15 at 10:14
  • Thanks - I'm feeling somewhat unique and special today (which is unusual for a Monday morning)! ;-) – Paul R Nov 23 '15 at 10:35

I had an idea for doing >= in two instructions:

  • subtract with unsigned saturation
  • compare with zero

It doesn't help for >, though.

Also, it's pretty much equivalent to ErmIg's SIMD Library answer (max_epu8(a,b) -> cmpeq with a), but worse because it needs a zeroed register. This works for SSE2, instead of SSE4.1, though. psubusb runs on the same ports as pminub.
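
In intrinsics, the two-instruction >= idea looks something like this (just a sketch — the function name is mine):

    #include <emmintrin.h>   // SSE2

    // a >= b (unsigned bytes): b - a saturates to 0 exactly where b <= a.
    static inline __m128i cmpge_epu8_sse2(__m128i a, __m128i b)
    {
        return _mm_cmpeq_epi8(_mm_subs_epu8(b, a), _mm_setzero_si128());
    }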


A previous version of this answer had the mistaken idea that b-a has the sign bit set iff a>b. But it's actually one bit to the left of that which needs testing: the carry flag / bit (which doesn't exist for packed-integer SIMD).

See the edit history for some ideas about broadcasting the sign bit to the rest of the element with pshufb (negated result) or pblendvb (which may be single-uop on Skylake for the non-VEX version only).

Peter Cordes
  • Thanks Peter - I'm going to plug the various different ideas into a test harness and benchmark on SB/IB/Haswell to see if there is a clear winner. I'm using this stuff in a fairly critical chunk of code so even if we can save a fraction of a cycle it will be a win. – Paul R Nov 21 '15 at 09:14
  • @PaulR: That's probably the best bet, to benchmark the entire loop. Code alignment apparently still sometimes matters even with the uop loop buffer, and execution port pressure will depend on context. – Peter Cordes Nov 21 '15 at 09:23
  • Yes indeed, and there is other code around these comparisons that I haven't included, so the eventual scheduling etc may play out differently than expected in practice. If the benchmark data looks interesting I'll summarise it as an answer. – Paul R Nov 21 '15 at 09:27
  • @PaulR: unless the rest of your code has trouble keeping port0 saturated, I predict both versions will perform essentially identically. The one that only uses `set1(-1)` will have less overhead setting up its constant (since it's one ALU op to generate a vector of all ones, instead of a load.) – Peter Cordes Nov 21 '15 at 09:35
  • plus one for thinking about the problem rather than just posting code. – Z boson Nov 21 '15 at 14:36