12

If I want to do a bitwise equality test between two __m128i variables, am I required to use an SSE instruction or can I use ==? If == cannot be used, which SSE instruction should I use?

jww
jaynp

3 Answers

14

Although using _mm_movemask_epi8 is one solution, if you have a processor with SSE4.1 I think a better solution is to use an instruction which sets the zero or carry flag in the FLAGS register. This saves a test or cmp instruction.

For example:

if(_mm_test_all_ones(_mm_cmpeq_epi8(v1,v2))) {
    //v1 == v2
}

Edit: as Paul R pointed out, _mm_test_all_ones generates two instructions: pcmpeqd and ptest. Together with _mm_cmpeq_epi8 that makes three instructions in total. Here's a better solution which uses only two instructions:

__m128i neq = _mm_xor_si128(v1,v2);
if(_mm_test_all_zeros(neq,neq)) {
    //v1 == v2
}

This generates

pxor    %xmm1, %xmm0
ptest   %xmm0, %xmm0
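
As a minimal sketch, the PXOR/PTEST test could be wrapped in a function like the one below. The names `equal128_sse41` and `check_equal128_sse41` are hypothetical, and the `target("sse4.1")` attribute is a GCC/Clang extension assumed here so that the file should build without `-msse4.1`; other compilers would need a different mechanism.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1: _mm_testz_si128 */

/* Hypothetical helper: returns 1 if all 128 bits of a and b match, else 0.
   The target attribute (GCC/Clang extension, an assumption of this sketch)
   enables SSE4.1 for this one function only. */
__attribute__((target("sse4.1")))
static int equal128_sse41(__m128i a, __m128i b)
{
    __m128i diff = _mm_xor_si128(a, b);   /* PXOR: all zero iff a == b */
    return _mm_testz_si128(diff, diff);   /* PTEST: 1 iff (diff & diff) == 0 */
}

/* Small self-check showing the expected results. */
static void check_equal128_sse41(void)
{
    __m128i a = _mm_set_epi32(1, 2, 3, 4);
    __m128i b = _mm_set_epi32(1, 2, 3, 4);
    __m128i c = _mm_set_epi32(1, 2, 3, 5);
    assert(equal128_sse41(a, b));    /* identical bits */
    assert(!equal128_sse41(a, c));   /* one lane differs */
}
```

_mm_test_all_zeros(neq, neq) and _mm_testz_si128(neq, neq) map to the same PTEST instruction, so either spelling should work here.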
Z boson
  • Note that `_mm_test_all_ones` is a macro which generates two instructions: `_mm_cmpeq_epi32` and `_mm_testc_si128`, so you have a total of three SSE instructions in your solution. It would be interesting to benchmark this against the "old skool" implementation with `_mm_movemask_epi8` though. – Paul R Nov 12 '14 at 13:30
  • @PaulR, good point, I have not found a way to get it down to two instructions yet. I feel like it should be possible now. Your solution is basically pcmpeqb, pmovmskb, test. I think it should be possible to do pcmpxxx, ptest. – Z boson Nov 12 '14 at 13:48
  • Yes, it ought to be possible - it seems that there is a fundamental flaw in the design of `_mm_testX_si128` (`PTEST`) though, in that you can't easily use it to test for all 1s, so you always need an extra instruction to invert all the bits at some point. – Paul R Nov 12 '14 at 13:52
  • @PaulR, yeah exactly. AVX512 has a cmpneq instruction. That would solve it. I can get it to `__m128i eq = _mm_cmpeq_epi8(v1,v2); eq = _mm_xor_si128(eq,_mm_set1_epi32(-1)); _mm_test_all_zeros(eq,eq);`. That generates four instructions but `_mm_set1_epi32(-1)` could be precomputed (it generates `pcmpeqd xmm1, xmm1` anyway) so this could be seen as three instructions. – Z boson Nov 12 '14 at 13:56
  • Maybe we should put up a question with a big bounty to see if anyone can come up with a better solution, i.e. two instructions resulting in a flag condition? – Paul R Nov 12 '14 at 13:59
  • @PaulR, I found the solution with only two instructions. See my updated answer. – Z boson Nov 12 '14 at 14:51
  • Oh yes - of course! I was even thinking about using XOR yesterday but had written it off thinking it was just the same as cmpeq and so didn't buy us anything. But of course the result of XOR is *inverted* compared to cmpeq, so that solves the inversion problem. Well done! (Saved me some bounty too, as I hadn't got round to asking the question yet!) – Paul R Nov 12 '14 at 15:03
  • P.S. I've added an intrinsic version of your SSE 4.1 PXOR/PTEST solution to my answer for the sake of completeness, giving you credit of course - I hope that's OK. – Paul R Nov 12 '14 at 15:12
  • @PaulR, thanks, I kept thinking I wanted bitwise equal which is `~(a^b)` before I realized in a boring meeting that I only needed `a^b` (bitwise not equal). I don't mind at all that you used my solution in your answer. – Z boson Nov 12 '14 at 18:57
  • 2
    A word of warning: the trick does not work with floating-point vectors, due to the possibility of +/-0. – user1095108 Feb 29 '16 at 09:19
  • @user1095108: Good point. Also `NaN == NaN` is supposed to be false. – Nemo Mar 13 '16 at 22:46
  • 1
    `ptest` is not actually faster on compare results if you're branching: it's 2 uops plus 1 for the jcc. `pmovmskb` is 1, and `cmp/jcc` macro-fuses to 1. But on CPUs where `pxor` can run on more ports than `pcmpeqb/w/d/q`, this is interesting. – Peter Cordes Feb 27 '19 at 05:02
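
The floating-point caveat raised in the comments above can be sketched as follows. This is an illustration only: the helper names `bits_equal_ps`, `values_equal_ps` and `check_float_caveat` are hypothetical, and the SSE2 compare/movemask form of the bitwise test is used for portability (the PXOR/PTEST form compares the same bits and so has the same caveat).

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 (also pulls in the SSE float intrinsics) */

/* Bitwise equality of two float vectors, compared as raw integer bytes. */
static int bits_equal_ps(__m128 a, __m128 b)
{
    __m128i eq = _mm_cmpeq_epi8(_mm_castps_si128(a), _mm_castps_si128(b));
    return _mm_movemask_epi8(eq) == 0xFFFF;
}

/* IEEE 754 equality of all four lanes (CMPEQPS + MOVMSKPS). */
static int values_equal_ps(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b)) == 0xF;
}

/* Shows the two cases where the bitwise and numeric answers disagree. */
static void check_float_caveat(void)
{
    __m128 pos_zero = _mm_set1_ps(0.0f);
    __m128 neg_zero = _mm_set1_ps(-0.0f);
    /* Quiet NaN built from its bit pattern to avoid depending on <math.h>. */
    __m128 nan = _mm_castsi128_ps(_mm_set1_epi32(0x7FC00000));

    assert(values_equal_ps(pos_zero, neg_zero));   /* +0 == -0 numerically */
    assert(!bits_equal_ps(pos_zero, neg_zero));    /* but the sign bit differs */
    assert(bits_equal_ps(nan, nan));               /* identical bit patterns */
    assert(!values_equal_ps(nan, nan));            /* but NaN != NaN */
}
```

So for float or double vectors, the bitwise test answers "same representation", not "same value"; which one is wanted depends on the application.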
10

You can use a compare and then extract a mask from the comparison result:

__m128i vcmp = _mm_cmpeq_epi8(v0, v1);       // PCMPEQB
uint16_t vmask = _mm_movemask_epi8(vcmp);    // PMOVMSKB
if (vmask == 0xffff)
{
    // v0 == v1
}

This works with SSE2 and later.

As noted by @Zboson, if you have SSE 4.1 then you can do it like this, which may be slightly more efficient, as it's two SSE instructions and then a test on a flag (ZF):

__m128i vcmp = _mm_xor_si128(v0, v1);        // PXOR
if (_mm_testz_si128(vcmp, vcmp))             // PTEST (requires SSE 4.1)
{
    // v0 == v1
}

FWIW I just benchmarked both of these implementations on a Haswell Core i7 using clang to compile the test harness and the timing results were very similar - the SSE4 implementation appears to be very slightly faster but it's hard to measure the difference.
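
A minimal sketch of choosing between the two implementations at compile time is shown below. It assumes the `__SSE4_1__` predefined macro, which GCC and Clang define when SSE4.1 is enabled (e.g. with `-msse4.1`); other compilers may need a different check. The names `equal128` and `check_equal128` are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>    /* SSE2 */
#ifdef __SSE4_1__
#include <smmintrin.h>    /* SSE4.1 */
#endif

/* Returns 1 if v0 and v1 are bitwise equal, 0 otherwise. */
static int equal128(__m128i v0, __m128i v1)
{
#ifdef __SSE4_1__
    /* SSE4.1 path: PXOR + PTEST */
    __m128i vcmp = _mm_xor_si128(v0, v1);
    return _mm_testz_si128(vcmp, vcmp);
#else
    /* SSE2 fallback: PCMPEQB + PMOVMSKB + integer compare */
    __m128i vcmp = _mm_cmpeq_epi8(v0, v1);
    return _mm_movemask_epi8(vcmp) == 0xFFFF;
#endif
}

/* Small self-check; both paths should give the same answers. */
static void check_equal128(void)
{
    __m128i a = _mm_set1_epi8(0x5A);
    __m128i b = _mm_set1_epi8(0x5A);
    __m128i c = _mm_insert_epi16(a, 0, 7);   /* clear the top 16-bit lane */
    assert(equal128(a, b));
    assert(!equal128(a, c));
}
```

Given the benchmark result above, the choice between the two paths is unlikely to matter much on Haswell; the fallback mainly keeps the code usable on SSE2-only builds.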

Paul R
  • It might be tricky to time. You may need to unroll. The latency of xor is 1 and of ptest 2, whereas for cmpeq it's 1, movemask 3, and test 1. So from a latency point of view the SSE4.1 method has about half the latency. The reciprocal throughput of xor is 0.33 and of ptest 1, whereas for cmpeq it's 0.5, movemask 1, and test 0.25. For reciprocal throughput it's closer. – Z boson Nov 12 '14 at 20:12
  • I did some unrolling and a few other things to see if I could make the difference any bigger but it's still relatively small (a few percent). The compiler (clang) is generating the same three instructions after the test in each case, so the unrolled instruction sequence is the same in both cases apart from the 3 SSE2 instructions (6 instructions total per iteration) versus the 2 SSE4 instructions (5 instructions total per iteration). It's using SETE after the test in both cases, so no branching. I'm keeping the data set well within L2 cache. – Paul R Nov 12 '14 at 22:21
  • Thanks for checking this. I guess I'm just disappointed that my clever solution is not so much better after all (yet). Probably you would have to find a case where the total fused micro-ops is four with the SSE4 version but the SSE2 version pushes it past four, or where the SSE2 version needs the same port twice. You could use IACA to check this. Probably it shows that in both your current tests there is no difference, i.e. the Block Throughput is the same. So you would have to find a test where it makes a difference. That's the only thing I can think of. – Z boson Nov 13 '14 at 08:36
  • In case you want more info on [IACA](https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it/26021338#26021338). – Z boson Nov 13 '14 at 08:37
  • The story may be different on different CPUs - I only tested on Haswell. It's probably still worth using the SSE4 version as even if it only makes a small difference now, things may be different on a future CPU. Thanks for the IACA link - I'll take a look at that. – Paul R Nov 13 '14 at 08:38
  • The IACA analysis is interesting - throughput is 2 cycles for both (actually 2.05 cycles for the SSE4 version), and it's 6 uops for both, even though the SSE2 version has one more instruction. There is also a little port pressure in the SSE4 version as both pxor and ptest use port 5, which is identified by IACA as the bottleneck. – Paul R Nov 13 '14 at 09:08
  • That explains (in theory) why they get the same time. The fact that pxor and ptest use the same port is disappointing. The uop count IACA reports is misleading. It reports the total micro-ops, not the fused ones. What matters is the total fused micro-ops (though they may be equal in this case). – Z boson Nov 13 '14 at 09:12
  • This is a good observation. I know that counting instructions is not necessarily a good metric since instructions have different latency and throughput. So two instructions is not necessarily better than three. But I should also keep in mind that the ports used are another important thing to consider. So latency, throughput, micro-ops, and ports all matter. It's not easy to know what is best. – Z boson Nov 13 '14 at 09:16
  • 1
    Well there is a serial dependency between the pxor and the ptest anyway, so I'm not sure that it matters that they are on the same port in this case, but it's been a useful exercise anyway, and of course, who knows what future architectures will do with this. – Paul R Nov 13 '14 at 09:16
  • Fewer instructions can be important in some cases, e.g. one extra instruction might cause the LSD to spill. Anyway, I'm grateful for the pointer to IACA - that's definitely going to be a useful tool for me in the future - thanks. – Paul R Nov 13 '14 at 09:17
  • 1
    You're welcome, if I had known about IACA I probably would not have spent a 500 rep bounty on this https://stackoverflow.com/questions/25899395/obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62 But I learned so much from that so I'm glad I did it. – Z boson Nov 13 '14 at 09:21
  • Heh - well you're up to 29 upvotes on the question (30 now), so you clawed back some of that bounty. – Paul R Nov 13 '14 at 09:30
  • Hehe. Thanks! Yeah, I hoped I would get some back. I spent a lot of time making that question and especially the code useful (meaning painless for others to implement). – Z boson Nov 13 '14 at 09:32
  • Sure - and you never know, it might help at your next job interview! – Paul R Nov 13 '14 at 09:34
-2

Consider using an SSE4.1 instruction ptest:

if(_mm_testc_si128(v0, v1)) { /* if equal */ }
else { /* if not */ }

ptest computes the bitwise AND of 128 bits (representing integer data) in a and mask, and returns 1 if the result is zero; otherwise it returns 0.

jww
  • A bitwise AND does not test for equality *per se* - you need to do a compare first and then test the result of the comparison. – Paul R Dec 04 '17 at 17:24
  • This is certainly wrong. I made this mistake today. You do NOT test equivalence with "testc". "TestC" first of all generates "and-not" and then checks if all bits are zero. Therefore, _mm_testc_si128(anything, zero) will ALWAYS return true. In particular, testc(anything, zero) is effectively "zero & (~anything) == 0", which is always true. TestC is certainly a useful function, but it doesn't answer this particular question. If you wanted to "only" do an and operation, you should use "testz" instead. – Dragontamer5788 Oct 06 '18 at 19:28
  • 1
    [Can PTEST be used to test if two registers are both zero or some other condition?](//stackoverflow.com/q/43712243) no, it can't. And definitely not equality. – Peter Cordes Feb 27 '19 at 05:01
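
The objections in the comments above can be checked with a short sketch. The wrapper names `testc128`, `testz128` and `check_ptest_is_not_equality` are hypothetical, and the `target("sse4.1")` attribute is a GCC/Clang extension assumed here so the file should build without `-msse4.1`.

```c
#include <assert.h>
#include <smmintrin.h>   /* SSE4.1: _mm_testc_si128, _mm_testz_si128 */

/* PTEST carry flag: returns 1 iff ((~a) & b) == 0. */
__attribute__((target("sse4.1")))
static int testc128(__m128i a, __m128i b)
{
    return _mm_testc_si128(a, b);
}

/* PTEST zero flag: returns 1 iff (a & b) == 0. */
__attribute__((target("sse4.1")))
static int testz128(__m128i a, __m128i b)
{
    return _mm_testz_si128(a, b);
}

/* Neither flag on the raw inputs is an equality test. */
static void check_ptest_is_not_equality(void)
{
    __m128i x    = _mm_set_epi32(1, 2, 3, 4);
    __m128i zero = _mm_setzero_si128();
    __m128i lo   = _mm_set1_epi32(0x0F);
    __m128i hi   = _mm_set1_epi32(0xF0);

    assert(testc128(x, zero));   /* "equal" reported although x != 0 */
    assert(!testc128(zero, x));  /* and the result depends on operand order */
    assert(testz128(lo, hi));    /* disjoint bits give 1 although lo != hi */
    assert(!testz128(x, x));     /* identical nonzero inputs give 0 */
}
```

In other words, PTEST only becomes an equality test once its operand is the XOR (or an inverted compare) of the two vectors, as in the other answers.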