3

I have a struct like this:

struct {
    uint32_t a;
    uint16_t b;
    uint16_t c;
    uint16_t d;
    uint8_t  e;
} s;

and I would like to compare two of the above structs for equality, in the fastest way possible. I looked at the Intel Intrinsics Guide but couldn't find a compare for integers, the options available were mainly doubles and single-floating point vector-inputs.

Could somebody please advise the best approach? I can add a union to my struct to make processing easier.

I am limited (for now) to using SSE4.2, but any AVX answers would be welcome too if they are significantly faster. I am using GCC 4.8.2

user997112
  • 29,025
  • 43
  • 182
  • 361
  • 2
    You can use any integer comparison from the `PCMPEQ` family, I don't see your problem. – Jester Jun 11 '15 at 22:21
  • 3
    This struct is essentially packed. Can you assume that to always be true? It seems like a `memcmp(&s1, &s2, sizeof(struct s))` might be the least time investment. Take advantage of whatever optimization `memcmp` has to offer. – Jonathon Reinhart Jun 11 '15 at 22:22
  • @JonathonReinhart yes can assume packed. – user997112 Jun 11 '15 at 22:31
  • @Jester _mm_cmpeq_epi64 only accepts 64 bit inputs? – user997112 Jun 11 '15 at 22:38
  • 1
    That structure is only 11 bytes; so you're going to need to either mask off the unused 5 bytes (in case they contain trash) before doing the comparison; or compare packed bytes and mask the results from the unused 5 bytes. On modern 64-bit computers it would probably be faster to use `uint16_t e` to make it 12 bytes and do a 64-bit compare and a 32-bit compare using boring old "non-SIMD" integer comparisons (especially if you're not processing arrays of these structures). – Brendan Jun 11 '15 at 22:48
  • @Brendan would I union the first four struct members then, to get the 64 bits and therefore compare the union (with the other union) and uint16_e with the other e member? – user997112 Jun 11 '15 at 23:04
  • `_mm_cmpeq_epi64` compares 128 bits in two halves. Your question title says you want to compare 16 bytes, and that is what this does. The rest of your question with all the unions and whatnot is unclear. Also you didn't say what kind of result you want. You might need to add a `PTEST` afterwards. – Jester Jun 11 '15 at 23:08
  • @Jester: If your're going to `ptest` anyway, why not `pxor` instead of `pcmpeq`? – EOF Jun 12 '15 at 00:47
  • Yeah that would work too. – Jester Jun 12 '15 at 01:01
  • I know you may think it's off-topic, but did you try to simply program comparison in regular C language and recompile it with -O3? Such code should be easy to auto-vectorize in most efficient way. And in case it doesn't work out of the box (because you by some reasons had to use tons of complex data structure and pointers/aliasing) - you can update to GCC4.9, where you will get full support for #pragma omp simd, #pragma ivdep and all other portable "explicit vectorization" means which will push compiler to auto-vectorize loop of interest. – zam Jun 12 '15 at 09:52
  • Regarding masking and scalar vs. integer: there are plenty of techniques to still vectorize such codes: horizontal instructions for SSE (with moderate speed-up), the same + more or less effective masking on AVX/AVX2 (better speed-up), effective masking with future AVX512. So intrinsics and auto-vectorizer implementations exist; the question is does it worth it for SSE? Do you want to manually re-code it for every next platform as opposed to keep it up to compiler, etc. – zam Jun 12 '15 at 09:59

2 Answers2

2

What @zx485 should have written is:

.data
  mask11byte db 0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0,0,0,0,0
.code
  pxor xmm1, xmm2  ; equiv to psubb, but runs on all 3 vector execution ports
  ptest xmm1, xmmword ptr [mask11byte]   ; SSE 4.1
  setz al     ; AL=TRUE for equal

As long as nothing bad happens (floating point exceptions), you don't need to mask off your operands before computation, even if they hold garbage. And since PTEST does a bitwise AND as part of its operation, you don't need a separate PAND at all.

For a while, I thought I had a version that could use less space and fewer uops, but I ended up needing an extra instruction because there's no pcmpneq (so I needed a logical not). So it's smaller, the same number of uops, but significantly worse latency.

.code
  PCMPEQB xmm1, xmm2  ; bytes of xmm1 = 0xFF on equal
  PMOVMSKB eax, xmm1  ; ax = high bit of each byte of xmm1
  NOT eax
  TEST eax, 0x7FF  ; zero flag set if all the low 11 bits are zero
  SETZ al    ; 17 bytes

; Or one fewer insn with BMI1's ANDN.  One fewer uop if test can't macro-fuse
  ANDN eax, eax, [mask11bits]   ; only test the low 11 bits.
;  ANDN version takes 20 bytes, plus 2B of data
.data
  mask11bits dw 07ffh

test can macro-fuse with jcc, so if you're using this as a jump condition instead of actually doing setz, you come out ahead on size. (since you don't need the 16B mask constant.)

ptest takes 2 uops, so the ptest version is 4 uops total (including the jcc or other instruction). The pmovmskb version is also 4 uops with a test/jcc macro-fused branch, but 5 with cmovcc / setcc. (4 with andn, with any of setcc / cmovcc / jcc since it can't macro-fuse`.)

(Agner Fog's table says ptest takes 1 fused-domain uop on Sandybridge, 2 on all other Intel CPUs that support it. I'm not sure I believe that, though.)

Latency on Haswell (important if the branch doesn't predict well):

  • pxor: 1 + ptest: 2 = 3 cycles
  • pcmpeqb: 1 + pmovmskb: 3 + not: 1 + test: 1 = 6 cycles
  • pcmpeqb: 1 + pmovmskb: 3 + andn: 1 = 5 cycles (but not macro-fused, so possibly 1 more cycle of latency?)

So the ptest version has significantly shorter latency: jcc can execute sooner, to detect branch mispredicts sooner.

Agner Fog's tests show ptest has latency = 3 on Nehalem, 1 on SnB/IvB, 2 on Haswell.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
0

A simple solution would be to just subtract the two structs byte wise after masking so you get an all-zero-value only if all packed bytes are identical. This code is in MASM format, but you surely can adapt that to gcc AT&T syntax or intrinsicals:

.data
  mask11byte db 0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0,0,0,0,0
.code
  pand  xmm1, xmmword ptr [mask11byte]
  pand  xmm2, xmmword ptr [mask11byte]
  psubb xmm1, xmm2
  ptest xmm1, xmm1   ; SSE 4.1
  setz al     ; AL=TRUE for equal

Addition: Because the size of the struct is 11 byte, 256bit/32byte-AVX(x) would make no sense.

zx485
  • 28,498
  • 28
  • 50
  • 59
  • ...or, you could use `pxor` instead of `psub`, and only `pand` once. – EOF Jun 14 '15 at 23:19
  • You don't need `pand` at all. Use the mask as one operand to `ptest`; that's what it's for! You can mask after either `psubb` (byte-wise subtraction) or `pxor`, but `pxor` can run on one port that `psubb` can't (SnB port0), so good call on that. – Peter Cordes Jul 03 '15 at 01:55