4

I have 16-byte 'strings' (they may be shorter, but you may assume they are padded with zeros at the end), but you may not assume they are 16-byte aligned (at least not always).

How can I write a routine that compares them (for equality) with SSE intrinsics? I found this code fragment that could be of help, but I'm not sure if it is appropriate:

register __m128i xmm0, xmm1; 
register unsigned int eax; 

xmm0 = _mm_load_epi128((__m128i*)(a)); 
xmm1 = _mm_load_epi128((__m128i*)(b)); 

xmm0 = _mm_cmpeq_epi8(xmm0, xmm1); 

eax = _mm_movemask_epi8(xmm0); 

if(eax==0xffff) //equal 
else   //not equal 

Could someone explain this or write a function body?

It needs to work in GCC/mingw (on 32 bit Windows).

Paul R
user2214913
    `_mm_load_epi128` doesn't exist, you mean either `_mm_load_si128` or since you said they can be unaligned, `_mm_loadu_si128` – harold Aug 13 '15 at 22:43
  • Semi-related: [Find the first instance of a character using simd](https://stackoverflow.com/q/40915243) – Peter Cordes Feb 20 '22 at 22:18

3 Answers

5

Vector comparison instructions produce their result as a mask of elements that are all-1s (true) or all-0s (false), according to the comparison between the corresponding source elements.

See https://stackoverflow.com/tags/x86/info for some links that will tell you what those intrinsics do.

The code in the question looks like it should work, once the load is fixed as harold's comment suggests (`_mm_loadu_si128` instead of the nonexistent `_mm_load_epi128`).

If you want to find out which elements were non-equal, then use the movemask version (pmovmskb or movmskps). You can use tzcnt / bsf to bit-scan for the first match, or popcnt to count matches. All-equal gives you a 0xffff mask; any mismatched byte clears the corresponding bit, and all-different gives you 0.
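As a concrete sketch of that movemask approach (the function name and the use of GCC's `__builtin_ctz` are my additions, not from the question):

```c
#include <emmintrin.h>  // SSE2 intrinsics

// Returns the index of the first mismatching byte, or 16 if all 16 bytes match.
static int first_mismatch(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    // pcmpeqb: 0xFF where bytes are equal; pmovmskb packs the high bits
    unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
    unsigned diff = ~mask & 0xFFFFu;      // 1-bits mark *unequal* bytes
    return diff ? __builtin_ctz(diff)     // tzcnt/bsf: first mismatch position
                : 16;
}
```

`__builtin_popcount(diff)` would count the mismatching bytes instead.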


You might wonder if SSE4.1 ptest is useful here. It's usable but it's not actually faster, especially if you're branching on the result instead of turning it into a boolean 0 / -1.

// slower alternative using SSE4.1 ptest
__m128i avec = _mm_loadu_si128((const __m128i*)a);
__m128i bvec = _mm_loadu_si128((const __m128i*)b);

__m128i diff = _mm_xor_si128(avec, bvec);  // XOR: all-zero only if *a == *b

if (_mm_test_all_zeros(diff, diff)) {
    // equal
} else {
    // not equal
}

In asm, ptest is 2 uops, and can't macro-fuse with a jcc conditional branch. So the total pxor + ptest + branch is 4 uops for the front-end, and still destroys one of the inputs unless you have AVX to put the xor result in a 3rd register.

pcmpeqb xmm0, xmm1 / pmovmskb eax, xmm0 / cmp/jcc is a total of 3 uops, with the cmp/jcc fusing into 1 uop on Intel and AMD CPUs.

If you have wider elements, you can use movmskps or movmskpd on the result of pcmpeqd/q to get a 4-bit or 2-bit mask. This is most useful if you want to bit-scan or popcnt without dividing by 4 or 8 bytes per element. (Or with AVX2, 8-bit or 4-bit instead of 32-bit mask.)
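For example, a sketch with 32-bit elements (my own illustration, not from the answer): pcmpeqd followed by movmskps yields a 4-bit mask, one bit per element:

```c
#include <emmintrin.h>  // SSE2 intrinsics

// Returns a 4-bit mask: bit i is set iff a[i] == b[i] (32-bit elements).
static int equal_dwords_mask(const int *a, const int *b)
{
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    __m128i eq = _mm_cmpeq_epi32(va, vb);          // pcmpeqd
    return _mm_movemask_ps(_mm_castsi128_ps(eq));  // movmskps: one bit per dword
}
```

A tzcnt on this mask gives the element index directly, with no division by 4 bytes per element.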

ptest is only a good idea if you don't need any extra instruction to build an input for it: test for all-zeros or not, with or without a mask. e.g. to check some bits in every element, or in some elements.
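A sketch of such a case (the function name and the 0x80 mask are my example): checking that no byte has its sign bit set needs only a constant mask as the second input, so ptest needs no extra instruction:

```c
#include <smmintrin.h>  // SSE4.1 intrinsics

// Returns 1 if no byte in v has its high bit set, i.e. (v & 0x80...80) == 0.
__attribute__((target("sse4.1")))  // lets GCC emit ptest without global -msse4.1
static int no_high_bits(__m128i v)
{
    const __m128i high_bits = _mm_set1_epi8((char)0x80);
    return _mm_testz_si128(v, high_bits);  // ptest: ZF=1 iff (v & mask) == 0
}
```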

Peter Cordes
    It seems that `ptest` instruction uses 2 uops on current Intel architectures. Moreover, it is **not** fused with conditional jump after it. So your solution generates 4 uops, while OP's code generates 3 uops only. See [this question](https://software.intel.com/en-us/forums/topic/293050) for details. Supposedly this means that your solution would be slower in a tight loop. – stgatilov Aug 27 '15 at 09:26
  • @stgatilov: Good catch. `PTEST` may have lower latency (lower mispredict penalty), but I think you're right, in this case `pcmpeq/pmovmskb/ cmp/jne` is fewer uops, because `ptest` can't compare-equal directly. A load can fold into `pcmpeq` just as well as into `pxor`. I'm not sure about the mispredict penalty, either: I've just looked at Agner Fog's latency tables, not done experiments. – Peter Cordes Aug 27 '15 at 19:27
    I have observed reduced throughput both in real benchmark and in IACA analysis results after replacing `pmovmskb` with `ptest`. It seems that `ptest` instruction can be useful only if you absolutely need to check **all** 128 bits of the register for being zero (which happens very seldom). If you are doing any comparison (which is the most popular case), then using `ptest` after it is harmful. Well, maybe it has lower latency, who knows... – stgatilov Aug 29 '15 at 07:31
  • @stgatilov: updated my answer. Note that `pcmpeq / ptest` can only test for all-*not*-equal. That's why I used `pxor / ptest` to keep the same semantics as the OP's code. `pcmpeq` should be just as fast as `pxor` on Intel, though. `ptest` runs on p0 and p5 (with 2c latency). `pcmpeqb` can run on port1 or port5, so it can use a port that `ptest` doesn't. (`pxor` can run on all 3 vector execution ports, p0/1/5.) Your test might have had fewer branch mispredicts than you expected, though, if you had different semantics. – Peter Cordes Aug 29 '15 at 07:53
  • I was testing some more complex code (related to UTF conversion), which was *frontend*-limited as IACA reported. So ports were not important for me, only number of uops. Also, the branches were of "never-take" or "always-take" types (used to return on error), so they were perfectly predicted. Maybe `ptest` is not worse in other cases, but for me it is now highly questionable if it is usable. – stgatilov Aug 29 '15 at 08:20
  • @stgatilov: makes sense. When I've looked at `ptest`, I think it's usually been equal uops, but potentially lower latency. (except in this case where I got mixed up and suggested it in a case where it's more uops. >.<) – Peter Cordes Aug 29 '15 at 08:23
1

Well, I'm not sure if this would be faster, but it can be done with a single SSE 4.2 instruction/intrinsic: checking PCMPISTRI (Packed Compare Implicit Length Strings, Return Index) for the carry and/or overflow flags:

if (_mm_cmpistrc(a, b, mode))   // checks the carry flag (set = a difference was found)
  // unequal
else
  // equal

mode would be (for your case):

const int mode = 
  _SIDD_UBYTE_OPS |         // 16 unsigned bytes per xmm
  _SIDD_CMP_EQUAL_EACH |    // strcmp
  _SIDD_NEGATIVE_POLARITY;  // find first different byte
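Wrapped into a callable function, this might look like the following sketch (function name mine; requires SSE4.2, and a and b must each point at 16 readable bytes):

```c
#include <nmmintrin.h>  // SSE4.2 intrinsics

__attribute__((target("sse4.2")))  // lets GCC emit pcmpistri without global -msse4.2
static int equal16(const char *a, const char *b)
{
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    // CF is set when a differing byte is found, so CF clear means equal.
    // The mode must be a compile-time immediate, hence the literal expression.
    return !_mm_cmpistrc(va, vb,
        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
}
```

Note that bytes after the first zero byte are not compared (the length is implicit), which matches the zero-padded strings in the question but not arbitrary 16-byte blobs.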

Unfortunately this instruction is poorly documented. So if anyone finds a decent resource aggregating all combinations of mode and the resulting flags, please share.

zx485
  • It wouldn't work, since it wouldn't compare bytes beyond a zero-byte in either register. Also, no, it wouldn't be faster. PCMPISTRI is the lightest weight SSE4.2 string insn, but it's still 3 uops (all for port0), and 11 cycle latency on Haswell. The insn sets flags as well as setting `ecx`, but `test/jcc` is still one uop, same as `jcc` alone. Intel's insn ref manual documents the hell out of everything, but understanding the descriptions for the string ops is slow going, because they're very complicated and have many modes. http://stackoverflow.com/tags/x86/info has links. – Peter Cordes Aug 27 '15 at 19:39
    Oops, forgot the OP was asking about zero-padded strings, not arbitrary 16B vectors. The string instructions have possibilities, but won't be as fast as simple comparisons for testing exact equality. If the OP had strings with one zero-byte and then trailing garbage, the string instructions are probably the best choice for strings with length `<= 15`. – Peter Cordes Aug 27 '15 at 19:52
  • @PeterCordes: Thanks for your additions. I just added this to show another way of solving this or a slightly different problem. --- Reading the intel man again, I retract that it is poorly documented. The intel arch man does explain all the bits, but from there it's long way to get something useful out of it. Now looking into the intel opt man I do admit it is explained in more verbosity with examples in chapter 10. – zx485 Aug 28 '15 at 10:08
  • thanks for the docs tip on ch10. I was just looking at Section 4.1 (which is part of the insn set ref manual). It has a block diagram that shows the processing steps, making it easier to see which step each `imm8` bit affects. Totally agree that it can take a long time to get from the docs to working code with these multi-function instructions! It's kinda too bad they're still micro-coded, so barely worth using. It's a chicken-and-egg thing I guess; not worth spending silicon to make them fast until there are many users... – Peter Cordes Aug 28 '15 at 15:28
  • Decent resource that explains the ways `pcmpistri` can work: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 – Peter Cordes Feb 27 '19 at 05:28
1

I'll try to help with the forgotten "Could someone explain this" part of the question.

register __m128i xmm0, xmm1; 
register unsigned int eax; 

Here we declare some variables. __m128i is a builtin type for integer operations on SSE registers. Note that the names of the variables do not matter at all, but the author has named them exactly as the corresponding CPU registers are called in assembly. xmm0, xmm1, xmm2, xmm3, ... are all the registers for SSE operations. eax is one of the general-purpose registers.

The register keyword was used long ago to advise the compiler to place a variable in a CPU register. Today it is essentially useless. See this question for details.

xmm0 = _mm_loadu_si128((__m128i*)(a)); 
xmm1 = _mm_loadu_si128((__m128i*)(b)); 

This code was modified as @harold suggested. Here we load 16 bytes from the given memory pointers (which may be unaligned) into the variables xmm0 and xmm1. In the assembly code these variables would most likely live directly in registers, so these intrinsics generate unaligned memory loads. Casting the pointer to the __m128i* type is necessary because the intrinsic accepts that pointer type (though I have no idea why Intel chose it).

xmm0 = _mm_cmpeq_epi8(xmm0, xmm1); 

Here we compare each byte of the xmm0 variable with the corresponding byte of the xmm1 variable for equality. The suffix _epi8 means operating on 8-bit elements, i.e. bytes. It is somewhat similar to memcmp(&xmm0, &xmm1, 16), but produces a different kind of result: a 16-byte value containing 0xFF for each pair of equal bytes and 0x00 for each pair of different bytes.

eax = _mm_movemask_epi8(xmm0); 

This is a very important SSE2 instruction, usually used to implement an if statement on some SSE condition. It takes the highest bit of each of the 16 bytes of its XMM argument and writes them into a single 16-bit integer. At the assembly level this number lives in a general-purpose register, allowing us to check its value quickly just afterwards.
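A tiny illustration of that (the values are mine): only the highest bit of each byte survives into the integer mask:

```c
#include <emmintrin.h>  // SSE2 intrinsics

// Bytes 0 and 2 have their high bit set (0x80, 0xFF); the rest do not.
static int demo_movemask(void)
{
    __m128i v = _mm_setr_epi8((char)0x80, 0x7F, (char)0xFF, 0,
                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    return _mm_movemask_epi8(v);  // bits 0 and 2 set -> binary 0101 = 5
}
```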

if(eax==0xffff) //equal 
else   //not equal

If all 16 bytes of the two XMM registers were equal, then _mm_cmpeq_epi8 must return a mask with all 128 bits set. _mm_movemask_epi8 would then return the full 16-bit mask, which is 0xFFFF. If any pair of compared bytes differed, the corresponding byte would be zeroed by _mm_cmpeq_epi8, so _mm_movemask_epi8 would return a mask with the corresponding bit clear, i.e. less than 0xFFFF.

Also, here is the explained code wrapped into a function:

#include <emmintrin.h>  // SSE2 intrinsics
#include <stdbool.h>

bool AreEqual(const char *a, const char *b) {
  __m128i xmm0 = _mm_loadu_si128((const __m128i*)a);
  __m128i xmm1 = _mm_loadu_si128((const __m128i*)b);
  xmm0 = _mm_cmpeq_epi8(xmm0, xmm1);
  unsigned int eax = _mm_movemask_epi8(xmm0);
  return (eax == 0xffff); //equal
}
stgatilov