0

I am attempting to build a header parser with fast processing. I have two issues, one is that there is a bug in the code below.

void parse_with_simd(const char *buffer, const int buffer_len) {
  const char * value = "GET ";
  __m128i u_str = _mm_loadu_si128((const __m128i *)value);
  __m128i loaded = _mm_loadu_si128((const __m128i *)buffer);
  int not_equal = _mm_cmpestrc(loaded, 4, u_str, 4,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
  debug_print("has value %d\n", not_equal);

This is what I have, debug_print is outputting that not_equal = 1.

Here is my test case.

#include "../src/header/header.h"

int main() {
  parse_with_simd("GET / HTTP/1.1", 14);
  parse_with_simd("OPTIONS /",9);
}

Both strings show that the not_equal is 1.

  • Oh, I'm sorry that's making them not equal, let me edit, and rerun just to be sure when they ARE equal. sorry. – Christopher Clark Aug 23 '21 at 00:32
  • 2
    Note that for the case of a 4-character constant string, SIMD is overkill since a simple 32-bit compare will suffice. `memcmp(buffer, "GET ", 4) == 0` compiles into the obvious `cmp / sete` instructions. – Nate Eldredge Aug 23 '21 at 00:41
  • 1
    Don't you also have an overrun problem if the string is less than 16 bytes long and near the end of a page? – Nate Eldredge Aug 23 '21 at 00:42
  • @NateEldredge: Yes, you need to beware of [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739), as discussed in comments on an answer to the querent's previous question: [Can't convert value to a vector with Intel Intrinsics](https://stackoverflow.com/posts/comments/121740473) (which proposed using variable-length memcpy into a `__m128i` variable, which would likely cause a store-forwarding stall and make this even slower. You def. want to arrange for it to be safe to load 16 bytes, which isn't guaranteed for string literals. – Peter Cordes Aug 23 '21 at 01:17
  • For finding exact matches at the start of a buffer (i.e. memcmp), probably best to use `pcmpeqb` / `pmovmskb` to get a bitmask of matches, then `match & 0x0f == 0x0f` to check that the first 4 bytes all matched. `pcmpestri` or `pcmpestrm` are more total uops than that (8 on Skylake, with one per 3 cycle throughput - https://uops.info/table.html). Or just let GCC inline `memcmp`, if it does a decent job for string lengths other than 4, 8, or 16. – Peter Cordes Aug 23 '21 at 01:22
  • 2
    @christopherclark: `_mm_loadu_si128` loads the full 16 bytes unconditionally, so if the string was too short, you are dead before any kind of comparison or length check occurs at all. – Nate Eldredge Aug 23 '21 at 01:28
  • Yes, that's the intrinsic. I tend to type asm mnemonics because they're shorter and often clearer. [Compare 16 byte strings with SSE](https://stackoverflow.com/q/31999284) shows how to use it (and `_mm_movemask_epi8`). – Peter Cordes Aug 23 '21 at 02:02
  • @PeterCordes last question. Since you know about the operation times. It's not just detecting gets. If it is not a get, then I need to check for `Post, Put, Delete, and finally Options` would intrinsics still be faster for this if else tree? – Christopher Clark Aug 23 '21 at 22:17
  • GET is probably the most common by far, then maybe PUT, so check for those two first, in that order. The performance of parsing DELETE and OPTIONS requests are probably much less important, but yeah you can still check for them with SIMD, or with integer compares (e.g. hopefully memcmp will inline nicely). It might be tricky to get a compiler to emit an 8-byte load and then left-shift by 3*8 bits to discard the top bytes, leaving the 5 you want to compare, so it might be easier to get good performance with a `pcmpeqb` / `pmovmskb` and check for the first 5 mask bits being set for "POST " – Peter Cordes Aug 23 '21 at 22:33
  • With AVX, you could do an 8-byte broadcast load so you could share some of the work in checking for GET and POST, just processing the 16-bit mask differently. There's a `_mm256_broadcast_sd(double const *)` which is strict-aliasing safe, but strangely not a 128-bit version of it, so you might need to wrestle with the compiler to get it to emit a `vbroadcastsd` with C source that's fully strict-aliasing safe to broadcast into a `__m128i`. – Peter Cordes Aug 23 '21 at 22:37

0 Answers0