0

According to "Schema Validation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)" (Intel, 2008), Intel added instructions to assist in character searches and comparisons on two 16-byte operands at a time. I wrote some basic strlen() and strcmp() functions in C, but they seem slower than glibc's.

I would like to experiment with inline assembly to see how it affects my project's XML input/output.

I've read (on here) that using SIMD for things like strlen() is rife with potential problems (memory alignment), so I'm a little concerned about using it in production code.

user1016031
  • They're not faster for strlen / strcmp. Use SSE2 or AVX2 `pcmpeqb` like glibc does, or just *use* glibc's functions. Certainly scalar C will be slower because gcc/clang can't auto-vectorize loops unless the trip-count can be calculated before the first iteration. That rules out search loops with a data-dependent break; you have to manually vectorize. SIMD strlen is very possible; that's what glibc uses with hand-written asm, you just have to be careful: [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739) shows how. – Peter Cordes Oct 26 '20 at 03:46
  • What compiler? GCC? MSVC? – Michael Petch Oct 26 '20 at 03:46
  • see [Implementing strcmp, strlen, and strstr using SSE 4.2 instructions](https://www.strchr.com/strcmp_and_strlen_using_sse_4.2) – phuclv Oct 26 '20 at 04:24
  • Now basically a duplicate of [How much faster are SSE4.2 string instructions than SSE2 for memcmp?](https://stackoverflow.com/q/46762813) - no, you very rarely want SSE4.2. – Peter Cordes Oct 26 '20 at 06:18
  • If your use-case includes buffers that are known to be aligned, or where you know for some other reason it's safe to read up to 15 bytes past the end of the string, then yes hand-rolled can be worth it, usually using SSE2. Especially if you know something about your typical string lengths being usually 16 to 31 bytes or something. (Usually-short strings like under 16 bytes could possibly make `pcmpistri` worth it, esp. for strcmp.) But really if finding string lengths is important, it's often best to use explicit-length strings where you save a length with the pointer. – Peter Cordes Oct 26 '20 at 06:19
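
To make the comments above concrete, here is a minimal sketch of a manually vectorized strlen using SSE2 `pcmpeqb` via intrinsics. It is not glibc's code: the name `strlen_sse2_sketch` is made up for illustration, and it assumes x86 with GCC or clang (for `__builtin_ctz`). It only ever issues aligned 16-byte loads, which cannot cross a page boundary, so reading past the terminator does not fault. That is still outside what ISO C guarantees, and tools such as AddressSanitizer or Valgrind may flag it.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; an illustrative sketch, not glibc's implementation. */
size_t strlen_sse2_sketch(const char *s)
{
    const __m128i zero = _mm_setzero_si128();

    /* Round the pointer down to a 16-byte boundary. An aligned 16-byte
       load stays inside one page, so it cannot fault as long as at least
       one byte of the chunk (here: *s) is readable. */
    uintptr_t addr = (uintptr_t)s;
    unsigned misalign = (unsigned)(addr & 15);
    const __m128i *p = (const __m128i *)(addr - misalign);

    /* pcmpeqb + pmovmskb: bit i of mask is set if byte i of the chunk is 0. */
    __m128i chunk = _mm_load_si128(p);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));

    /* Discard matches in the bytes that precede the start of the string. */
    mask >>= misalign;
    if (mask != 0)
        return (size_t)__builtin_ctz(mask);

    /* Main loop: one aligned 16-byte chunk per iteration. */
    for (;;) {
        ++p;
        chunk = _mm_load_si128(p);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return (size_t)((const char *)p - s) + (size_t)__builtin_ctz(mask);
    }
}
```

Calling `strlen_sse2_sketch("abc")` returns 3. Whether this beats glibc on a given machine is something to measure rather than assume; glibc's versions add loop unrolling and runtime selection of wider vectors.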

1 Answer

3

glibc's implementations will be hard to beat. These functions are carefully optimized and include pieces hand-written in assembly. Here is glibc's x86_64 implementation of strcmp, using AVX2 instructions. Be warned, it is 800 lines:
https://github.com/lattera/glibc/blob/master/sysdeps/x86_64/multiarch/strcmp-avx2.S

For more detail, see also Peter Cordes' fantastic explanation of glibc's implementation.

Pascal Getreuer
  • As Pascal mentioned, glibc's implementations will be hard to beat. They conditionally compile for all processor versions and SIMD implementations (including AVX). Which sequence to use is somewhat model specific (e.g. Some sequences are better/worse than others on a given model). glibc does the heavy lifting for you. – Craig Estey Oct 26 '20 at 03:02
  • 1
    @CraigEstey: They don't conditionally *compile*, they do runtime CPU dispatch via the dynamic linker resolver mechanism. So a single build of glibc can work on any CPU, but use AVX2 on systems where that's available. With all the dispatch cost done once at dynamic-link time, because calls into shared libraries already go through a function pointer. – Peter Cordes Oct 26 '20 at 03:42
  • 1
    @pascal: Note that glibc *doesn't* actually dispatch to its SSE4.2 strcmp function on any real CPUs (AFAIK) because it's not faster. AVX2 `vpcmpeqb` is much better, and even the SSE2 version is better, because `pcmpistri` is microcoded as multiple uops even on the latest CPUs. See links at the top of [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/a/57676035) – Peter Cordes Oct 26 '20 at 03:45
  • @PeterCordes thanks for these details! Your post is excellent. I have updated my answer to refer to the strcmp-avx2.S and added a pointer to your post. – Pascal Getreuer Oct 26 '20 at 05:04
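
The runtime dispatch described in the comments can be imitated in user code. The sketch below is only an analogy: glibc resolves the implementation through the dynamic linker, whereas this uses a plain function pointer that resolves itself on the first call. All function names are hypothetical, the "AVX2" variant is a placeholder that simply forwards to `strlen`, and `__builtin_cpu_supports` assumes GCC or clang targeting x86.

```c
#include <stddef.h>
#include <string.h>

typedef size_t (*strlen_fn)(const char *);

/* Hypothetical stand-ins for per-ISA implementations. */
static size_t strlen_generic(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        ++n;
    return n;
}

static size_t strlen_avx2_standin(const char *s)
{
    return strlen(s);  /* placeholder: a real build would put AVX2 code here */
}

static size_t strlen_resolve_then_call(const char *s);

/* Starts out pointing at the resolver; afterwards points at the chosen
   implementation, so the CPU check is paid only once. */
static strlen_fn strlen_impl = strlen_resolve_then_call;

static size_t strlen_resolve_then_call(const char *s)
{
    __builtin_cpu_init();  /* GCC/clang builtin: set up CPU feature data */
    strlen_impl = __builtin_cpu_supports("avx2") ? strlen_avx2_standin
                                                 : strlen_generic;
    return strlen_impl(s);
}

/* Public entry point: always one indirect call, as with a shared-library call. */
size_t my_strlen(const char *s)
{
    return strlen_impl(s);
}
```

This sketch updates the pointer without synchronization, so a multithreaded program would want an atomic store or a one-time initializer. For ordinary use, calling glibc's `strlen` already gets this kind of dispatch for free.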