41

I've tried to compile this program on an x64 computer:

#include <cstring>

int main(int argc, char* argv[])
{
  return ::std::strcmp(argv[0],
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really long string"
  );
}

I compiled it like this:

g++ -std=c++11 -msse2 -O3 -g a.cpp -o a

But the resulting disassembly is like this:

   0x0000000000400480 <+0>:     mov    (%rsi),%rsi
   0x0000000000400483 <+3>:     mov    $0x400628,%edi
   0x0000000000400488 <+8>:     mov    $0x22d,%ecx
   0x000000000040048d <+13>:    repz cmpsb %es:(%rdi),%ds:(%rsi)
   0x000000000040048f <+15>:    seta   %al
   0x0000000000400492 <+18>:    setb   %dl
   0x0000000000400495 <+21>:    sub    %edx,%eax
   0x0000000000400497 <+23>:    movsbl %al,%eax
   0x000000000040049a <+26>:    retq 

Why is no SIMD used? I suppose it could be used to compare, say, 16 chars at once. Should I write my own SIMD strcmp, or is it a nonsensical idea for some reason?

user1095108
  • TBH who cares? Use `std::string::operator==`. Checking string lengths up front is a very effective optimization. Also: which compiler, which settings? – MSalters Oct 27 '14 at 11:03
  • I'm currently writing a parser and sometimes, for performance reasons, the old `::std::strcmp` pops up. The compiler is `g++-4.9.1`. – user1095108 Oct 27 '14 at 11:05
  • Don't the null terminators make this difficult? Because the compiler can't simply assume that there are 16 bytes' worth of characters to be read. There might be 1. – ta.speot.is Oct 27 '14 at 11:08
  • That's why the O(1) length test of `std::string` is so good. Not only do you know whether there's a point to comparing contents, when the lengths are equal you also know _how much_ content you need to compare. And therefore I don't believe the claim that `strcmp` "pops up for performance reasons". (GCC has an outdated std::string implementation; that could also matter) – MSalters Oct 27 '14 at 11:11
  • strcmp compares two null-terminated C strings. So if you want to use SIMD you need to find the lengths first to make sure you don't read out of range. But to find the length you need to compare every char with NUL in both strings. So while you are comparing every char in your C strings with NUL, strcmp will already have returned a result before you even get to your SIMD instructions. – JustAnotherCurious Oct 27 '14 at 11:11
  • @JustAnotherCurious actually, `std::string` stores the length of the string upon any changes made. So if `std::string` is used everywhere, it may be faster for comparison. – RamblingMad Oct 27 '14 at 11:13
  • @AdrianoRepetti No, `strcmp` should not be locale aware. Are you thinking of `strcoll`? – David Heffernan Oct 27 '14 at 11:16
  • @DavidHeffernan _"The comparison is done lexicographically."_, you're right and I need a long break... – Adriano Repetti Oct 27 '14 at 11:19
  • @JustAnotherCurious: Checking for zero bytes can be done with SIMD instructions. And the latest Intel processors have some very fancy string instructions that can be used to support both strcmp and std::string == (8 and 16 bit) – gnasher729 Oct 27 '14 at 11:23
  • I don't know if there is any string operation support in SSE2, but there is in [SSE4.2](http://en.wikipedia.org/wiki/SSE4#SSE4.2); did you try that, if your computer supports SSE4.2? http://www.strchr.com/strcmp_and_strlen_using_sse_4.2 – phuclv Oct 27 '14 at 11:30
  • besides, x86_64 implies the existence of SSE2, so you don't need to explicitly declare that to gcc – phuclv Oct 27 '14 at 11:34
  • @CoffeeandCode I agree, but the question was about raw C strings without a stored length, so I just pointed out that problem. Of course using std::string in C++ is better than raw C strings. – JustAnotherCurious Oct 27 '14 at 12:54

8 Answers

45

In an SSE2 implementation, how should the compiler make sure that no memory accesses happen past the end of the string? It has to know the length first, and this requires scanning the string for the terminating zero byte.

If you scan for the length of the string you have already accomplished most of the work of a strcmp function. Therefore there is no benefit to using SSE2.

However, Intel added instructions for string handling in the SSE4.2 instruction set. These handle the terminating zero byte problem. For a nice write-up on them, read this blog post:

http://www.strchr.com/strcmp_and_strlen_using_sse_4.2
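
For illustration, here is a minimal C++ intrinsics sketch of that SSE4.2 loop (the function name and structure are my own, not from the blog post). It carries the usual caveat of such loops: it may read up to 15 bytes past the terminating zero, so it is only safe when the buffers are padded or page boundaries are handled some other way:

#include <nmmintrin.h>  // SSE4.2 intrinsics (PCMPISTRI and friends)

// Sketch: compare two NUL-terminated strings 16 bytes at a time.
// PCMPISTRI handles the terminating zero implicitly, so no strlen pass is needed.
int strcmp_sse42_sketch(const char* s1, const char* s2)
{
    // unsigned bytes, "equal each" comparison, negated result,
    // index of the least-significant set bit (= first mismatching byte)
    const int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                     _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT;
    for (int i = 0; ; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i*)(s1 + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(s2 + i));
        if (_mm_cmpistrc(a, b, mode)) {        // CF set: a mismatch was found
            int k = _mm_cmpistri(a, b, mode);  // index of the first mismatch
            return (unsigned char)s1[i + k] - (unsigned char)s2[i + k];
        }
        if (_mm_cmpistrz(a, b, mode))          // ZF set: a terminating zero was seen
            return 0;                          // equal all the way through the NUL
    }
}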

Nils Pipenbrinck
  • I immediately tried to compile with `-msse4.2`, and the same machine code is generated. Good to know that I wasn't completely wrong and that it could be done, though. – user1095108 Oct 27 '14 at 11:32
  • using a length-prefixed string type like Pascal's or std::string, you can determine whether you have passed the string's end – phuclv Oct 27 '14 at 11:38
  • It doesn't have to ensure that no memory accesses happen over the end of the string. It only has to ensure that no memory accesses happen on pages that the string is not at least partly on, and that's much easier. – harold Oct 27 '14 at 12:01
  • @harold, the limits are even looser; you just have to ensure that there is something at all after the last address. Using x-aligned memory for the strings solves the problem without checks. – Surt Oct 27 '14 at 12:08
  • @Surt fair enough, but that doesn't change much does it? It only really knows about that one string, unless it's in statically allocated memory. – harold Oct 27 '14 at 12:11
  • @harold, if your allocator always returns aligned memory to strings, the check becomes superfluous. (and I was agreeing with you :)) – Surt Oct 27 '14 at 12:17
  • @Surt: For plain `char const*` though, you don't know their alignment up front; it doesn't matter though, as the first step will simply be to align on an x-byte boundary before performing the SSE specialized algorithm – Matthieu M. Oct 27 '14 at 12:47
  • It's not necessary to know the size of the string. Reading 15 extra bytes is not a problem for stack-allocated arrays. For statically allocated arrays there is a way to get around this by adding a 16-byte .bss section after the .data section. – Z boson Oct 27 '14 at 12:58
  • @Zboson: And how do you know what you're comparing is not heap-allocated, and at the very end of the page, making any access beyond the `'\0'` an immediate pagefault? How do you know the system is, and ever will be, using page-based memory protection in the first place? – DevSolar Oct 27 '14 at 14:37
  • @DevSolar, I don't know. See Section "1.7 String instructions and safety precautions" in the asmlib manual. Agner discusses this in detail. For the heap he writes "It is recommended to allocate sufficient heap space if you are using dynamically allocated strings. A safer and more efficient solution is to allocate a memory pool of sufficient size and store multiple strings in the same memory pool". So GCC to be safe can't assume this but that does not mean faster solutions don't exist which can be used in most cases. You need to know when to be careful. – Z boson Oct 27 '14 at 14:45
  • @DevSolar you can safely make those assumptions, and you can't go onto the next page with aligned reads. – harold Oct 27 '14 at 17:17
  • This is a bogus argument. Optimized C library `strcmp` can use SIMD. It's harder to do safely with two pointers that could be misaligned relative to each other, though. See [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/questions/37800739/is-it-safe-to-read-past-the-end-of-a-buffer-within-the-same-page-on-x86-and-x64). Answer: yes in ASM, dicey in C because of UB for reading outside objects. The compiler is emitting asm directly so it knows what is safe. – Peter Cordes Sep 28 '17 at 14:27
  • `repz cmpsb` is not terrible for very short strings, and saves a call, but I don't think it's a good choice if one of the strings is known to be very long. – Peter Cordes Sep 28 '17 at 14:29
  • BTW, [Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled?](https://stackoverflow.com/a/55589634) is basically the same GCC bug, inlining slow `repe scasb` at `-O1` which leads to disaster for long strings. That bug is now fixed: GCC calls the libc functions. – Peter Cordes May 17 '22 at 13:30
18

GCC in this case is using a builtin strcmp. If you want it to use the version from glibc, use -fno-builtin. But you should not assume that GCC's builtin version of strcmp or glibc's implementation of strcmp is efficient. I know from experience that GCC's builtin memcpy and glibc's memcpy are not as efficient as they could be.
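
For example (the exact commands are mine; `-fno-builtin-strcmp` is the per-function form of the flag), either of these makes GCC emit a real call into libc instead of inlining `repz cmpsb`:

g++ -std=c++11 -O3 -fno-builtin a.cpp -o a
g++ -std=c++11 -O3 -fno-builtin-strcmp a.cpp -o a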

I suggest you look at Agner Fog's asmlib. He has optimized several of the standard library functions in assembly. See the file strcmp64.asm. This has two versions: a generic version for CPUs without SSE4.2 and a version for CPUs with SSE4.2. Here is the main loop for the SSE4.2 version:

compareloop:
        add     rax, 16                ; increment offset
        movdqu  xmm1, [rs1+rax]        ; read 16 bytes of string 1
        pcmpistri xmm1, [rs2+rax], 00011000B ; unsigned bytes, equal each, invert. returns index in ecx
        jnbe    compareloop            ; jump if not carry flag and not zero flag

For the generic version he writes

This is a very simple solution. There is not much gained by using SSE2 or anything complicated

Here is the main loop of the generic version:

_compareloop:
        mov     al, [ss1]
        cmp     al, [ss2]
        jne     _notequal
        test    al, al
        jz      _equal
        inc     ss1
        inc     ss2
        jmp     _compareloop 

I would compare the performance of GCC's builtin strcmp, glibc's strcmp, and the asmlib strcmp. You should look at the disassembly to make sure that you actually get the builtin code. For example, GCC's memcpy does not use the builtin version for sizes larger than 8192.

Edit: Regarding the string length, Agner's SSE4.2 version reads up to 15 bytes beyond the end of the string. He argues this is rarely a problem since nothing is written. It's not a problem for stack-allocated arrays. For statically allocated arrays it could be a problem at memory page boundaries. To get around this he adds 16 bytes to the .bss section after the .data section. For more details see section 1.7, String instructions and safety precautions, in the manual of the asmlib.
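
To make that concrete, here is a hypothetical one-liner in the spirit of Agner's "allocate sufficient heap space" advice for dynamically allocated strings:

// Hypothetical illustration: over-allocate so that 16-byte vector loads
// running past the terminating NUL stay inside the allocation.
char* s = new char[len + 1 + 15];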

Z boson
  • Comparing 16 bytes at a time is a real win, even if you have to check for both null and the actual string. The beauty of it is that once you have made the code you can use it "forever" and benefit in new programs. – Surt Oct 27 '14 at 12:14
  • @Surt, can you say a little more about what you mean? Are you saying that the generic version can be done better, or just that the SSE4.2 version is the right idea? – Z boson Oct 27 '14 at 12:20
  • I would think the non-SSE loop could be improved by subtracting one pointer from the other before the start and then using an indexed addressing mode, or does that optimization not help on newer processors? – supercat Oct 27 '14 at 22:12
  • @supercat, I don't know. From theory one would have to look at the latency and throughput and how many cycles it takes. For example if one port always needs two cycles and the rest one then adding another instruction to another port might not make a difference. This could be checked with [IACA](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer). But I have never profiled `strcmp` so I only know what I read. – Z boson Oct 28 '14 at 08:27
  • A byte-at-a-time loop can't be optimal! [glibc's sse2 `strcmp`](https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S.html) checks for page-crossing and then uses unaligned loads. It is of course fully safe, never reading from a page that doesn't contain any part of either string, because anything else would be unacceptable in a standard library. Having even less overhead is certainly possible if you can arrange to skip those checks, but it looks pretty reasonable. It uses `pcmpeqb` and `pminub` to check for non-match or `0` (end of string). – Peter Cordes Sep 28 '17 at 15:36
  • BTW, [Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled?](https://stackoverflow.com/a/55589634) is basically the same GCC bug, inlining slow `repe scasb` at `-O1` which leads to disaster for long strings. That bug is now fixed: GCC calls the libc functions. (Which use `pcmpeqb` or the AVX2 version, not `pcmpistri`. **`pcmpistri` is not faster than SSE2 for strlen or strcmp**.) – Peter Cordes May 17 '22 at 13:33
7

When the standard library for C was designed, the implementations of string.h functions that were most efficient when dealing with large amounts of data would be reasonably efficient for small amounts, and vice versa. While there may be some string-comparison scenarios where sophisticated use of SIMD instructions could yield better performance than a "naive implementation", in many real-world scenarios the strings being compared will differ in the first few characters. In such situations, the naive implementation may yield a result in less time than a "more sophisticated" approach would spend deciding how the comparison should be performed. Note that even if SIMD code is able to process 16 bytes at a time and stop when a mismatch or end-of-string condition is detected, it would still have to do additional work equivalent to using the naive approach on the last 16 characters scanned. If many groups of 16 bytes match, being able to scan through them quickly may benefit performance; but in cases where the first 16 bytes don't match, it would be more efficient to just start with the character-by-character comparison.

Incidentally, another potential advantage of the "naive" approach is that it would be possible to define it inline as part of the header (or a compiler might regard itself as having special "knowledge" about it). Consider:

int strcmp(const char *p1, const char *p2)
{
  int t1,t2;
  do
  {
    /* strcmp must compare bytes as unsigned char */
    t1=*(const unsigned char*)p1; t2=*(const unsigned char*)p2;
    if (t1 != t2)
    {
      if (t1 > t2) return 1;
      return -1;
    }
    if (!t1)  /* terminating NUL reached in both strings */
      return 0;
    p1++; p2++;
  } while(1);
}
...invoked as:
if (strcmp(p1,p2) > 0) action1();
if (strcmp(p3,p4) != 0) action2();

While the function would be a little big to inline, inlining could in the first case allow a compiler to eliminate the code checking whether the returned value was greater than zero, and in the second case eliminate the code checking whether t1 was greater than t2. Such optimization would not be possible if the function were dispatched via an indirect jump.

supercat
  • Although what you say sounds reasonable, you don't provide any evidence that "in cases where the first 16 bytes don't match, it would be more efficient to just start with the character-by-character comparison". In fact, when I look at the preamble and epilogue of the main loops of the SSE4.2 and the generic function in the _asmlib_, they are almost identical. I would not be surprised if the SSE4.2 version is never slower than the generic version, even for arrays less than or equal to 16 bytes. – Z boson Oct 28 '14 at 08:20
  • @Zboson: I wasn't aware of the SSE 4.2 additions; string-compare code could benefit from earlier SSE functions but not as greatly. If library code were being compiled for 4.2 only, I could see the string functions as a potential "unconditional win", but if code had to check for their availability before executing them such a check would be a 100% loss on processors which don't support them (which I understand are still quite numerous) and even on processors which do might not come out ahead when strings differ in the first character. – supercat Oct 28 '14 at 15:21
  • Using a [CPU dispatcher](https://stackoverflow.com/questions/23676426/disable-avx2-functions-on-non-haswell-processors/23677889#23677889) the `strcmp` function only has to check the CPUID the first time it's called. Every call after that will jump directly to the SSE4.2 or the generic version. This is how the asmlib does it. So there is only an overhead on the first call. You could also call an init function for a library with all function you are replacing which makes all the function pointers point to the appropriate version. – Z boson Oct 28 '14 at 19:00
  • @Zboson: A CPU dispatcher would add an indirect jump; I know that's a much smaller penalty than it used to be, but I don't know how small. Further, strcmp() is small enough that an aggressive in-lining optimizer might possibly tackle it (if a header were defined such that it could do so). See edit above. – supercat Oct 28 '14 at 19:59
  • That's a good point. Actually, GCC already inlines `strcmp` in the OPs example. Like I said in my answer, it would be interesting to profile and compare the inline, glibc, and asmlib methods. I'm not going to do it anytime soon though... – Z boson Oct 28 '14 at 20:07
4

Making an SSE2 version of strcmp was an interesting challenge for me.
I don't really like compiler intrinsic functions because of code bloat, so I decided to take an auto-vectorization approach. My approach is based on templates and approximates a SIMD register as an array of words of different sizes.

I tried to write an auto-vectorizing implementation and test it with GCC and MSVC++ compilers.

So, what I learned is:
1. GCC's auto-vectorizer is good (awesome?)
2. MSVC's auto-vectorizer is worse than GCC's (it doesn't vectorize my packing function)
3. All compilers declined to generate the PMOVMSKB instruction, which is really sad

Results:
The version compiled by online GCC gains ~40% with SSE2 auto-vectorization. On my Windows machine with a Bulldozer-architecture CPU, the auto-vectorized code is faster than the online compiler's, and the results match the native implementation of strcmp. But the best thing about the idea is that the same code can be compiled for any SIMD instruction set, at least on ARM and x86.

Note:
If anyone finds a way to make the compiler generate the PMOVMSKB instruction, overall performance should get a significant boost.
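
For comparison, here is a hand-written intrinsics sketch of the pattern the auto-vectorizers missed: a PCMPEQB compare collapsed into one scalar mask by PMOVMSKB (_mm_movemask_epi8). The function name is mine, and like the other vector loops discussed on this page it assumes that reading 16 bytes past the NUL is safe, e.g. with padded buffers:

#include <emmintrin.h>  // SSE2 intrinsics

// Sketch: one PMOVMSKB turns 16 byte-compares into a single bitmask,
// so a single branch tests all 16 lanes at once.
int strcmp_sse2_sketch(const char* s1, const char* s2)
{
    for (int i = 0; ; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i*)(s1 + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(s2 + i));
        // bit k is set where the two blocks differ
        unsigned diff = ~_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) & 0xFFFFu;
        // bit k is set where s1 has its terminating NUL
        unsigned zero = _mm_movemask_epi8(_mm_cmpeq_epi8(a, _mm_setzero_si128()));
        if (diff | zero) {
            int k = __builtin_ctz(diff | zero);  // first mismatch or first NUL
            return (unsigned char)s1[i + k] - (unsigned char)s2[i + k];
        }
    }
}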

Command-line options for GCC: -std=c++11 -O2 -m64 -mfpmath=sse -march=native -ftree-vectorize -msse2 -march=native -Wall -Wextra

Links:
Source code compiled by Coliru online compiler
Assembly + Source code (Compiler Explorer)

@PeterCordes, thanks for the help.

Roman
  • You can and should make the godbolt link go directly to your code on godbolt, instead of expecting people to copy/paste for you. SSE2 is portable to any modern x86. It's only a mess if you write messy code. – Peter Cordes Aug 01 '16 at 23:01
  • @PeterCordes I've tried to make a short link to my code, but I just couldn't make one. I always ended up with a 200-character link that would take up half of the post, and Google's URL shortener didn't help either; it just doesn't recognize the resulting URL as valid. – Roman Aug 02 '16 at 05:52
  • Only use godbolt short-links for comments, [because full links can't rot](http://meta.stackoverflow.com/a/319594/224132). When there's no relevant char-limit, just make full links that don't display in the text of your post. (godbolt shortlinks already use goo.gl internally. Oh that's weird, I think the shortlink button might be broken. I hardly ever use it.) – Peter Cordes Aug 02 '16 at 14:15
  • It doesn't look like g++ actually auto-vectorized your code at `-O3` (which includes `-ftree-vectorize`). I only see a `pcmpeqd`, which it uses to generate a vector of all-ones for storing to the stack. I see a lot of byte-at-a-time copying and comparing, and testing bitfields. If that's faster than the builtin `strcmp`, I wonder just how bad it is... – Peter Cordes Aug 02 '16 at 14:23
  • @PeterCordes Yep, gcc starts to mess up the code badly at -O3. – Roman Aug 02 '16 at 15:23
  • Wow, that's weird. I've never seen gcc auto-vectorize something with `-O2 -ftree-vectorize`, but not at `-O3`. That's not a good sign for the quality / efficiency of your code. There weren't any high-level comments in it documenting the overall approach, so I didn't really try to read all that mass of templated C++. The vectorized asm output looks like it's emulating [`PMOVMSKB`](http://www.felixcloutier.com/x86/PMOVMSKB.html) after the vector `pcmpeqb` by storing and then doing byte loads + shifts. That's insanely horrible :( – Peter Cordes Aug 02 '16 at 15:35
  • @PeterCordes I have updated link to Compiler Explorer in the post with my command-line, please check it out. – Roman Aug 02 '16 at 15:38
  • Why did you leave out `-Wall -Wextra`? But yeah, that's the asm I was commenting on in my last comment. Vectorized compare, but then packing a bit from each compare-result byte one at a time into a GP register. – Peter Cordes Aug 02 '16 at 15:39
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/118928/discussion-between-romu4-and-peter-cordes). – Roman Aug 02 '16 at 15:41
  • @PeterCordes I implemented a slightly different micro-benchmark and changed the way registers are compared to zero. Now the main vectorized loop consists almost entirely of SSE instructions (a few test instructions are present). Could you check the new code, please? – Roman Aug 04 '16 at 15:47
  • It looks ok, but the horizontal OR down to a single byte takes a lot of steps. If you could horizontal OR down to 4 or 8 bytes, and check those as an int64_t, the compiler could shorten that chain of `vpsrldq` / `vpor` dramatically. (just one then `vmovq rax,xmm0`). (BTW, on godbolt you should probably use `-march=haswell` or something instead of `-march=native`. Because that will change depending on what hardware godbolt.org is running on. Note that at the moment you're targeting AVX2 because of `-march=native`. It doesn't look like your code uses any 256b integer instructions, though.) – Peter Cordes Aug 04 '16 at 15:57
  • @PeterCordes '-march=haswell' doesn't change anything in the assembly. Auto-vectorizer works on arrays, because my "register" is a fixed-size 16-byte array it can't use AVX. I tried to compact result in one register and then do a single compare, but it ended in a massive code bloat. You can check out results here, modified version on the right part: [Screenshot](http://i.imgur.com/5GwGpyF.png) – Roman Aug 04 '16 at 16:12
  • Yes, godbolt currently runs on a Haswell server, so `-march=haswell` is the same as `-march=native` *now*. But in 5 years, it might include AVX512. And yes, your code is using the 3-operand non-destructive AVX versions of SSE2 instructions. That's why `vpsrldq` can put its output in a different register, instead of shifting in-place. (Avoiding a lot of `movdqa` insns). You might want to use `-mtune=haswell` to tune for Haswell without making code that doesn't work on older CPUs. – Peter Cordes Aug 04 '16 at 16:45
  • That sucks when gcc decides to widen byte vectors to quadwords with `pmovzx`. :/ I'm not sure how best to write code that would get gcc to emit one `vpsrldq` / `vpor` and then a `vmovq rax,xmm0` / `test rax,rax`, but that would get the job done more efficiently than ORing all the way down to a single byte and extracting it with `vpextrb`. Maybe use a `union { char b[16]; int64_t q[2]; };` or something, so you can access a range of elements? – Peter Cordes Aug 04 '16 at 16:49
  • @PeterCordes I have inserted a type cast to interpret simd-register as an array of CPU words (64-bit words for 64-bit CPU). Compiler did a pretty neat code cleanup, only 3 instructions to test for zero now and zero bit-shifting. Updated links to the code, so you could check this out. – Roman Aug 05 '16 at 09:55
  • Is it faster? The vector loop is now *huge*, AFAICT, from `.L5` to the `jmp .L5`. There's some scalar stuff inside that it jumps over, and maybe 2 unrolled iterations (one just after .L5 and one just after .L7). It's doing the 64bit OR with a `movq` and a store/reload to L1. I think it's doing so much storing/reloading that it's going to be a problem for throughput. (The latency isn't very important for strings with the only difference very far from the front, since checking each block of 16B should be a separate dep chain. The extra latency does make the final branch mispredict worse.) – Peter Cordes Aug 05 '16 at 14:26
  • Also, be careful with type-punning by `reinterpret_cast`ing a pointer. The language's aliasing rules do guarantee that `char*` can alias anything, which is why it works in this case. This is why I recommended a union (because that is guaranteed to always work between any types in GNU C++, although still not in ISO C++. I think the ISO rules are the same in C and C++, and the only "safe" way to type-pun is with memcpy (or other `char*`-based method. But nobody wants that so people use unions in practice.)) – Peter Cordes Aug 05 '16 at 14:33
  • @PeterCordes honestly, I'm struggling with the benchmark now, always getting strange results. The new version seems to be a little bit faster. But I think we need a better micro-benchmark to see real results. – Roman Aug 05 '16 at 14:40
  • Your benchmark is still meaningless because you only time one call to each. Warm-up effects will dominate, especially for the library function (because of lazy dynamic linking). Even cache-misses `chrono::nanoseconds` are probably significant. You need to create a loop which actually makes multiple calls to the same compare function with the same args many times. I'm not interested enough to do it for you or dig up a benchmark framework you could use, but you should watch this presentation to learn more about benchmarking / profiling: https://www.youtube.com/watch?v=nXaxk27zwlk – Peter Cordes Aug 05 '16 at 18:51
  • @PeterCordes I actually did what you wrote; I'm not so careless as to benchmark functions once without a reason. It is much harder to make a decent micro-benchmark with multiple calls, since a loop that calls the same function with the same arguments gets optimized away. I tried to solve that with a new variable, a bitset that contains the results of all 32 runs, but then the results are wrong: my function gets 20-30 times slower than native strcmp and its analogs. Then I added a flush_cache() call after each iteration of the loop. After that all functions had equal run time. – Roman Aug 06 '16 at 04:42
  • I was commenting based on the `main()` that's still included in your Godbolt link. I tried it on my desktop and got ~300 vs. ~900, IIRC. I didn't look at coliru because running it on an online IDE is no good when you're measuring wall-clock time with [`chrono::high_resolution_clock`](http://en.cppreference.com/w/cpp/chrono/high_resolution_clock) (rather than "user" CPU time) on a non-idle system. – Peter Cordes Aug 06 '16 at 04:49
3

I suspect there's simply no point in SIMD versions of library functions with very little computation. I imagine that functions like strcmp, memcpy and similar are actually limited by memory bandwidth and not by CPU speed.

Frerich Raabe
  • It's only limited by memory bandwidth if the array does not fit in the cache. You could use `memcpy` or `strcmp` in a tight inner loop not limited by memory bandwidth. – Z boson Oct 27 '14 at 13:01
  • https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html (The optimization manual) spends a fair bit of time on memcpy implementation techniques, and performance on Ivy Bridge and later CPUs where `REP STOSB` triggers a microcoded high-performance `memset`, but has higher startup overhead than a 128b or 256b wide SSE/AVX implementation. See section 3.7.5, or search for memcpy. There's a CPUID feature bit for ERMSB (Enhanced Rep Movsb and StosB). – Peter Cordes May 01 '15 at 06:23
3

It depends on your implementation. On MacOS X, functions like memcpy, memmove and memset have implementations that are optimised depending on the hardware you are using (the same call will execute different code depending on the processor, set up at boot time); these implementations use SIMD and for big amounts (megabytes) use some rather fancy tricks to optimise cache usage. Nothing for strcpy and strcmp as far as I know.

Convincing the C++ standard library to use that kind of code is difficult.

gnasher729
  • Can you show some optimized mac disassembly for `strcmp`? – user1095108 Oct 27 '14 at 11:27
  • Yeah, according to Agner Fog when he looked into this the MacOS X versions were pretty good but the Linux and Windows versions were inefficient. So probably the MacOS X assembly is interesting. – Z boson Oct 27 '14 at 12:35
0

AVX2 would actually be faster.

Edit: It is related to registers and IPC

Instead of relying on one big instruction, you can use a plethora of SIMD instructions with 16 registers of 32 bytes each; in UTF-16 that gives you 256 chars to play with!

Double that with AVX-512 in a few years!

AVX instructions also have high throughput.

According to this blog: https://blog.cloudflare.com/improving-picohttpparser-further-with-avx2/

Today on the latest Haswell processors, we have the potent AVX2 instructions. The AVX2 instructions operate on 32 bytes, and most of the boolean/logic instructions perform at a throughput of 0.5 cycles per instruction. This means that we can execute roughly 22 AVX2 instructions in the same amount of time it takes to execute a single PCMPESTRI. Why not give it a shot?

Edit 2.0: SSE/AVX units are power gated, and mixing SSE and/or AVX instructions with regular ones involves a transition with a performance penalty that you should not have with plain `strcmp`.
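
To make the 32-bytes-per-iteration idea concrete, here is a minimal AVX2 sketch (my own illustration, not from the blog; as with the other vector loops on this page it assumes reading past the NUL stays within mapped, e.g. padded, memory):

#include <immintrin.h>  // AVX2 intrinsics
#include <cstdint>

// Sketch: the PCMPEQB/PMOVMSKB pattern widened to 32 bytes
// (VPCMPEQB + VPMOVMSKB), giving a full 32-bit mask per iteration.
int strcmp_avx2_sketch(const char* s1, const char* s2)
{
    for (int i = 0; ; i += 32) {
        __m256i a = _mm256_loadu_si256((const __m256i*)(s1 + i));
        __m256i b = _mm256_loadu_si256((const __m256i*)(s2 + i));
        std::uint32_t diff = ~(std::uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(a, b));
        std::uint32_t zero = (std::uint32_t)_mm256_movemask_epi8(
            _mm256_cmpeq_epi8(a, _mm256_setzero_si256()));
        if (diff | zero) {
            int k = __builtin_ctz(diff | zero);  // first mismatch or terminating NUL
            return (unsigned char)s1[i + k] - (unsigned char)s2[i + k];
        }
    }
}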

Avlin
-3

I don't see the point in "optimizing" a function like strcmp.

You will need to find the length of the strings before applying any kind of parallel processing, which will force you to read the memory at least once. While you're at it, you might as well use the data to perform the comparison on the fly.

If you want to do anything fast with strings, you will need specialized tools like finite state machines (lex comes to mind for a parser).

As for C++ std::string, it is inefficient and slow for a large number of reasons, so the gain from checking lengths in comparisons is negligible.

kuroi neko
  • I've used lex and bison before and they generate a lot of code. It is (usually) valid C++, but efficient? I somehow doubt it. Anyhow, parsing was not the relevant topic in my question. – user1095108 Oct 27 '14 at 11:38
  • better performance => lower energy use => longer battery duration => happy environmentally conscious customer. – Surt Oct 27 '14 at 11:56
  • std::string may be inefficient in some ways, but it's good for determining string length – phuclv Oct 28 '14 at 10:02
  • The usual crowd of C++ zealots is barking at the blasphemator... Surt comment is irrelevant in the extreme ; you should stay in your marketing office and let the programmers do the talking, pal. As for you, Mr. OP, if you're afraid of the code generated by lexx, you should be terrified by SIMD instructions. They are much more difficult to master (I mean using them is easy enough, but writing code that actually runs better is another story entierely). – kuroi neko Oct 30 '14 at 11:35