
I have been experimenting with improving the performance of strcmp under certain conditions. However, I unfortunately cannot even get a plain vanilla strcmp implementation to perform as well as the library implementation.

I saw a similar question, but the answers there say the difference came from the compiler optimizing away comparisons of string literals. My test does not use string literals.

Here's the implementation (comparisons.cpp):

int strcmp_custom(const char* a, const char* b) {
    while (*b == *a) {
        if (*a == '\0') return 0;
        a++;
        b++;
    }
    return *b - *a;
}

And here's the test driver (driver.cpp):

#include "comparisons.h"

#include <array>
#include <chrono>
#include <iostream>

void init_string(char* str, int nChars) {
    // 10% of strings will be equal, and 90% of strings will have one char different.
    // This way, many strings will share long prefixes so strcmp has to exercise a bit.
    // Using random strings still shows the custom implementation as slower (just less so).
    str[nChars - 1] = '\0';
    for (int i = 0; i < nChars - 1; i++)
        str[i] = (i % 94) + 32;

    if (rand() % 10 != 0)
        str[rand() % (nChars - 1)] = 'x';
}

int main(int argc, char** argv) {
    srand(1234);

    // Pre-generate some strings to compare.
    const int kSampleSize = 100;
    std::array<char[1024], kSampleSize> strings;
    for (int i = 0; i < kSampleSize; i++)
        init_string(strings[i], kSampleSize);

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < kSampleSize; i++)
        for (int j = 0; j < kSampleSize; j++)
            strcmp(strings[i], strings[j]);

    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "strcmp        - " << (end - start).count() << std::endl;

    start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < kSampleSize; i++)
        for (int j = 0; j < kSampleSize; j++)
            strcmp_custom(strings[i], strings[j]);

    end = std::chrono::high_resolution_clock::now();
    std::cout << "strcmp_custom - " << (end - start).count() << std::endl;
}

And my makefile:

CC=clang++

test: driver.o comparisons.o
    $(CC) -o test driver.o comparisons.o

# Compile the test driver with optimizations off.
driver.o: driver.cpp comparisons.h
    $(CC) -c -o driver.o -std=c++11 -O0 driver.cpp

# Compile the code being tested separately with optimizations on.
comparisons.o: comparisons.cpp comparisons.h
    $(CC) -c -o comparisons.o -std=c++11 -O3 comparisons.cpp

clean:
    rm comparisons.o driver.o test

On the advice of this answer, I compiled my comparison function in a separate compilation unit with optimizations on and compiled the driver with optimizations off, but I still see a slowdown of more than 3x:

strcmp        - 154519
strcmp_custom - 506282

I also tried copying the FreeBSD implementation but got similar results.

I'm wondering if my performance measurement is overlooking something. Or is the standard library implementation doing something fancier?
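
In case the measurement itself is overlooking something: I know the driver is built with -O0, but this is the kind of change I'd make to the timing loops in driver.cpp so the calls couldn't be elided even with optimizations on (just a sketch, not what produced the numbers above):

volatile int sink = 0;  // volatile: the compiler must keep every strcmp call
for (int i = 0; i < kSampleSize; i++)
    for (int j = 0; j < kSampleSize; j++)
        sink += strcmp(strings[i], strings[j]);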

kevinAlbs
  • I think stdlib has this function implemented in assembly, probably using some assembly-related tricks (though I can't give any examples). – ForceBru Jul 02 '17 at 20:10
  • If the library implementation performed as poorly as the vanilla implementation, I would consider it broken. – harold Jul 02 '17 at 20:13
  • For example: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 – Matt Timmermans Jul 02 '17 at 20:14
  • I think that you can achieve similar performance using SSE intrinsics; you don't have to resort to asm. Usually asm is used to squeeze out the last 10-20% of performance. – geza Jul 02 '17 at 20:18
  • In higher optimization levels some fundamental functions like memcpy are not even linked in from a library but rather replaced with compiler-internal versions; if you want to prevent this behavior for gcc, try [`-fno-builtin`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#Other-Builtins) – PeterT Jul 02 '17 at 20:24
  • @PeterT `-fno-builtin` doesn't seem to affect my case, but that's good to know for future reference – kevinAlbs Jul 02 '17 at 22:51
  • Why do you use -O3 in only one of your compilation lines, and -O0 in the other? – Arnaud Aliès Jul 03 '17 at 08:29
  • I would expect the main improvement in the standard library to come from comparing 4 or even 8 bytes at a time, drastically reducing the amount of memory access needed. – fishinear Jul 03 '17 at 10:39

2 Answers


I don't know which standard library you have, but just to give you an idea of how serious C library maintainers are about optimizing the string primitives, the default strcmp used by GNU libc on x86-64 is two thousand lines of hand-optimized assembly language, as of version 2.24. There are separate, also hand-optimized, versions for when the SSSE3 and SSE4.2 instruction set extensions are available. (A fair bit of the complexity in that file appears to be because the same source code is used to generate several other functions; the machine code winds up being "only" 1120 instructions.) 2.24 was released roughly a year ago, and even more work has gone into it since.

They go to this much trouble because it's common for one of the string primitives to be the single hottest function in a profile.
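
To give a flavor of what the vectorized versions do, here is a minimal SSE2 sketch of the "compare 16 bytes per iteration" idea. It is illustrative only, not glibc's code, and it assumes both strings live in buffers padded to a multiple of 16 bytes so the unaligned 16-byte loads never run past the end of mapped memory (the real implementations handle alignment so they never cross a page):

#include <emmintrin.h>  // SSE2 intrinsics

int strcmp_sse2_sketch(const char* a, const char* b) {
    for (;;) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));

        // Bit i of 'diff' is set where a[i] != b[i];
        // bit i of 'zero' is set where a[i] == '\0'.
        unsigned diff = ~_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) & 0xFFFF;
        unsigned zero = _mm_movemask_epi8(_mm_cmpeq_epi8(va, _mm_setzero_si128()));

        if (diff | zero) {
            // The first mismatch or terminator decides the result.
            int i = __builtin_ctz(diff | zero);  // GCC/Clang builtin
            return (unsigned char)a[i] - (unsigned char)b[i];
        }
        a += 16;
        b += 16;
    }
}

Even this toy version does 16 byte comparisons per loop iteration; the versions shipped in glibc add the alignment handling, plus separate code paths selected at load time according to which instruction set extensions the CPU reports.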

zwol
  • Wow, didn't know people were getting so crazy about this seemingly simple function... – ForceBru Jul 02 '17 at 20:15
  • That is really interesting and it does make sense that the library authors went to such lengths. I guess in general it's probably not worth reimplementing library functions yourself unless there's a really good reason. – kevinAlbs Jul 02 '17 at 22:17
  • The main optimizations are using vector instructions if available and comparing more than one byte at a time if both addresses are suitably aligned. – Martin Bonner supports Monica Jul 03 '17 at 09:04
  • Trying to out-optimize something included in a standard lib is a losing battle for most of us. =) +1 – jpmc26 Jul 03 '17 at 09:19

Excerpts from my disassembly of glibc's strcmp (symbol version GLIBC_2.2.5) on x86_64 Linux:

0000000000089cd0 <strcmp@@GLIBC_2.2.5>:
   89cd0:   48 8b 15 99 a1 33 00    mov    0x33a199(%rip),%rdx        # 3c3e70 <_IO_file_jumps@@GLIBC_2.2.5+0x790>
   89cd7:   48 8d 05 92 58 01 00    lea    0x15892(%rip),%rax        # 9f570 <strerror_l@@GLIBC_2.6+0x200>
   89cde:   f7 82 b0 00 00 00 10    testl  $0x10,0xb0(%rdx)
   89ce5:   00 00 00 
   89ce8:   75 1a                   jne    89d04 <strcmp@@GLIBC_2.2.5+0x34>
   89cea:   48 8d 05 9f 48 0c 00    lea    0xc489f(%rip),%rax        # 14e590 <__nss_passwd_lookup@@GLIBC_2.2.5+0x9c30>
   89cf1:   f7 82 80 00 00 00 00    testl  $0x200,0x80(%rdx)
   89cf8:   02 00 00 
   89cfb:   75 07                   jne    89d04 <strcmp@@GLIBC_2.2.5+0x34>
   89cfd:   48 8d 05 0c 00 00 00    lea    0xc(%rip),%rax        # 89d10 <strcmp@@GLIBC_2.2.5+0x40>
   89d04:   c3                      retq
   89d05:   90                      nop
   89d06:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
   89d0d:   00 00 00 
   89d10:   89 f1                   mov    %esi,%ecx
   89d12:   89 f8                   mov    %edi,%eax
   89d14:   48 83 e1 3f             and    $0x3f,%rcx
   89d18:   48 83 e0 3f             and    $0x3f,%rax
   89d1c:   83 f9 30                cmp    $0x30,%ecx
   89d1f:   77 3f                   ja     89d60 <strcmp@@GLIBC_2.2.5+0x90>
   89d21:   83 f8 30                cmp    $0x30,%eax
   89d24:   77 3a                   ja     89d60 <strcmp@@GLIBC_2.2.5+0x90>
   89d26:   66 0f 12 0f             movlpd (%rdi),%xmm1
   89d2a:   66 0f 12 16             movlpd (%rsi),%xmm2
   89d2e:   66 0f 16 4f 08          movhpd 0x8(%rdi),%xmm1
   89d33:   66 0f 16 56 08          movhpd 0x8(%rsi),%xmm2
   89d38:   66 0f ef c0             pxor   %xmm0,%xmm0
   89d3c:   66 0f 74 c1             pcmpeqb %xmm1,%xmm0
   89d40:   66 0f 74 ca             pcmpeqb %xmm2,%xmm1
   89d44:   66 0f f8 c8             psubb  %xmm0,%xmm1
   89d48:   66 0f d7 d1             pmovmskb %xmm1,%edx
   89d4c:   81 ea ff ff 00 00       sub    $0xffff,%edx
...

The real thing is 1183 lines of assembly, full of cleverness for detecting CPU features and using vectorized instructions. libc maintainers know they can get an edge by optimizing the handful of functions that applications call thousands of times.
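
One of the simpler tricks of that kind, also mentioned in the comments above, is comparing a whole machine word at a time and using a bit trick to spot a zero byte. A minimal sketch of the idea (not glibc's code; it assumes both pointers are 8-byte aligned so each 8-byte load stays within one page, and real implementations also handle the unaligned cases):

#include <cstdint>
#include <cstring>

int strcmp_word_sketch(const char* a, const char* b) {
    const std::uint64_t ones  = 0x0101010101010101ULL;
    const std::uint64_t highs = 0x8080808080808080ULL;

    for (;;) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a, 8);  // one 8-byte load instead of 8 byte loads
        std::memcpy(&wb, b, 8);

        // ((w - ones) & ~w & highs) is nonzero iff some byte of w is zero.
        if (wa != wb || ((wa - ones) & ~wa & highs) != 0) {
            // A mismatch or a terminator lies in this 8-byte block;
            // fall back to byte-by-byte to find its exact position.
            for (int i = 0; i < 8; i++) {
                unsigned char ca = a[i];
                unsigned char cb = b[i];
                if (ca != cb || ca == '\0')
                    return ca - cb;
            }
        }
        a += 8;
        b += 8;
    }
}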

For comparison, your version at -O3:

comparisons.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_Z13strcmp_customPKcS0_>:
int strcmp_custom(const char* a, const char* b) {
    while (*b == *a) {
   0:   8a 0e                   mov    (%rsi),%cl
   2:   8a 07                   mov    (%rdi),%al
   4:   38 c1                   cmp    %al,%cl
   6:   75 1e                   jne    26 <_Z13strcmp_customPKcS0_+0x26>
        if (*a == '\0') return 0;
   8:   48 ff c6                inc    %rsi
   b:   48 ff c7                inc    %rdi
   e:   66 90                   xchg   %ax,%ax
  10:   31 c0                   xor    %eax,%eax
  12:   84 c9                   test   %cl,%cl
  14:   74 18                   je     2e <_Z13strcmp_customPKcS0_+0x2e>
int strcmp_custom(const char* a, const char* b) {
    while (*b == *a) {
  16:   0f b6 0e                movzbl (%rsi),%ecx
  19:   0f b6 07                movzbl (%rdi),%eax
  1c:   48 ff c6                inc    %rsi
  1f:   48 ff c7                inc    %rdi
  22:   38 c1                   cmp    %al,%cl
  24:   74 ea                   je     10 <_Z13strcmp_customPKcS0_+0x10>
  26:   0f be d0                movsbl %al,%edx
  29:   0f be c1                movsbl %cl,%eax
        if (*a == '\0') return 0;
        a++;
        b++;
    }
    return *b - *a;
  2c:   29 d0                   sub    %edx,%eax
}
  2e:   c3                      retq   
Brian Cain