6

Our internal program is written in C and makes extensive use of snprintf() for many pieces, and I noticed that during debugging with perf record/report, it's spending a lot of time on the following:

       │      _IO_vfprintf_internal():                                                                                                                                                                                             ▒
       │        mov    -0x510(%rbp),%rdx                                                                                                                                                                                           ▒
       │        mov    %r12,%rsi                                                                                                                                                                                                   ▒
       │        mov    %r15,%rdi                                                                                                                                                                                                   ▒
       │      → callq  *0x38(%rax)                                                                                                                                                                                                 ▒
       │        cmp    %rax,-0x510(%rbp)                                                                                                                                                                                           ▒
       │        mov    -0x530(%rbp),%r9                                                                                                                                                                                            ▒
       │      ↑ jne    91a                                                                                                                                                                                                         ▒
       │        mov    -0x4d0(%rbp),%esi                                                                                                                                                                                           ▒
       │        mov    -0x540(%rbp),%ecx                                                                                                                                                                                           ▒
       │        mov    $0x7fffffff,%eax                                                                                                                                                                                            ▒
       │        sub    %esi,%eax                                                                                                                                                                                                   ▒
       │        add    %esi,%ecx                                                                                                                                                                                                   ▒
       │        cltq                                                                                                                                                                                                               ▒
       │        cmp    %rax,-0x510(%rbp)                                                                                                                                                                                           ▒
       │      ↑ jbe    252b                                                                                                                                                                                                        ▒
       │      ↑ jmpq   28f0                                                                                                                                                                                                        ▒
       │4a70:   xor    %eax,%eax                                                                                                                                                                                                   ▒
       │        or     $0xffffffffffffffff,%rcx                                                                                                                                                                                    ▒
       │        mov    %r12,%rdi                                                                                                                                                                                                   ▒
 99.52 │        repnz  scas %es:(%rdi),%al

It seems regardless of the GCC version/glibc, I've tried GCC 4.8.5 all the way to GCC 9, -O2 to -O3, etc. I've even used a sandbox with glibc from -HEAD.

Edit: I found a test program that demonstrates this problem quite well with perf:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

char dst[20000];

static void test_strcpy(int i) {
    size_t len = sizeof(dst);
    char* src = (char*)malloc(len + 1);
    memset(src, 'a', len);
    src[len] = 0;

    while (i--)
        strcpy(dst, src);

    free(src);
}

static void test_strncat(int i) {
    size_t len = sizeof(dst);
    char* src = (char*)malloc(len + 1);
    memset(src, 'a', len);
    src[len] = 0;

    while (i--) {
    dst[0] = 0;
        strncat(dst, src, sizeof(dst));
    }

    free(src);
}

#define BENCH(FN, ARG)                          \
do {                                    \
    struct timeval start, end;                      \
    long sec, msec;                         \
    gettimeofday(&start, NULL);                     \
    FN(ARG);                                \
    gettimeofday(&end, NULL);                       \
    sec = end.tv_sec - start.tv_sec;                    \
    msec = end.tv_usec - start.tv_usec;                 \
    if (msec < 0) {                         \
    --sec;                              \
    msec += 1000000;                        \
    }                                   \
    printf("%13s(%5d): %2ld sec %6ld msec\n", #FN, ARG, sec, msec); \
} while (0)

static void test_snprintf(int i) {
    size_t len = sizeof(dst);
    char* src = (char*)malloc(len + 1);
    memset(src, 'a', len);
    src[len] = 0;

    while (i--)
        snprintf(dst, sizeof(dst), "%s", src);

    free(src);
}

int main(int argc, char *argv[]) {
int i;
for (i = 0; i < 10; i++) {
    BENCH(test_strcpy, 10);
    BENCH(test_strcpy, 100);
    BENCH(test_strcpy, 1000);
    BENCH(test_strcpy, 10000);

    BENCH(test_strncat, 10);
    BENCH(test_strncat, 100);
    BENCH(test_strncat, 1000);
    BENCH(test_strncat, 10000);

    BENCH(test_snprintf, 10);
    BENCH(test_snprintf, 100);
    BENCH(test_snprintf, 1000);
    BENCH(test_snprintf, 10000);
    printf("\n");
}
    return 0;
}

It seems as if repnz scasb is the culprit, regardless of the string size. I've read that repnz scasb shouldn't be used for small strings because the setup cost on silicon is quite high.

ecm
  • 2,583
  • 4
  • 21
  • 29
user1016031
  • 123
  • 1
  • 7
  • 1
    `_IO_vfprintf_internal` is in library code; are you really recompiling *glibc* every time with different optimization options? See also [Why is this code 6.5x slower with optimizations enabled?](https://stackoverflow.com/q/55563598) re: how old GCC optimizes strlen calls badly, by inlining a slow `repne scasb`. Presumably your glibc was built with an old GCC. – Peter Cordes Oct 29 '20 at 23:03
  • gcc is the compiler not the C library. are you statically linking? – old_timer Oct 29 '20 at 23:13
  • 1
    Yes, I've tried 3 different glibc versions (rebuilt them in a sandbox) with many different gcc versions. – user1016031 Oct 29 '20 at 23:13
  • I edited the post to reflect exactly what I have tried. – user1016031 Oct 29 '20 at 23:15
  • 3
    printf/sprintf/snprintf, etc are expected to be slow – old_timer Oct 29 '20 at 23:15
  • 3
    `printf` et. al. is naturally slow--that is its nature. It's fine for error messages, but shouldn't be used in time/mission critical inner loops. _If_ you're using it as a convenient way to do multiple concatenations (e.g. `sprintf(buf,"%s/%s/%s",head,mid,tail);`), you may want to use `strcpy/strcat`. In fact, [the non-standard, GNU specific] `stpcpy` can be faster (e.g.) `char buf[1000]; char *bp = buf; bp = stpcpy(bp,head); *bp++ = '/'; bp = stpcpy(bp,mid); *bp++ = '/'; bp = stpcpy(bp,tail);` – Craig Estey Oct 29 '20 at 23:23
  • 1
    Actually, looking at the man page, `stpcpy` was standardized since POSIX.1-2008, so it should be fine. It's been in `glibc` since 1992. It is _heavily_ optimized to be super fast, using (e.g.) x86 SIMD instructions, etc. – Craig Estey Oct 29 '20 at 23:31
  • When you say "slow", how slow are we talking? Could you post the output of your demo program, so that we can compare? Spending most of your time in the repnz isn't unexpected, as that is the instruction that does all the byte copying work. – Snild Dolkow Oct 31 '20 at 05:40
  • @SnildDolkow I'm not the OP, but from some quick testing it's ten or twelve times slower, even with a dumb-simple format like `snprintf(b, sizeof b, "%s", s)`. Which is a high shame, because the "alternatives" (`strncpy`, `strlcpy`) all suck. –  Nov 01 '20 at 09:35

1 Answers1

5

GCC bug 88809 is closed, fixed for GCC10, affecting GCC9 and earlier.

GCC10 and later should no longer expand strlen inline with slow crap like repnz scasb when compiling glibc. (See Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? for details of how strlen compiled in affected GCC versions. The version you show using repnz scasb looks like the -O1 worst case.)


snprintf is still not fast in general (lots of overhead parsing the format string and passing varargs through wrapper functions). But with that bugfix, it should hopefully scale similarly to large strings as strcpy, or at least as strlen + memcpy which makes two passes over the data. So with large inputs that lead to spending most of its time in strlen and/or memcpy, it will at least be doing that efficiently. Probably with the time evenly split between strlen and memcpy, if it internally uses memcpy instead of a hand-written loop. Instead of 99% in strlen like you found with perf record, unless that was only counting sample hits in this function itself, not in calls to memcpy.

Your benchmark test-cases with medium to large strings like 1000 and 10k bytes will still spend most of their time on that, not parsing format strings and varargs / wrapper function overhead, but for small sizes that overhead will dominate vs. strncpy that just goes straight into copying, finding the length along the way.


As pointed out in comments, your strncat loop is an anti-pattern. Building up a large string that way re-scans the already-copied part every time, giving O(N * M) run time to concatenate N strings of length M.

Or in your case, once you fill up the destination, then it just costs O(dstsize) time per call to basically memchr it for a terminating zero.

To avoid this, use strlen and memcpy yourself to keep track of the point to append to. Or if you can use POSIX-2008 functions, use stpcpy which returns the end of the copy, the place where you want to start copying the next. Unfortunately this doesn't give an easy way to do bounds checks.

stpcpy should be as fast as strcpy in glibc, I assume using the same hand-written asm. It's been part of glibc since 1992.

See also strcpy() return value about the poor API design of ISO C string functions that date back very early. (Separate from the strncpy design that makes it suck for avoiding buffer overflows, not forcing the destination to be zero-terminated.)

ecm
  • 2,583
  • 4
  • 21
  • 29
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    `stpcpy` is fast in GLIBC (on x86 at least). Also note you can essentially get the desired `strncpy` behavior with `*dst = 0; strncat(dst, src, n)`. Better if you can put the `*dst` way earlier to avoid a SF stall (maybe worth optimizing that use case. Might look into it if the `strlcpy` proposal fails). `strncat` doesn't carry the `memset` baggage that `strncpy` does (often `strnlen` + `memcpy` is faster than `strncpy` if len is expected to be geater than null-term because if len is larger than null-term `strncpy` becomes essentially `strlen` + `memcpy` + `memset` with extra overhead). – Noah Apr 19 '22 at 00:09