strlen performance implementation

Question

This is a multipurpose question:

How does this compare to the glibc strlen implementation?
Is there a better way to to this in general and for autovectorization.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Todo: Document */
#define WORD_ONES_LOW   ((size_t)-1 / UCHAR_MAX)
#define WORD_ONES_HIGH  (((size_t)-1 / UCHAR_MAX) << (CHAR_BIT - 1))

/*@doc
 * @desc: see if an arch word has a zero
 * #param: w - string aligned to word size
 */
static inline bool word_has_zero(const size_t *w)
{
    return ((*w - WORD_ONES_LOW) & ~*w & WORD_ONES_HIGH);
}

/*@doc
 * @desc: see POSIX strlen()
 * @param: s - string
 */
size_t strlen(const char *s)
{
    const char *z = s;

    /* Align to word size */
    for (; ((uintptr_t)s & (sizeof(size_t) - 1)) && *s != '\0'; s++);

    if (*s != '\0') {
        const size_t *w;

        for (w = (const size_t *)s; !word_has_zero(w); w++);
        for (s = (const char *)w; *s != '\0'; s++);
    }

    return (s - z);
}

Related [Why does glibc's strlen need to be so complicated to run quickly?](//stackoverflow.com/q/57650895) has an answer that addresses the type-punning UB, and links to glibc's hand-written asm that's actually used on most platforms. **Yes, x86 SIMD can do *much* better than this, doing 16 bytes more efficiently than this does 4 or 8, with no false positives.** — Peter Cordes, Oct 22 '19 at 19:37

AnT stands with Russia · Accepted Answer · 2012-08-03T03:24:49.870

Well, this implementation is based on virtually the same trick (Determine if a word has a zero byte) as the glibc implementation you linked. They do pretty much the same thing, except that in glibc version some loops are unrolled and bit masks are spelled out explicitly. The ONES and HIGHS from the code you posted is exactly himagic = 0x80808080L and lomagic = 0x01010101L form glibc version.

The only difference I see is that glibs version uses a slightly different criterion for detecting a zero byte

if ((longword - lomagic) & himagic)

without doing ... & ~longword (compare to HASZERO(x) macro in your example, which does the same thing with x, but also includes ~(x) member). Apparently glibc authors believed this shorter formula is more efficient. Yet it can result in false positives. So they check for false positives under that if.

It is indeed an interesting question, what is more efficient: a single-stage precise test (your code) or a two-stage test that begins with rough imprecise check followed, if necessary, by a precise second check (glibc code).

If you want to see how they compare in terms of actual performance - time them on your platform and your data. There's no other way.

Note that the comments in the glibc source file seem to be about the `#if 0`'d out code that's both slower and slightly less precise (only one false positive situation) than the above code. The code that's actually used (in the `#else` case) in the glibc source has false positives for **any** high byte except 128. Thus, my naive conclusion would be that the glibc code performs better for ASCII text but much, much worse for anything with high bytes. — R.. GitHub STOP HELPING ICE, Aug 03 '12 at 01:15

score 3 · Answer 2 · answered Aug 08 '12 at 02:05

3

Also, please note this implementation can read past the end of a char array here:

for (w = (const void *)s; !HASZERO(*w); w++);

and therefore relies on undefined behaviour.

answered Aug 08 '12 at 02:05

beerboy

1,304
12
12

how so? we align to word size with the initial for loop and then iterate per word for '\0' a char array must be null terminated or bad things will happen. I'll update it with a less obtuse code – nwmcsween Aug 08 '12 at 09:47
2

Example: an empty null terminated string is a char array of size 1. `*w` will read 4 or 8 bytes from memory, thereby accessing bytes past the end of the array. On most platforms this will work fine, but it is still undefined behavior. – beerboy Aug 13 '12 at 04:50
Re: undefined behavior: Another user has just posted a question about that; see http://stackoverflow.com/q/20312340/978917 . The consensus seems to be that this *is* undefined behavior, but that it's O.K. for the `strlen` that ships with a compiler to rely on only-compiler-guaranteed behavior. – ruakh Dec 01 '13 at 17:37
Another [Q&A about reading past the end of an object in C vs. assembly](https://stackoverflow.com/questions/37800739/is-it-safe-to-read-past-the-end-of-a-buffer-within-the-same-page-on-x86-and-x64). It's definitely UB in C, but if you align your pointer, you can in practice do it without faulting on implementations that check accesses with page granularity. (As long as the compiler doesn't reach any unfortunate conclusions while optimizing based on the assumption that the program contains no UB...) `strlen` implementations hand-written in asm definitely do this, but C isn't asm. – Peter Cordes Aug 19 '17 at 18:03

score 0 · Answer 3 · answered Aug 03 '12 at 01:21

0

To answer your second question, I think the naive byte-based strlen implementation will result in better auto-vectorization by the compiler, if it's smart and support for vector instruction set extensions (e.g. SSE) has been enabled (e.g. with -msse or an appropriate -march). Unfortunately, it won't result in any vectorization with baseline cpus which lack these features, even though the compiler could generate 32- or 64-bit pseudo-vectorized code like the C code cited in the question, if it were smart enough...

answered Aug 03 '12 at 01:21

R.. GitHub STOP HELPING ICE

208,859
35
376
711

Compilers still don't auto-vectorize loops where the iteration-count is unknown before loop-entry. So search-loops in general don't get auto-vectorized, unfortunately. – Peter Cordes Aug 19 '17 at 16:46
That's really unfortunate. – R.. GitHub STOP HELPING ICE Aug 19 '17 at 17:18
IDK if they choose not to because of things like valgrind that would complain about reading past the end of arrays. That can't be the only reason, though, because you still don't get autovectorization when searching in a static array (or a function arg passed as `int arr[static 1024]`). Just checked a really simple search of an `int` array with gcc and clang: https://godbolt.org/g/7xugwg – Peter Cordes Aug 19 '17 at 17:47
Oh wow, ICC does auto-vectorize it. https://godbolt.org/g/6boZSh. But MSVC, gcc, and clang all fail in this trivial case. For a non-static array, ICC gets to an alignment boundary [so reading past the end of an object can't fault](https://stackoverflow.com/questions/37800739/is-it-safe-to-read-past-the-end-of-a-buffer-within-the-same-page-on-x86-and-x64). valgrind would still be unhappy, though. – Peter Cordes Aug 19 '17 at 17:58

strlen performance implementation

3 Answers3

Linked

Related