Fast generic strlen() implementation that can accept arbitrary terminator character

Question

template <char terminator = '\0'>
size_t strlen(const char *str)
{
    const char *char_ptr;
    const unsigned long int *longword_ptr;
    unsigned long int longword, magic_bits, himagic, lomagic;

    for (char_ptr = str; ((unsigned long int) char_ptr 
             & (sizeof (longword) - 1)) != 0; ++char_ptr)
       if (*char_ptr == '\0')
           return char_ptr - str;

    longword_ptr = (unsigned long int *) char_ptr;

    himagic = 0x80808080L;
    lomagic = 0x01010101L;

    for (;;)
    { 
        longword = *longword_ptr++;

        if (((longword - lomagic) & himagic) != 0)
        {
            const char *cp = (const char *) (longword_ptr - 1);

            if (cp[0] == 0)
                return cp - str;
            if (cp[1] == 0)
                return cp - str + 1;
            if (cp[2] == 0)
                return cp - str + 2;
            if (cp[3] == 0)
                return cp - str + 3;
        }
    }
}

The above is glibc strlen() code. It relies on trick to Determine if a word has a zero byte to make it fast.

However, I wish to make the function work with any terminating character, not just '\0', using template. Is it possible to do something similar ?

[`strchr`](https://en.cppreference.com/w/cpp/string/byte/strchr) ? I expect it to be as fast as possible — 463035818_is_not_an_ai, Apr 08 '22 at 07:02
A lazy way to adapt the existing code: if say you are searching for `X` (0x58), then replace `longword = *longword_ptr++;` by `longword = *longword_ptr++ ^ 0x58585858;`. — Nate Eldredge, Apr 08 '22 at 07:07
@NateEldredge that sounds like a correct solution. Can you write a sample code and benchmark against `strchr` if possible? — Huy Le, Apr 08 '22 at 07:09
I can try, but it looks like the glibc code is meant for a platform with 32-bit `long int`. So I either have to adapt it, or compile as a 32-bit program which seems unfair for a benchmark. — Nate Eldredge, Apr 08 '22 at 07:11
wait, isn't `long int` 32 bit on most platforms ? I target x86 on Ubuntu — Huy Le, Apr 08 '22 at 07:11
@NateEldredge if the OP would have read just one section further in the webpage that is linked in the question, this is exactly the algortihm suggested there. — Jakob Stark, Apr 08 '22 at 07:14
@HuyLe: `long int` is 64 bits on x86-64 Linux for instance, and most other 64-bit platforms too. Windows x64 is one of the few exceptions that keeps it as 32 bits. — Nate Eldredge, Apr 08 '22 at 07:18
@NateEldredge I'd like a sample code that use 64 bit `long int` if possible. Thanks! — Huy Le, Apr 08 '22 at 07:22
You can look into the code of `strchr` [here](https://github.com/bminor/glibc/blob/master/string/strchr.c). It actually does what @NateEldredge suggested. But at the same time it looks for an ending zero byte. So that may be something to optimize. By the way the glibc code I linked switches between 32bit and 64bit implementations on compile time. — Jakob Stark, Apr 08 '22 at 07:23
@463035818_is_not_a_number: or if you're on a GNU system, glibc has `rawmemchr` assumes there will be a terminator. So it can loop exactly like `strlen` does, not having a 2nd exit condition based on either length or checking every byte for a `0` as well as the given character. Without that, `memchr` with `SIZE_MAX` or `PTRDIFF_MAX`is a good bet, if overread of up to 63 bytes or so past the terminator is allowed. (Vectorization / unrolling can assume the buffer is readable to the length you passed.) Or if you know the buffer size, pass that and it's totally safe, and faster than strchr. — Peter Cordes, Apr 08 '22 at 07:29
@JakobStark: That's the generic fallback implementation for ISAs where there isn't hand-written asm (e.g. MIPS: see [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/a/57676035)). These 4 or 8 byte at a time bithacks are relatively slow on ISAs with SIMD tightly-coupled to the core (so you can branch efficiently on it), like x86-64 where we can efficiently check 16 or 32 bytes at a time. Call libc `memchr` or `strchr`; any decent impl will be faster on modern CPUs than pure portable C without intrinsics or `__attribute__((vector_size(16))`. — Peter Cordes, Apr 08 '22 at 07:32
@PeterCordes `memchr` sounds exactly like what I need. I know the buffer length before, thanks! — Huy Le, Apr 08 '22 at 07:34
@JakobStark: https://code.woboq.org/userspace/glibc/sysdeps/aarch64/memchr.S.html is the AArch64 `memchr`, using SIMD. (There is a nosimd version in case that's faster on some CPUs). (Glibc's AArch64 `rawmemchr` just calls `memchr` with length -1, since it knows the implementation detail that it won't touch a page past the terminator.) Similarly, x86-64 always uses SIMD for memchr, e.g. with AVX2 if available (runtime dispatching during dynamic linking): https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memchr-avx2.S.html — Peter Cordes, Apr 08 '22 at 07:38

Peter Cordes · Accepted Answer · 2023-01-06T00:17:50.013

Use `std::memchr` to take advantage of libc's hand-written asm

It returns a pointer to the found byte, so you can get the length by subtracting. It returns NULL on not-found, but you said you can assume there will be a match, so we don't need to check except as a debug assert.

Even better, use rawmemchr if you can assume GNU functions are available, so you don't even have to pass a length. (Or not, since glibc 2.37 deprecates it.)

#include <cstring>

size_t lenx(const char *p, int c, size_t len)
{
    const void *match = std::memchr(p, c, len);  // old C functions take char in an int
    return static_cast<const char*>(match) - p;
}

Any decent libc implementation for a modern mainstream CPU will have a fast memchr implementation that checks multiple bytes at once, often hand-written in asm. Very similar to an strlen implementation, but with length-based loop exit condition in the unrolled loop, not just the match-finding loop exit condition.

memchr is somewhat cheaper than strchr, which has to check every byte for being a potential 0; an amount of work that doesn't go down with unrolling and SIMD. If data isn't hot in L1 cache, a good strchr can typically still keep up with available bandwidth, though, on most CPUs for most ISAs. Checking for 0s is also a correctness problem for arrays that contain a 0 byte before the byte you're looking for.

If available, it will even use SIMD instructions to check 16 or 32 bytes at once. A pure C bithack (with strict-aliasing UB) like the one you found is only used in portable fallback code in real C libraries (Why does glibc's strlen need to be so complicated to run quickly? explains this and has some links to glibc's asm implementations), or on targets where it compiles to asm as good as could be written by hand (e.g. MIPS for glibc). (But being wrapped up in a library function, the strict-aliasing UB is dealt with by some means, perhaps as simple as just not being able to inline into other code that accesses that data a different way. If you wanted to do it yourself, you'd want a typedef with something like GNU C __attribute__((may_alias)). See the link earlier in this paragraph.)

You certainly don't want a bithack that's only going to check 4 bytes at a time, especially if unsigned long is an 8-byte type on a 64-bit CPU!

If you don't know the buffer length, use `len = -1` in C11 / C++17

Use rawmemchr if available, otherwise use memchr(ptr, c, -1).
That's equivalent to passing SIZE_MAX.

See Is it legal to call memchr with a too-long length, if you know the character will be found before reaching the end of the valid region?

It's guaranteed not to read past the match, or at least to behave as if it didn't, i.e. not faulting. So not reading into the next page just like an optimized strlen, and probably for performance reasons not reading into the next cache line. (At least since C++17 / C11, according to cppreference, but real implementations have almost certainly been safe to use this way for longer, at least for performance reasons.)

The ISO C++ standard itself defers to the C standard for <cstring> functions; C++17 and later defer to C11, which added this requirement that C99 didn't have. (I also don't know if there are real-world implementations that violate that standard; I'd guess not and that it was more a matter of documenting a guarantee that real implementations were already doing.)

The POSIX man page for memchr guarantees stopping on a match; I don't know how far back this guarantee goes for POSIX systems.

Implementations shall behave as if they read the memory byte by byte from the beginning of the bytes pointed to by s and stop at the first occurrence of c (if it is found in the initial n bytes).

Without a guarantee like this, it's hypothetically possible for an implementation to just use unaligned loads starting with the address you pass it, as long as it's far enough from the ptr[size-1] end of the buffer you told it about. That's unlikely for performance reasons, though.

`rawmemchr()`

If you're on a GNU system, glibc has rawmemchr which assumes there will be a match, not taking a size limit. So it can loop exactly like strlen does, not having a 2nd exit condition based on either length or checking every byte for a 0 as well as the given character.

Fun fact: AArch64 glibc implements it as memchr(ptr, c, -1), or as strlen if the character happens to be 0. On other ISAs, it might actually duplicate the memchr code but without checking against the end of a buffer.

Glibc 2.37 will deprecate it, so apparently not a good idea for new code to switch to it now.

re "why cppreference says"..., see https://stackoverflow.com/a/47316040/273767 — Cubbi, Apr 08 '22 at 15:22
@Cubbi: Thanks, if I'd remembered that question I upvoted 5 years ago, I could have just linked that instead of researching and writing a whole section about it. :P Updated. — Peter Cordes, Apr 08 '22 at 15:32
@PeterCordes `rawmemchr` is going to be deprecated (2.37) so maybe add a warning. — Noah, Jan 05 '23 at 20:25

Fast generic strlen() implementation that can accept arbitrary terminator character

1 Answers1

Use std::memchr to take advantage of libc's hand-written asm

If you don't know the buffer length, use len = -1 in C11 / C++17

rawmemchr()

Use `std::memchr` to take advantage of libc's hand-written asm

If you don't know the buffer length, use `len = -1` in C11 / C++17

`rawmemchr()`