The function as written is neither optimal nor portable. That said, the authors of C89 included some badly-written rules which, if interpreted strictly, make C89 a much less powerful language than the earlier C dialects that existed on many platforms. The stated purpose of the rule is to keep a compiler which is given code like:
float *fp;
int i;

int foo(void)
{
    i++;
    *fp = 1.0f;
    return i;
}
from having to pessimistically assume that the write to *fp might affect i, despite the complete absence of anything to suggest that an object of type int might be affected. Code which used type punning for a variety of purposes, including chunking optimizations, was widespread when C89 was written, but most such code gave the compiler clear indications that aliasing was going to occur. Generally, if an object was going to be modified through a foreign-type pointer between two "normal" accesses, one or both of the following would occur between those accesses:
The object's address would be taken.
A pointer would be converted from the object's type to another type.
Aside from the obvious case where a pointer is used to access an object of
its precise type, the Standard mostly identifies cases where a compiler
would have to allow for aliasing even though it would not otherwise be
obvious that it was possible (e.g. between an "int" and a pointer of type
"unsigned*", or between anything and a pointer of type "char*"). Given the
rationale, I think the authors intended to focus on making sure that
compiler writers handled the cases where there would be no reason to expect
aliasing, but didn't think compiler writers needed to be told how to
identify cases where aliasing was obviously likely.
Chunking optimizations will be safe on compilers which recognize that the
address-of and casting operators imply that cross-type aliasing is likely,
provided that all use of the pointer resulting from a cast is performed
before the next access using the non-cast pointer--a requirement that most
chunking code would generally meet. Unfortunately, there is no standard for
"sane-compiler C", and gcc treats the fact that the Standard's authors
didn't require compilers to handle the obvious-aliasing cases as
justification for ignoring them.
Still, the performance advantages from chunking optimizations may outweigh
the performance costs of -fno-strict-aliasing, especially if code uses
restrict qualifiers when appropriate. There are some situations, mainly
involving global variables, where restrict isn't sufficient to enable
useful optimizations; those could be handled by an aliasing mode which
limits analysis to objects of static or automatic duration (like the
objects in the rationale's example), but gcc offers no such mode.
BTW, I'm not sure what instruction timings are like on modern x86 processors, but on some ARM variants compilers would have a chance at producing optimal long-string code from something like:
uint32_t x01000000 = 0x01000000;

uint64_t *search(uint64_t *p)
{
    uint64_t n;
    uint32_t a, b;
    uint32_t v = x01000000; // See note below
    do
    {
        n = *p++;
        a = (uint32_t)n;
        b = n >> 32;
        if (a >= v || (a << 8) >= v || (a << 16) >= v || (a << 24) >= v ||
            b >= v || (b << 8) >= v || (b << 16) >= v || (b << 24) >= v)
            return p;
    } while(1);
}
Figuring out which comparison was responsible for breaking out of the loop would take extra time, but keeping such considerations out of the loop may allow the loop itself to be more efficient.
Many ARM processors have an instruction which can compare a shifted value with a register; compilers sometimes need a little help to realize that 0x01000000 should be kept in a register (a compare-with-constant instruction exists, but
doesn't include a "free" shift of the register being compared), but with help they can find the compare-with-shift. I haven't yet found a way to convince a compiler to generate optimal code for the ARM7-TDMI, which would be equivalent to:
search:
    mov   r1,#0x01000000
lp:
    ldmia r0!,{r2,r3}
    cmp   r1,r2
    cmphi r1,r2,asl #8
    cmphi r1,r2,asl #16
    cmphi r1,r2,asl #24
    cmphi r1,r3
    cmphi r1,r3,asl #8
    cmphi r1,r3,asl #16
    cmphi r1,r3,asl #24
    bhi   lp
    bx    lr
That would take 15 cycles per eight bytes; it could be adapted to take 25 cycles per sixteen bytes. A loop that processes eight bytes individually would take 42 cycles; unrolled to sixteen bytes would be 82 cycles. The best loop I've seen compilers generate for the uint64_t based code would be 22 cycles for eight bytes--almost half again as long as the optimal code, but still about twice as fast as a version using bytes.