
I have two different versions of `strlen` that should do the same thing, but I'm not sure which one would be faster, more readable, or more energy efficient.

#include <stddef.h>  /* for size_t */

// version 1: count characters with a separate counter
size_t strlen(const char *str) {
   size_t retval;
   for (retval = 0; *str != '\0'; str++) retval++;
   return retval;
}

// version 2: walk a second pointer and return the pointer difference
size_t strlen(const char *str) {
   const char *s;
   for (s = str; *s; ++s)
      ;  /* empty body; s ends up on the terminating '\0' */
   return s - str;
}

Or would they just translate to the same assembly code?

Got To Figure
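A quick way to answer the "same assembly?" part is to compile both with optimizations and compare the output (as suggested in the comments below), or simply to run them against the library `strlen`. The following is only a sketch of the latter: `my_strlen_v1` and `my_strlen_v2` are hypothetical renames of the two versions above so they can coexist with the standard `strlen` in one file, it assumes a hosted environment with `<time.h>`, and the `clock()` loops give a ballpark comparison at best, not a real benchmark.

#include <assert.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical renames of the two versions above, so they can coexist
 * with the standard strlen in one translation unit. */
static size_t my_strlen_v1(const char *str) {
    size_t retval;
    for (retval = 0; *str != '\0'; str++) retval++;
    return retval;
}

static size_t my_strlen_v2(const char *str) {
    const char *s;
    for (s = str; *s; ++s)
        ;
    return s - str;
}

int main(void) {
    /* A long test string: 1 MiB of 'a' plus the terminating '\0'. */
    static char buf[1u << 20];
    memset(buf, 'a', sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    /* Correctness: both versions must agree with the library strlen. */
    assert(my_strlen_v1(buf) == strlen(buf));
    assert(my_strlen_v2(buf) == strlen(buf));

    /* Very rough timing.  Reading the pointer through a volatile object
     * keeps the compiler from hoisting the whole computation out of the
     * loops; clock() is coarse, so treat the numbers as ballpark only. */
    const char *volatile src = buf;
    size_t sink = 0;
    clock_t t0 = clock();
    for (int i = 0; i < 1000; i++) sink += my_strlen_v1(src);
    clock_t t1 = clock();
    for (int i = 0; i < 1000; i++) sink += my_strlen_v2(src);
    clock_t t2 = clock();

    printf("v1: %.3f s, v2: %.3f s (total length %zu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sink);
    return 0;
}

Compiled with optimizations (e.g. `gcc -O2`), both loops typically end up as a one-byte-per-iteration compare-and-increment, which is what the comments below discuss.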
  • The best is the *standard* `strlen` function. – Some programmer dude Jun 20 '18 at 08:16
  • When asking people to look at working code, you should probably use the Code Review site. – dandan78 Jun 20 '18 at 08:17
  • As for what assembly code the compiler would translate it to, build both variants with optimizations enabled and compare. – Some programmer dude Jun 20 '18 at 08:17
  • @Someprogrammerdude The Standard won't do for what I use them for. – Got To Figure Jun 20 '18 at 08:18
  • The standard `strlen` function should be highly optimized, and as it's the *standard* function it should be available everywhere. It seems that you should probably ask a very different question, explaining your *actual* problem and what you have tried to solve it. Don't forget to tell us *why* the standard function "won't do". – Some programmer dude Jun 20 '18 at 08:20
  • Both will translate to very similar code; there's no particular point in digging deeper into it. From a performance point of view both will be a bit worse than the standard implementation, and from a code-size point of view they will probably be within ±2 bytes of each other. If you were going for really minimal code, you would write them in pure assembly. So I can't figure out what you are actually trying to solve; "as is" the question makes little sense and you are just wasting time, pick one variant randomly and use it. (Like "better" in what sense?) – Ped7g Jun 20 '18 at 08:26
  • @Adminy What exactly doesn't the standard `strlen` function do that you need done? – fuz Jun 20 '18 at 08:32
  • Both functions run in O(n) time, where n is the length of `str`, i.e. linear in the length of the input string, so performance should be similar for long strings. Looking behind the O-notation, Version 1 does one comparison and two additions per iteration, while Version 2 does one comparison and one addition per iteration, plus one subtraction at the end. That gives 2n additions for Version 1 and n + 1 additions for Version 2, and n comparisons for both of them. – förschter Jun 20 '18 at 08:41
  • @joelfischerr yup, but modern compilers will optimize that out, so you can expect *n* comparisons and *n* additions only, nothing more. Code-size wise the second variant + clang-6.0.0 wins, but in terms of performance they are pretty much all equal (and, that said, all of them are sub-optimal compared to the standard one, which is probably using SSE/AVX to do about n/4 or n/8 comparisons using instructions operating over packs of values simultaneously). godbolt live used to compare code: https://godbolt.org/g/zyDj89 – Ped7g Jun 20 '18 at 08:52
  • Rather, the standard library strlen runs in "O(n/align + some)" time. glibc for example uses non-standard C to work with 32-bit chunks instead of 8-bit ones. Though if you compiled it with a standard compiler such as `gcc -std=c11`, glibc would invoke massive UB for these kinds of functions. (Likely they compile with `-fno-strict-aliasing` and various other such non-standard settings.) – Lundin Jun 20 '18 at 09:31
  • In fact it's `str++` + `retval++` vs `s++`. Any new compiler will optimize it, using just a `cmp byte ptr [base + register]` and an `inc register` inside the loop; the question is: do you have an optimizing compiler? – Tommylee2k Jun 20 '18 at 10:12
  • @Lundin: glibc's `strlen` implementations are hand-written in assembly, making it safe to do anything (and impossible for the optimizer to break anything, because there is no C definition to inline / link-time-optimize according to strict-aliasing rules, it's a pure blackbox.) e.g. for x86 with AVX2, the main loop checks 2 whole cache lines at a time with 4x 32-byte YMM vectors: https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strlen-avx2.S.html, using `vpminub` to get vectors with a `0` where any one of the 4 input vectors had a `0` instead of pcmpeqb/pmovmskb separately. – Peter Cordes Jun 20 '18 at 14:34
  • @Tommylee2k: Interestingly, gcc for x86 will actually use a pointer increment for one but an indexed addressing mode for the other (see Ped7g's Godbolt link). This could be relevant for uop throughput in a hyperthreading situation (sharing a core with another thread), see [Micro fusion and addressing modes](https://stackoverflow.com/q/26046634). Also, `cmp [mem],imm` might not be able to macro-fuse with a jcc, or might not be able to micro-fuse. Something like that, I forget. – Peter Cordes Jun 20 '18 at 14:39
  • @PeterCordes So all the people posting [this code](https://stackoverflow.com/questions/20021066/how-the-glibc-strlen-implementation-works) all over the internet are lying? Or has it been rewritten? – Lundin Jun 20 '18 at 15:52
  • @Lundin: Oops, yes glibc has a generic C implementation for platforms that don't have a hand-written asm version. But it has hand-written for the major platforms, including ARM32/64, PowerPC. Hmm, I don't see MIPS in the search box, and https://code.woboq.org/userspace/glibc/sysdeps/mips/ only has memcpy and some other things but not strlen. So I guess it's used there. It would be strict-aliasing safe if it loaded `long` using `memcpy` instead of pointer-casting, but that might hide the fact the pointer is aligned and not compile to an `lw` or `ld` (MIPS64 doubleword) instruction. – Peter Cordes Jun 20 '18 at 15:58
  • @Lundin: So probably it's safe because it can't inline from the library into code that stores with a different type. A non-inline function call boundary is a compiler memory barrier, and makes strict-aliasing "violations" invisible. (This is glibc code, so the set of compilers / platforms where it has to work is somewhat limited.) It could well be unsafe if you made a static library with link-time optimization that could LTO with the program using it. But glibc's build system doesn't do that. – Peter Cordes Jun 20 '18 at 16:02
  • @Ped7g I like your side by side comparison answer, sorry guys I should have mentioned I was trying to use this with a compiler that is told not to load the standard libraries. – Got To Figure Jun 20 '18 at 20:42
  • What architecture / compiler? If you care about strlen performance, you shouldn't use a generic implementation anyway, and definitely not a one-byte-at-a-time loop (unfortunately compilers are not yet good enough to replace this code with something more efficient). If you're writing a kernel, that might mean you can't use SIMD vector registers (which could speed this up by ~16x on typical x86 CPUs), but you can still do better than 1 byte per clock with glibc's generic implementation, for example (see Lundin's link above). – Peter Cordes Jun 20 '18 at 21:20
  • And even if you do want to ask about these specific implementations, then the answer will depend on exactly how your compiler compiles each. One might be better with one compiler on one platform (e.g. where an indexed memory operand has no extra cost and the compiler compiles it that way) or the other might be better on other compiler + CPU combinations if it happens to lead the compiler into producing code that is slightly faster. – Peter Cordes Jun 20 '18 at 21:23
  • @Ped7g Thanks for the explanation. I thought that there was still value in looking at how the code was written. – förschter Jun 21 '18 at 13:20
  • @Ped7g Could you post it as an answer, I'd really be happy with that as the answer. – Got To Figure Jun 21 '18 at 14:08
  • @joelfischerr well, there is certainly value in looking at how the code was written, and actually in writing reasonably clean code, because the compiler has only limited machine time for its work; the more the original source leads logically to efficient machine code, the more likely the optimizer will produce decent machine code. If the code is too complex/messy, the optimizer may fail to understand enough of it to remove the redundant parts within reasonable time/memory limits. But generally the compilers do a good-enough job. Then there's this thing called "profiling" for the rest of it. – Ped7g Jun 28 '18 at 11:53
  • Related: [Questions about the performance of different implementations of strlen](https://stackoverflow.com/q/34449407) for a basic SIMD strlen for x86. – Peter Cordes Jul 07 '18 at 00:48
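To make the word-at-a-time idea from Lundin's and Peter Cordes's comments concrete, here is a minimal sketch in the spirit of glibc's generic C `strlen` (not the actual glibc code, and `strlen_word_at_a_time` is an illustrative name): scan byte by byte up to an `unsigned long` boundary, then test one aligned word per iteration with the classic "has a zero byte" bit trick, loading through `memcpy` as suggested above to stay within strict aliasing. It assumes `CHAR_BIT == 8`, and, like the real thing, it may read a few bytes past the terminator within one aligned word; an aligned load can't cross a page boundary, but this is not strictly portable ISO C.

#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative word-at-a-time strlen in the spirit of glibc's generic C
 * version; not the actual glibc code. */
size_t strlen_word_at_a_time(const char *str) {
    const char *p = str;

    /* Go byte by byte until p is aligned for unsigned long, so the word
     * loads below never cross a page boundary. */
    while ((uintptr_t)p % sizeof(unsigned long) != 0) {
        if (*p == '\0')
            return (size_t)(p - str);
        p++;
    }

    /* Constants for the "does this word contain a zero byte?" bit trick:
     * ones = 0x0101...01, highs = 0x8080...80. */
    const unsigned long ones  = (unsigned long)-1 / UCHAR_MAX;
    const unsigned long highs = ones << (CHAR_BIT - 1);

    for (;;) {
        unsigned long word;
        memcpy(&word, p, sizeof word);        /* aliasing-safe word load */
        if ((word - ones) & ~word & highs) {  /* some byte in word is 0 */
            while (*p != '\0')                /* locate it exactly */
                p++;
            return (size_t)(p - str);
        }
        p += sizeof word;
    }
}

Finding the exact byte inside the final word could also be done with further bit tricks instead of the byte rescan; the rescan is kept here for clarity.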

1 Answer


Both of your functions have O(n) time complexity, so in theory they will take roughly the same time to compute. The standard `strlen` function's time complexity is also O(n), but it may be more efficient in practice. Have a look at this.

Taohidul Islam
  • You have a hidden constant in the time complexity of the first version: it does twice as many additions. That might be relevant if additions were expensive, but they generally are not, so it most likely won't matter. – förschter Jun 20 '18 at 09:11
  • @joelfischerr - You should not assume that the compiler translates the code literally. Check this question from yesterday: [Understanding what clang is doing in assembly, decrementing for a loop that is incrementing](https://stackoverflow.com/questions/50934651/understanding-what-clang-is-doing-in-assembly-decrementing-for-a-loop-that-is-i) – Bo Persson Jun 20 '18 at 11:20
  • The question was about the constant factor, not the complexity class. gcc compiles both of these about 16x slower (checking 1 byte at a time) than a good SIMD version on x86 (checking 16 bytes at a time with `pcmpeqb`/`pmovmskb`). – Peter Cordes Jun 20 '18 at 14:28
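For comparison, here is a minimal sketch of the SSE2 approach Peter Cordes mentions (`pcmpeqb`/`pmovmskb`, 16 bytes per check), written with intrinsics; `strlen_sse2` is an illustrative name. It assumes an x86/x86-64 compiler providing `<emmintrin.h>` and GCC/Clang's `__builtin_ctz`; aligning first means the vector loads can't cross a page boundary, though reading past the terminator is still not strictly portable ISO C, and this is nowhere near as tuned as glibc's hand-written asm.

#include <emmintrin.h>   /* SSE2 intrinsics: _mm_load_si128, _mm_cmpeq_epi8, ... */
#include <stddef.h>
#include <stdint.h>

/* Illustrative SSE2 strlen: checks 16 bytes per iteration. */
size_t strlen_sse2(const char *str) {
    const char *p = str;

    /* Handle the unaligned start byte by byte so the vector loads below
     * are 16-byte aligned and therefore can't cross a page boundary. */
    while ((uintptr_t)p % 16 != 0) {
        if (*p == '\0')
            return (size_t)(p - str);
        p++;
    }

    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        /* Aligned 16-byte load (movdqa). */
        __m128i chunk = _mm_load_si128((const __m128i *)p);
        /* pcmpeqb: bytes equal to 0 become 0xFF, others become 0x00. */
        __m128i eq = _mm_cmpeq_epi8(chunk, zero);
        /* pmovmskb: one mask bit per byte; nonzero means a 0 byte was found. */
        int mask = _mm_movemask_epi8(eq);
        if (mask != 0) {
            /* Index of the first zero byte within this 16-byte chunk.
             * __builtin_ctz is GCC/Clang; MSVC would use _BitScanForward. */
            unsigned first = (unsigned)__builtin_ctz((unsigned)mask);
            return (size_t)(p - str) + first;
        }
        p += 16;
    }
}

A production version would also unroll the loop and check several vectors per iteration, which is roughly what the AVX2 loop linked in the comments above does (4x 32-byte YMM vectors, two cache lines, per iteration).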