
I need to get the pointer to the terminating null char of a string.

Currently I'm using the simple MyString + strlen(MyString), which is probably fine out of context.

However, I'm uncomfortable with this solution, as I have to do it right after a string copy:

char MyString[32];
char* EndOfString;
strcpy(MyString, "Foo");
EndOfString = MyString + strlen(MyString);

So I'm looping over the string twice: the first time in strcpy and the second time in strlen.

I would like to avoid this overhead with a custom function that returns the number of copied characters:

size_t strcpylen(char *strDestination, const char *strSource)
{
    size_t len = 0;
    while( *strDestination++ = *strSource++ )
        len++;
    return len;
}

EndOfString = MyString + strcpylen(MyString, "Foobar");

However, I fear that my implementation may be slower than the CRT function the compiler provides (which may use assembly optimization or some other trick instead of a simple char-by-char loop). Or maybe I'm not aware of a standard built-in function that already does this?
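To make the interface concrete: what I'm after is the contract of POSIX stpcpy, which copies like strcpy but returns a pointer to the terminating null in the destination. A minimal sketch where it isn't available (named my_stpcpy here so it can't clash with the library symbol):

```c
#include <string.h>

/* strcpy-like copy that returns a pointer to the terminating '\0'
 * written into dst -- the same contract as POSIX stpcpy. Named
 * my_stpcpy to avoid clashing with the real symbol where it exists. */
static char *my_stpcpy(char *dst, const char *src)
{
    while ((*dst = *src) != '\0') {
        dst++;
        src++;
    }
    return dst; /* points at the '\0' just written */
}
```

With this, the copy and the end pointer come from a single pass: `EndOfString = my_stpcpy(MyString, "Foo");`.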


I've done some poor man's benchmarking, iterating 0x1FFFFFFF times over three algorithms (strcpy+strlen, my version of strcpylen, and user434507's version). The results are:

1) strcpy+strlen is the winner with just 967 milliseconds;

2) my version takes much more: 57 seconds!

3) the edited version takes 53 seconds.

So using two CRT functions instead of a custom "optimized" version in my environment is more than 50 times faster!
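For anyone who wants to reproduce this, here is a sketch of the kind of harness I used (the function and variable names are mine, and the iteration count is scaled down; absolute times will of course vary by machine and compiler):

```c
#include <string.h>
#include <time.h>

/* The custom single-pass copy from the question. */
static size_t strcpylen(char *dst, const char *src)
{
    size_t len = 0;
    while ((*dst++ = *src++))
        len++;
    return len;
}

/* Times `iterations` rounds of copy-then-find-end and returns seconds.
 * use_crt selects strcpy+strlen; otherwise the custom strcpylen. */
static double time_copies(int use_crt, long iterations)
{
    char buf[32];
    volatile size_t sink = 0; /* keeps the loops from being optimized away */
    clock_t t0 = clock();
    for (long i = 0; i < iterations; i++) {
        if (use_crt) {
            strcpy(buf, "Foobar");
            sink += strlen(buf);
        } else {
            sink += strcpylen(buf, "Foobar");
        }
    }
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```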

lornova
  • Try this experiment: take about 100K strings (maybe each varying in size between 10 chars and 5K chars) and store their size. Then, use this known size to compare your function against strcpy and the known string size. Display your results here. – San Jacinto Nov 13 '10 at 14:18
  • If your strings are not **LARGE** (like megabytes long) it really doesn't matter. – pmg Nov 13 '10 at 14:19
  • @pmg is right. Does this operation consume 10% or more of overall program time? (A few stackshots will tell you.) If not, you probably have bigger problems elsewhere. – Mike Dunlavey Nov 13 '10 at 14:24
  • @Mike: I don't care if in practice the performance gain is negligible. This is just an exercise in looking for the best performing solution. – lornova Nov 13 '10 at 14:34
  • I've said this repeatedly here. "Don't do it", "It's negligible", "Premature optimization is evil" kind of answers (which, by the way, are not answers but advice), should not be provided. They're patronizing, most of the time not even true, and other times, not interesting. – Dervin Thunk Nov 13 '10 at 14:50
  • @Dervin: shrug. John Carmack his sainted self could show up here asking how to optimize a one-over-square-root function, and he'd get some chirping in the comments that he's almost certainly wasting his time, producing unmaintainable code, and the rest of it. As long as it doesn't deter people from actually answering the question, it's not harmful. – Steve Jessop Nov 13 '10 at 14:58
  • @Lorenzo: sorry, I deleted my comment out from under you. *I* mentioned Microsoft, but only because user434507 was basing his answer from VC++2005, and that made me think *you* were on Windows. Perhaps incorrectly. If you want detailed help optimizing code, you must specify the platform, since (if nothing else) "the fastest way" will probably be different on different CPUs. – Steve Jessop Nov 13 '10 at 15:03
  • @Steve: heh, I like the Carmack analogy. I wonder how much people just "parrot" this meme, instead of thinking about it for a minute... – Dervin Thunk Nov 13 '10 at 15:04
  • @Dervin: nah, I think they think about it. They just disbelieve that the questioner is doing serious optimization work, and I guess also disbelieve that non-serious "optimization" is worthwhile as a learning experience. Hence it's serious optimization, or nothing. As an aside, stuff about "you must optimize the whole program, you cannot meaningfully optimize a routine in isolation" kind of leaves library-writers high and dry, since they basically want "acceptable worst case" optimization across a range of programs using their specific routine. – Steve Jessop Nov 13 '10 at 15:05
  • @Lorenzo: For what it's worth, you might want to use the interface of `stpcpy`. It returns a pointer to the nul terminator in the destination, which is exactly what you want. With this `strcpylen` interface, even with user434507's minimal code, you're calculating the length, then adding it back on. I doubt that it's expensive, just unnecessary when what you want is the final value of `strDestination`, and you have it right there. – Steve Jessop Nov 13 '10 at 15:14
  • @Lorenzo: @Dervin: A large subset of programmers truly don't understand the concept of solving the problem you have, so in the spirit of being helpful we sometimes offer guidance. I wasn't able to tell from your question that you weren't in that category. Sorry. – Mike Dunlavey Nov 13 '10 at 16:38
  • @Steve Jessop: good idea, I will do a `stpcpy`-like inline function. – lornova Nov 13 '10 at 16:58
  • @Mike: out of interest, what would you do if you were entrusted with the Microsoft codebase described by user434507 below, with a C implementation of `strcpy` and also an optimized assembler version? Would you ditch the assembler version, on the basis that you don't have a program in which it can be proven to be the bottleneck? Write various benchmarks specifically designed to have it as the bottleneck? Get a real app from a customer, and optimize that, perhaps modifying some of your lib code as a result? Quit your job as a compiler-vendor and get back to writing proper applications? ;-) – Steve Jessop Nov 13 '10 at 17:07
  • I ask because I am actually curious, not just to make a point. I think that optimizing library code involves techniques which really aren't appropriate to optimizing programs, and that the best advice for optimizing programs can't teach. So I wonder how totally opposed you are to spot optimizations outside the context of any program, and how, in practice, I'd have done my job developing tools for 8 years if I was equally opposed. – Steve Jessop Nov 13 '10 at 17:16
  • @Steve: I appreciate your question, and I think the answer depends on who you are. If you are writing a library, then some user somewhere may have a program in which `str*` happens to be the hotspot, so I would want it to be as fast as possible. For compiler optimization, same idea. Some program, somewhere, may need to burn up my code, so I'll optimize it all I can. However, as a user, which I carelessly assumed Lorenzo was, a whole different skill set is needed, namely how to find the problems one actually has. I suspect we don't disagree on that. – Mike Dunlavey Nov 13 '10 at 18:03
  • @Mike: right, don't disagree at all. I think the "wrong place" optimization questions on SO are split between people who really are looking in the wrong place, and people who are learning / messing about, wondering how they *could* optimize some simple function, if it turned out to be necessary. I'm not sure how much is truly transferable from optimizing very simple loops like this, to practical daily optimization problems. Possibly not much, because like you say this isn't the kind of thing you normally find is the problem. But I like that people wonder how it's done. – Steve Jessop Nov 13 '10 at 18:09
  • @Steve: ... and BTW, if part of your 8 years developing tools involved performance profiling, I bet we could have a fun exchange of views on that :) – Mike Dunlavey Nov 13 '10 at 18:10
  • @Mike: it did. For most of it we didn't have instrumenting profilers, although IIRC the debugger did an automated version of your preferred stackshot, which could identify hot lines. It was mostly good old instruction counters and wall clocks. Often we optimized primarily for size rather than speed (mobile devices were small then). I mentioned customer apps because most times we tried to use them as benchmarks, we ended up wanting to optimize their code. Worst case was the infamous, "um, you might want to use StringBuffer" incident when a customer complained about our String performance... – Steve Jessop Nov 13 '10 at 18:22
  • @Steve: Really? Your debugger was stackshot friendly? at truly random wall-clock times? and when the user actually wanted it? and the hot-ness of a line was the percent of stacks it was on? I'm impressed. That's what Zoom does. The only thing I would add is the utility of actually examining the program state at interrupt-time, and that not many interrupts are needed. I know people other than me have been doing this since forever. What I think is underappreciated is how really well it works. – Mike Dunlavey Nov 13 '10 at 18:41
  • "at truly random wall-clock times?" - probably not, to be honest, but I think at least an attempt. I don't remember all the details, I was a consumer of that tool rather than a producer, it's possible it wasn't as random as you'd want. And it wasn't all that configurable, as I remember it just sat over the whole run of the program taking stacks, which I think were only so many levels deep. Interrupts from the debugger were certainly capable of perturbing program timings and so on. So it wasn't perfect, but it told you where the program kept being when interrupted. – Steve Jessop Nov 13 '10 at 19:00
  • Oh, and I think you're right about program state, I think maybe it only snapshotted the thread of execution that was active at the time of the interrupt. So it didn't help much with code where the main problem was high contention on a mutex. On that system, if you repeatedly interrupted and looked at the thread you actually cared about, then there was a reasonable chance it would be waiting on the memory allocator mutex... – Steve Jessop Nov 13 '10 at 19:04
  • @Steve: So near, yet so far. Here are two people on SO who figured it out completely without my badgering: http://stackoverflow.com/questions/266373/one-could-use-a-profiler-but-why-not-just-halt-the-program/317160#317160 http://stackoverflow.com/questions/2473666/tips-for-optimizing-c-net-programs/2474118#2474118 For products, I know Zoom "gets it". LTProf is nearly there. I hear other products can do it, like oprofile, but I can't say. Sorry if I kidnapped the discussion. I get carried away. – Mike Dunlavey Nov 13 '10 at 19:10

8 Answers

size_t strcpylen(char *strDestination, const char *strSource)
{
    char* dest = strDestination;
    while( *dest++ = *strSource++ );
    return dest - strDestination;
}

This is almost exactly what the CRT version of strcpy does, except that the CRT version will also do some checking e.g. to make sure that both arguments are non-null.

Edit: I'm looking at the CRT source for VC++ 2005. pmg is correct, there's no checking. There are two versions of strcpy. One is written in assembly, the other in C. Here's the C version:

char * __cdecl strcpy(char * dst, const char * src)
{
        char * cp = dst;

        while( *cp++ = *src++ )
                ;               /* Copy src over dst */

        return( dst );
}
Eugene Smith
  • I believe `strcpy` does not validate its input (it's undefined behaviour if either argument is NULL) and the library can copy in groups of 8 bytes, for instance. – pmg Nov 13 '10 at 14:26
  • @pmg: Copying in groups could trigger UB - what if the string is zero length (first character is null) and it tries to copy seven bytes beyond that? – sharptooth Nov 13 '10 at 14:33
  • @pmg, it does on the debug CRT. – CMircea Nov 13 '10 at 14:34
  • @sharptooth, that can be checked. – CMircea Nov 13 '10 at 14:34
  • @iconiK: How could it be checked, I wonder? What if the null character is at the data segment border and the memory beyond it is not even mapped into the address space? – sharptooth Nov 13 '10 at 14:37
  • The VC2005 assembly version does 4 bytes at a time and starts by aligning the source pointer. You can't have the data segment border in the middle of a dword. – Eugene Smith Nov 13 '10 at 14:41
  • By the way, your solution is much better than mine, as it doesn't require incrementing the length on each iteration. How could I have made such a stupid mistake!? – lornova Nov 13 '10 at 14:43
  • @sharptooth If it checks 8 bytes at a time, those 8 bytes are aligned and thus are in every case either completely mapped or completely unmapped. The goal of the 8-byte check is to exploit the 64-bit width of the data bus. The check will thus never go beyond the last 64-bit entity containing the 0 byte. – Patrick Schlüter Nov 13 '10 at 15:29
  • @sharptooth: In the simplest case, if source and destination are both 8-aligned, you can read 8 bytes, check for 0 using some SIMD instruction or other, then either write 8 bytes and loop or else work out how many bytes to write up to the 0 and you're finished. If source is 8-aligned and destination isn't, same basic deal but the write is harder. If source isn't 8-aligned, do a few bytes one at a time to begin with until the remainder is 8-aligned. A data segment border is necessarily 8-aligned, so although you might read past the end of an object, you won't hit unreadable memory. – Steve Jessop Nov 13 '10 at 17:23
  • @Steve Jessop: very interesting, but I don't understand what happens if the last character (i.e. the null char) is not at the end of an 8-byte chunk. You would clearly not get unreadable memory, but when writing to the destination buffer, wouldn't you overwrite some memory past the last character? – lornova Nov 13 '10 at 17:33
  • @Lorenzo: before writing you would check whether there's a 0 byte anywhere in the block of 8 you've just read. If not, you write all 8 bytes. If so, you're nearly finished and you branch off to some code to write the remaining few bytes one at a time. I don't have the MS code to look at, btw, I'm just saying how it *could* be done. Actually, sounds like MS does it by 4, not by 8. – Steve Jessop Nov 13 '10 at 18:03
  • @Steve Jessop: OK, now that makes sense! So there is a machine language instruction to quickly detect a zero byte in a 4 (or 8) byte word? – lornova Nov 13 '10 at 19:38
  • @Lorenzo: I think there is in SSE, or at least that you can do it quickly in not many instructions. Even on CPUs where there isn't, it's generally faster to read an aligned word at a time than a byte at a time, so there's at least a potential gain in there. – Steve Jessop Nov 13 '10 at 19:49
  • @Lorenzo: You don't need a magic machine instruction. It's easily checked with bit arithmetic: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord – R.. GitHub STOP HELPING ICE Nov 14 '10 at 06:42

Hacker's Delight has a nice section on finding the first null byte in a C string (see chapter 6 section 1). I found (parts of) it in Google Books, and the code seems to be here. I always go back to this book. Hope it's helpful.
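The core trick from that chapter can be sketched in portable C. The has_zero_byte test below is the classic bit hack; the word_strlen sketch is my own illustration of how a library might use it, with memcpy for the aligned loads so the loads themselves stay well-defined C. Note the caveat in the comments: like real implementations, the final 4-byte load may read padding past the '\0', so it assumes the buffer extends to a 4-byte boundary.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero iff any byte of the 32-bit word v is 0x00. Subtracting 1
 * from each byte borrows through a zero byte and sets its high bit
 * exactly where ~v also has the high bit set. */
static int has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

/* strlen sketch that scans a word at a time once aligned.
 * ASSUMPTION: the buffer extends to a 4-byte boundary past the '\0'
 * (true for the zero-padded arrays used below), because the 4-byte
 * load may read a few bytes beyond the terminator within the word. */
static size_t word_strlen(const char *s)
{
    const char *p = s;
    while ((uintptr_t)p % 4 != 0) {   /* align byte-by-byte first */
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    for (;;) {
        uint32_t w;
        memcpy(&w, p, 4);             /* aligned 4-byte load */
        if (has_zero_byte(w))
            break;
        p += 4;
    }
    while (*p != '\0')                /* locate the byte inside the word */
        p++;
    return (size_t)(p - s);
}
```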

Dervin Thunk

Use strlcpy(), which will return the length of what it copied (assuming your size parameter is large enough).
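For platforms that lack it (it's a BSD extension, not standard C), a minimal sketch of strlcpy's documented contract — copy at most size-1 characters, always nul-terminate when size > 0, and return the full source length so truncation is detectable (named my_strlcpy here to avoid clashing with the real symbol):

```c
#include <stddef.h>

/* Minimal strlcpy-compatible sketch. Copies at most size-1 chars,
 * always nul-terminates when size > 0, and returns strlen(src)
 * (NOT the number copied), per the BSD strlcpy contract. */
static size_t my_strlcpy(char *dst, const char *src, size_t size)
{
    size_t i = 0;
    if (size > 0) {
        for (; i + 1 < size && src[i] != '\0'; i++)
            dst[i] = src[i];
        dst[i] = '\0';
    }
    while (src[i] != '\0')  /* finish counting the full source length */
        i++;
    return i;
}
```

Because the return value is the full source length, on truncation the end pointer is dst + size - 1, so the caller needs something like `EndOfString = MyString + (len < sizeof MyString ? len : sizeof MyString - 1);`.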

chrisaycock
  • Debatable whether this is actually worth the reduction in portability. From the page you quote: "These are not C standard library functions, but are available in several Unix operating systems, including BSD, Mac OS X and Solaris, with notable exception of Linux." – Tim Nov 13 '10 at 14:23
  • @Tim From the same article: "Many application packages and libraries include their own copies of these functions, including glib, rsync and the Linux kernel itself." – chrisaycock Nov 13 '10 at 14:26
  • if you are implementing a custom function anyway, why not use strlcpy as a template, and even use it if it's available (it's faster).. much better than to invent a third function signature for the same thing. – u0b34a0f6ae Nov 13 '10 at 14:28
  • @Chris Good spot: I hadn't seen that. Nevertheless, finding yourself tied into additional libraries which vary from platform to platform still seems like a poor choice to me. Also function names should not start with `str` for compatibility reasons (see the standard for more details). – Tim Nov 13 '10 at 14:31
  • @Tim Performance fanatics have to deal with duct-taped portability all the time. Just look at sendfile(); BSD and Linux have totally different versions. – chrisaycock Nov 13 '10 at 14:35
  • @Chris Absolutely, and I'm very grateful for the programs I use running fast. :-) I just think that some have a tendency to attempt to optimise where not strictly necessary, and lose out on other things, like supportability. "The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet." - Michael A. Jackson But don't let me stop you having fun! – Tim Nov 13 '10 at 14:40
  • +1 for this answer. Source code for strlcpy: http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/strlcpy.c?rev=1.11 – Fred Foo Nov 13 '10 at 16:22
  • @kaizer.se: definitely not, because `strlcpy` has a different meaning (and thus does additional unnecessary work) than `strcpy` + `strlen` (it is actually a safer version of `strncpy`). – lornova Nov 13 '10 at 16:54

You can try this:

char MyString[32];
const char *new_str = "Foobar";
size_t len = strlen(new_str);
memcpy(MyString, new_str, len + 1);   /* the +1 also copies the terminator */
char *EndOfString = MyString + len;

It makes sense only if new_str is large, because memcpy is much faster than the standard while( *dest++ = *strSource++ ); approach, but has extra startup costs.

ruslik
  • -1 as you will still have a double loop, once in `strlen` and once in `memcpy` – lornova Nov 15 '10 at 00:08
  • @Lorenzo can you at least try to benchmark it? Have you ever seen what the CRT versions of `strlen` and `memcpy` look like? We are talking about a superscalar processor here. – ruslik Nov 15 '10 at 00:18

Just a couple of remarks: if your function is not called very often, it may run faster from your code than from the C library, because your code is already in the CPU caches.

What your benchmark does is ensure that the library call is in the cache, and this is not necessarily the case in a real-world application.

Further, being inline could save even more cycles: compilers and CPUs prefer leaf function calls (one level of calls rather than several) for branch prediction and data pre-fetching.

It all depends on your coding style, your application, and where you need to save cycles.

As you can see, the picture is a bit more complex than what was presented above.

Jerome

I think you may be worrying unnecessarily here. It's likely that any possible gain you can make here would be more than offset by better improvements you can make elsewhere. My advice would be not to worry about this, get your code finished and see whether you are so short of processing cycles that the benefit of this optimisation outweighs the additional work and future maintenance effort to speed it up.

In short: don't do it.

Tim
  • Gosh. Lots of people voting me down. I must have hit a sore nerve somewhere. :-( It would be helpful if you were to leave a comment explaining why you disagree so vehemently. – Tim Nov 13 '10 at 20:53
  • I'm among the downvoters. Sorry for not having explained; I thought that, given the discussion on the question itself, the downvote was self-explanatory. I think your answer is off topic, as you are actually answering a different question (something like "should I bother to optimize this?"). – lornova Nov 13 '10 at 22:12
  • Thanks for the comment - much appreciated. :-) I'll take that into account in future. – Tim Nov 14 '10 at 09:56

Try memccpy() (or _memccpy() in VC 2005+). I ran some tests of it against strcpy + strlen and your custom algorithm, and in my environment it beat both. I don't know how well it will work in yours, though, since for me your algorithm runs much faster than you saw, and strcpy + strlen much slower (14.4s for the former vs. 7.3s for the latter, using your number of iterations). I clocked the code below at about 5s.

#include <string.h>   /* memccpy is POSIX (and C23); MSVC spells it _memccpy */

int main(void)
{
    char test_string[] = "Foo";
    char new_string[64];
    char *null_character = NULL;
    int i;
    int iterations = 0x1FFFFFFF;

    for(i = 0; i < iterations; i++)
    {
        null_character = memccpy(new_string, test_string, 0, 64);
        --null_character;
    }

    return 0;
}
vonmoltke
  • That's very nice, but unfortunately it doesn't work with Unicode strings (I'm actually using `wcscpy` and `wcslen`). – lornova Nov 15 '10 at 00:02

Check out sprintf.
http://www.cplusplus.com/reference/clibrary/cstdio/sprintf/

swatkat
  • Yes, but this involves parsing the format string; I doubt it could be faster. – lornova Nov 13 '10 at 14:17
  • @pmg: sprintf returns the number of copied characters. – lornova Nov 13 '10 at 14:18
  • @Lorenzo: I know ... I just doubt `EndOfString = MyString + sprintf(MyString, "%s", "foo");` is better than keeping `sprintf` out of the picture – pmg Nov 13 '10 at 14:24
  • Quick check on my machine: the `sprintf` version in my environment is 50 times slower than my custom function... – lornova Nov 13 '10 at 14:28
  • @Lorenzo, parsing a single "%s" takes constant time, while strlen is O(len). – Vovanium Nov 13 '10 at 14:31
  • @Vovanium: but my practical experiment still says it is 50 times slower. – lornova Nov 13 '10 at 14:59
  • @Lorenzo: I think it is for reasons other than parsing format strings. Possibly because of dealing (internally) with streams. – Vovanium Nov 13 '10 at 15:42
  • My experience is that it's 50% slower with string sizes around 800 bytes, and 100 times slower with string sizes around 10 bytes. The entire difference is entry overhead; as the string length approaches infinity the times should converge. – R.. GitHub STOP HELPING ICE Nov 14 '10 at 06:35