38

What are the problems of a zero-terminated string that length-prefixed strings overcome?

I was reading the book Write Great Code vol. 1 and I had that question in mind.

Peter Mortensen
  • Maybe if the terminating zero is missing you can have unterminated strings, which could lead to buffer overflow exploit opportunities. With a length-prefixed string you will only read the chars in the string and therefore can't overflow the buffer and just read random memory. – Sam Holder May 26 '15 at 12:26
  • If you're writing C++ as the tag suggests, then don't use either. Use `std::string`. – Mike Seymour May 26 '15 at 12:28
  • Zero-terminated strings require [Shlemiel the painter's algorithms](http://www.joelonsoftware.com/articles/fog0000000319.html). – nwp May 26 '15 at 13:05
  • Um, measuring its length! – Steve May 26 '15 at 13:07
  • @nwp: "Require" is a bit of an overstatement, but they do tend to make it way too easy to invoke Shlemiel. – cHao May 26 '15 at 13:09
  • See also: http://programmers.stackexchange.com/questions/237286/why-do-c-arrays-not-keep-track-of-their-length/237289#237289 – Wayne Conrad May 26 '15 at 16:04
  • An inability to hold the ASCII `NUL` character, for one. – RBarryYoung May 26 '15 at 20:40

6 Answers

32

One problem is that with zero-terminated strings you have to keep finding the end of the string repeatedly. The classic example where this is inefficient is concatenating into a buffer:

char buf[1024] = "first";
strcat(buf, "second");
strcat(buf, "third");
strcat(buf, "fourth");

On every call to strcat the program has to start from the beginning of the string and find the terminator to know where to start appending. This means the function spends more and more time finding the place to append as the string grows longer.

With a length-prefixed string the equivalent of the strcat function would know where the end is immediately, and would just update the length after appending to it.
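A rough sketch of what such a hypothetical append could look like (the `lpstr` type and `lp_append` function are invented for illustration; they are not library facilities):

#include <string.h>

/* Hypothetical length-prefixed string; layout and names are illustrative only. */
struct lpstr {
    size_t len;        /* current length, kept up to date on every append */
    char   data[1024]; /* character storage (bounds checking omitted for brevity) */
};

/* The stored length already says where the end is, so nothing gets re-scanned. */
void lp_append(struct lpstr *dst, const char *src, size_t src_len)
{
    memcpy(dst->data + dst->len, src, src_len);
    dst->len += src_len;
}

Each call costs only the copy itself, no matter how long the destination has already grown.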

There are pros and cons to each way of representing strings and whether they cause problems for you depend on what you are doing with strings, and which operations need to be efficient. The problem described above can be overcome by manually keeping track of the end of the string as it grows, so by changing the code you can avoid the performance cost.
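For example, a variant of the snippet above that finds the end once and then keeps a pointer to it only ever scans each piece being appended, never the growing result (a plain-C sketch; bounds checking omitted):

#include <string.h>

void build(void)
{
    char buf[1024] = "first";
    char *end = buf + strlen(buf);        /* locate the end once */

    const char *parts[] = { "second", "third", "fourth" };
    for (size_t i = 0; i < sizeof parts / sizeof parts[0]; ++i) {
        size_t n = strlen(parts[i]);      /* scan only the piece being appended */
        memcpy(end, parts[i], n + 1);     /* copy it, including its terminator */
        end += n;                         /* remember the new end */
    }
}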

Jonathan Wakely
  • Maybe this is a foolish question, but how can `strcat()` work with _a length-prefixed string_? I mean, are there any alternate versions? `C` does not operate _natively_ on that aspect, right? – Sourav Ghosh May 26 '15 at 12:59
  • It doesn't work with length-prefixed strings, because C doesn't use length-prefixed strings. I'm talking about a hypothetical `strcat` in a world where C uses length-prefixed strings. In that world `strcat` would work with length-prefixed strings, which would mean it doesn't need to find the end of the string before appending. – Jonathan Wakely May 26 '15 at 13:03
  • It doesn't help that `strcat` returns the "wrong" value. It could have been defined to return a pointer to the nul terminator of `destination` instead of the start of it. But that would still only help in cases where the concatenations occur all together: in general you'd still need to store the length or the address of the end of the string in order to avoid measuring it. – Steve Jessop May 26 '15 at 14:25
  • Overhead differences? For 1 byte that is easy to understand (and 1 reserved code point) you get unbounded length. With length-based, either a hard max size (overhead grows with max size), a hardware-dependent max size (ditto overhead), or a complex size encoding is needed (-1 means the length is twice as many bytes, or a negative value counts the number of bytes, etc.). – Yakk - Adam Nevraumont May 26 '15 at 16:15
  • @SteveJessop: IMHO, the arguments to a proper `strcat` would have included the start address of each string, and the ending address of the destination, and the method should have returned the address past the last byte of text written [i.e. the address of the null byte unless there was no room for it]. Such a method could have been safely used to append a string of unknown length to another of unknown length, without having to determine the length of either string (and if the buffer only had space for N additional characters, the method need only scan N bytes of the string to append). – supercat May 26 '15 at 22:31
  • Is it common to profile a C++ program and find that the bottleneck is string concatenation? – Superbest May 27 '15 at 03:09
  • It's easy enough to write your own function that resembles strcat but returns a pointer to the final null, so that you can promptly turn around and use that pointer in the next concatenation operation. As you say, though, only useful if the concatenations come together. I used to do this back when I wrote in vanilla C. – Jay May 27 '15 at 06:13
  • @Superbest: no, it isn't. However, the C and C++ design principles are not, "publish standard functions, let people use them, realise they're sometimes a bottleneck, and then go back and change the design of the standard functions and release a breaking change to the standard". Hopefully for obvious reasons ;-) – Steve Jessop May 27 '15 at 06:50
  • @Yakk: "overhead grows with max size" isn't a practical problem. `size_t` is always sufficient to store the length of a string, and is similar to the overhead you get anyway from the memory allocator, or from storing a pointer to the string in order to use it. The fact that there is this overhead does sometimes matter, but actually it isn't the systems with large max size that notice it, it's the systems with small max size and very little memory that don't want every string to be a few bytes bigger. – Steve Jessop May 27 '15 at 06:54
  • @Steve it was when null-terminated strings were chosen. The Pascal-style string only used one byte, and has a max string length of 255 as a result. Using size_t means you need to reformat to serialize (as the other end, or storage, can have a different size_t), a slight annoyance. 9 bytes to store `'x'` seems excessive: note that not all strings are individually allocated. With `8` bytes I can store 0-7 length strings under null termination (8 length if I am willing to forgo the null for max length), and zero under yours. – Yakk - Adam Nevraumont May 27 '15 at 10:19
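As Steve Jessop and Jay describe in the comments above, one can write a strcat-like helper that returns a pointer to the new terminator, so repeated appends never re-scan the destination. A minimal sketch (the function name is invented):

#include <string.h>

/* Appends 'src' at 'dst_end', which must point at the current terminator,
 * and returns a pointer to the new terminator. Bounds checking omitted. */
char *append_str(char *dst_end, const char *src)
{
    size_t n = strlen(src);
    memcpy(dst_end, src, n + 1);   /* copy the text and its terminator */
    return dst_end + n;            /* hand back the new end */
}

Usage would look like `char buf[1024] = ""; char *end = buf; end = append_str(end, "first"); end = append_str(end, "second");`, which, as noted above, only helps when the concatenations happen together.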
30

One problem is that you cannot store null characters (value zero) in a zero-terminated string. This makes it impossible to store some character encodings as well as encrypted data.

Length-prefixed strings do not suffer that limitation.
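A small self-contained sketch of the limitation (the data here is made up): the standard string functions stop at the first zero byte, while an explicitly tracked length still covers the whole payload.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* 11 bytes of payload containing embedded zero bytes. */
    const char data[] = { 'k', 'e', 'y', '\0', 'v', 'a', 'l', 'u', 'e', '\0', '!' };
    size_t real_len = sizeof data;                      /* 11, because we track it ourselves */

    printf("strlen() sees %zu bytes\n", strlen(data));  /* prints 3: stops at the first zero */
    printf("actual payload is %zu bytes\n", real_len);  /* prints 11 */
    return 0;
}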

Galik
  • That's one problem, it's not the only one. – Jonathan Wakely May 26 '15 at 12:41
  • @JonathanWakely Perhaps but it is the only one I can think of. – Galik May 26 '15 at 12:44
  • @JonathanWakely At the time I thought it was the only problem, now I am slapping my forehead after reading your answer. I should have got that! (I'll amend the wording) – Galik May 26 '15 at 12:56
  • @JonathanWakely It is ironic to comment here "That's one problem" when your answer likewise does not address this issue. The inability of a C string to contain a null character (other than at the end) is a functional limitation, as Galik answers - there is no work-around with a C string. The problem of inefficiency, as in your answer, though true, is a run-time weakness and not a functional limitation. – chux - Reinstate Monica May 26 '15 at 14:12
  • @chux, I wasn't trying to address all the problems, just listing one that hadn't been mentioned (at the time I wrote it). This answer originally said "**The** problem is..." and that's what I commented on. The question does not say "what are the functional problems", it asks what the problems are. Both this answer and mine list problems. – Jonathan Wakely May 26 '15 at 15:51
  • DOS style strings did not have that problem either. (I can't remember how to output `That'll be $5 then` in a DOS console. Possibly it could only be done with `write` - which (oh irony) treats the text as a Pascal string.) – Jongware May 26 '15 at 20:35
  • I'd be (genuinely!) interested to see a character encoding that A) uses `char` as base type, and B) actually uses `\000` as anything other than terminator. If you're thinking of e.g. UTF-16, that would use `char16_t`... – DevSolar May 27 '15 at 10:06
  • @DevSolar Apparently `UTF-8` and (original)`ASCII` suffer from this limitation: https://en.wikipedia.org/wiki/Null-terminated_string#Character_encodings For this reason `Java` uses `Modified UTF-8` for its `JNI` when passing strings to `C++`. – Galik May 27 '15 at 10:19
  • @Galik: Since both ASCII and UTF-8 *define* the `\000` value as the (terminating) NULL, that's "works as designed", basically. I was wondering if you know an encoding I did not (since I work with character encodings basically every day). Phrased differently: If you need to store a `\000` anywhere else but at the end, you're not storing a *string*, you're storing a byte array (which you can still do in C, you just can't use string functions on it and expect it to work). – DevSolar May 27 '15 at 10:55
  • @DevSolar Reading the linked document, the *null character* in `UTF-8` is not defined as a string terminator and a perfectly valid `UTF-8` string could contain several *null characters*. However, given that it has no character value, it is often (by convention) used as the end-of-string marker (but not always, as in the case of `Java JNI`). Also the principle remains that we are limited in how we choose to encode strings when using *zero* as a *terminator* in a way that we are not limited when using *length-prefixed* strings. – Galik May 27 '15 at 11:38
  • @DevSolar Plus your restriction to single byte character encoding is a little bit artificial. Using larger character sizes is one way to deal with the problem *null terminated* (single byte) strings impose (though not entirely). However this problem is not encountered with *length-prefixed* strings. I can store a `UTF-16` string in a `char*` array as long as I also store its length somewhere. – Galik May 27 '15 at 11:39
23

First a clarification: C++ strings (i.e. std::string) weren't required to end with zero until C++11. They always provided access to a zero-terminated C string, though.

C-style strings end with a 0 character for historical reasons.

The problems you're referring to are mainly bound to security issues: zero-terminated strings need to have a zero terminator. If they lack it (for whatever reason), the string's length becomes unreliable and they can lead to buffer overrun problems (which a malicious attacker can exploit by writing arbitrary data in places where it shouldn't be; DEP helps in mitigating these issues but it's off-topic here).
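A defensive sketch of the read side (the `bounded_length` helper is hypothetical, not a standard function): before treating untrusted bytes as a C string, scan only within a known capacity instead of letting `strlen` run off the end of the buffer.

#include <stdio.h>

/* Returns 1 and sets *out_len if a terminator exists within 'cap' bytes,
 * otherwise returns 0 instead of reading past the buffer. */
int bounded_length(const char *data, size_t cap, size_t *out_len)
{
    for (size_t i = 0; i < cap; ++i) {
        if (data[i] == '\0') { *out_len = i; return 1; }
    }
    return 0;   /* no terminator found: do not treat this as a string */
}

int main(void)
{
    char untrusted[8] = { 'h', 'e', 'l', 'l', 'o', '!', '!', '!' };  /* no '\0' */
    size_t len;
    if (bounded_length(untrusted, sizeof untrusted, &len))
        printf("string of length %zu\n", len);
    else
        puts("rejected: missing terminator");
    return 0;
}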

Marco A.
  • _It has to be noted that is quite quick to scan through a string and check for a null terminator as a stopping point_ Surely that depends how long the string is, if you have to keep doing that check on every operation it can get expensive. _and that C treats strings as an array of integers._ Huh? – Jonathan Wakely May 26 '15 at 12:40
  • @JonathanWakely I meant that C was born with lower-level issues in mind. Not sure how to say that explicitly though. – Marco A. May 26 '15 at 12:43
  • @MarcoA. When you say "stack overflow", do you mean "buffer overruns"? – Mike Weir May 26 '15 at 13:06
  • @PhoenixX_2 Right, too much stackoverflow and I'm inserting it everywhere :) – Marco A. May 26 '15 at 13:10
  • *"First a clarification: C++ strings (i.e. std::string) aren't zero-terminated"* Actually they are, at least since C++11, the standard mandates that a call to `str()` returns the same memory buffer. See http://stackoverflow.com/questions/6077189/will-stdstring-always-be-null-terminated-in-c11 – Manu343726 May 26 '15 at 19:14
  • @Manu343726 Good point, I'm updating the answer with some details. Thanks. – Marco A. May 26 '15 at 19:25
  • Length-prefixed strings may be more resistant to read overruns than zero-terminated strings, but in the general case they're just as vulnerable to write overruns as zero-terminated strings are. – Mark May 26 '15 at 21:35
  • @Mark: One could, with at most one extra byte of overhead, have prefixed strings indicate e.g. "64 byte buffer holding 27-byte string". Such a thing would make it very easy for string methods to guard against buffer overruns without significant extra work client-side. – supercat May 26 '15 at 22:35
21

It is best summarized in The Most Expensive One-byte Mistake by Poul-Henning Kamp.

  1. Performance Costs: It is cheaper to manipulate memory in chunks, which cannot be done if you're always having to look for the NULL character. In other words, if you know beforehand that you have a 129-character string, it would likely be more efficient to manipulate it in sections of 64, 64, and 1 bytes, instead of character by character (a small sketch follows this list).

  2. Security: Marco A. already hit this pretty hard. Over- and under-running string buffers is still a major route for attacks by hackers.

  3. Compiler Development Costs: Big costs are associated with optimizing compilers for null-terminated strings, optimizations that would have been easier with an address-and-length format.

  4. Hardware Development Costs: Hardware development costs are also large for string-specific instructions associated with null-terminated strings.
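As a small sketch of the first point (both function names are invented for illustration): when the length is known up front, the whole copy can be handed to memcpy, which is free to move the data in word- or vector-sized chunks, whereas a terminator-scanning copy has to test every byte as it goes.

#include <string.h>

/* Length known in advance: one bulk move, then terminate. */
void copy_known_length(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Length unknown: one byte copied and one terminator test per iteration. */
void copy_scanning(char *dst, const char *src)
{
    while ((*dst++ = *src++) != '\0')
        ;
}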

Christian Sarofeen
5

A few more bonus features that can be implemented with length-prefixed strings:

  1. It's possible to have multiple styles of length prefix, identifiable through one or more bits of the first byte identified by the string pointer/reference. In exchange for a little extra time determining string length, one could e.g. use a single-byte prefix for short strings and longer prefixes for longer strings. If one uses a lot of 1-3 byte strings, that could save more than 50% on overall memory consumption for such strings compared with using a fixed four-byte prefix; such a format could also accommodate strings whose length exceeded the range of 32-bit integers (a minimal decoding sketch appears at the end of this answer).

  2. One may store variable-length strings within bounds-checked buffers at a cost of only one or two bits in the length prefix. The number N combined with the other bits would indicate one of three things:

    1. An N-byte string

    2. (Optional) An N-byte buffer holding a zero-length string

    3. An N-byte buffer which, if its last byte B is less than 248, holds a string of length N-B-1; if that byte is 248 or more, the preceding B-247 bytes would store the difference between the buffer size and the string length. Note that if the length of the string is precisely N-1, the string will be followed by a NUL byte, and if it's less than that, the byte following the string will be unused and could be set to NUL.

    Using such an approach, one would need to initialize string buffers before use (to indicate their length), but would then no longer need to pass the length of a string buffer to a routine that was going to store data there.

  3. One may use certain prefix values to indicate various special things. For example, one may have a prefix that indicates that it is not followed by a string, but rather by a string-data pointer and two integers giving buffer size and current length. If methods that operate on strings call a method to get the data pointer, buffer size, and length, one may pass such a method a reference to a portion of a string cheaply provided that the string itself will outlive the method call.

  4. One may extend the above feature with a bit to indicate that the string data is in a region that was generated by malloc and may be resized if needed; additionally, one could safely have methods that sometimes return a dynamically-generated string allocated on the heap, and sometimes return an immutable static string, and have the recipient perform a "free this string if it isn't static".

I don't know if any prefixed-string implementations implement all those bonus features, but they can all be accommodated for very little cost in storage space, relatively little cost in code, and less cost in time than would be required to use NUL-terminated strings whose length was neither known nor short.
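As an illustration of the first point, here is one possible decoding of a variable-length prefix (the concrete scheme is an assumption made for this sketch, not necessarily the one the answer has in mind): a first byte below 0x80 is the length itself, otherwise its low 7 bits plus the next three bytes form a 31-bit length.

#include <stddef.h>
#include <stdint.h>

/* Returns the string length and sets *prefix_bytes to how many bytes the
 * prefix itself consumed (1 for short strings, 4 for long ones). */
size_t read_length(const uint8_t *p, size_t *prefix_bytes)
{
    if (p[0] < 0x80) {          /* short form: lengths 0-127 in a single byte */
        *prefix_bytes = 1;
        return p[0];
    }
    *prefix_bytes = 4;          /* long form: 31-bit big-endian length */
    return ((size_t)(p[0] & 0x7F) << 24)
         | ((size_t)p[1] << 16)
         | ((size_t)p[2] << 8)
         |  (size_t)p[3];
}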

supercat
-19

What are the problems of a zero-terminated string that length-prefixed strings overcome?

None whatsoever.
It's just eye candy.

Length-prefixed strings have, as part of their structure, information on how long the string is. If you want to do the same with zero-terminated strings you can use a helper variable:

lpstring = "foobar"; // saves '6' somewhere "inside" lpstring

ztstring = "foobar";
ztlength = 6;        // saves '6' in a helper variable

Lots of C library functions work with zero-terminated strings and cannot use anything past the '\0' byte. That's an issue with the functions themselves, not the string structure. If you need functions which deal with zero-terminated strings with embedded zeroes, write your own.
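For instance, a minimal sketch of the "write your own" route (the `counted_str` type and `count_byte` function are invented for illustration): carry the length next to the pointer and iterate by count, so embedded zero bytes are no obstacle.

#include <stddef.h>

struct counted_str {
    const char *data;
    size_t      len;   /* the helper variable, carried along with the data */
};

/* Counts occurrences of 'c', including bytes past any embedded '\0'
 * that strlen()/strchr()-based code would never reach. */
size_t count_byte(struct counted_str s, char c)
{
    size_t n = 0;
    for (size_t i = 0; i < s.len; ++i)
        if (s.data[i] == c)
            ++n;
    return n;
}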

pmg
  • Really? None? Other than the ability to handle Unicode in any encoding other than UTF-8 (with all the performance issues with certain string operations that UTF-8 has). Or the fact that any desynchronisation between your string and your helper variable can lead to security issues. – Phylogenesis May 26 '15 at 13:55
  • This answer is so wrong it makes me angry. Buffer overflows and encrypted data, if nothing else. – Millie Smith May 26 '15 at 15:29
  • Yes, it's a pithy and flippant answer that someone with nearly 60k reputation should know is unhelpful. – Phylogenesis May 26 '15 at 15:41
  • I understand the answer is unhelpful (especially since the question is also tagged C++). However by providing the length of the string in another variable you effectively make zero-terminated strings capable of everything length-prefixed strings are capable of with the same performance. – pmg May 26 '15 at 15:46
  • At that point it's not a null-terminated string. It's a string prefixed by a length. And good luck passing that thing around. – Millie Smith May 26 '15 at 16:00