
As much as I love C and C++, I can't help but scratch my head at the choice of null terminated strings:

  • Length-prefixed (e.g. Pascal) strings existed before C.
  • Length-prefixed strings make several algorithms faster by allowing constant-time length lookup.
  • Length-prefixed strings make it more difficult to cause buffer overrun errors.
  • Even on a 32 bit machine, if you allow the string to be the size of available memory, a length-prefixed string is only three bytes wider than a null-terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory, making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here.
  • Pretty much every other language (e.g. Perl, Pascal, Python, Java, C#, etc.) uses length-prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings.
  • C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null-terminated strings are still pervasive. This is also imperfect because it requires heap allocation.
  • Null-terminated strings have to reserve a character (namely, null) which cannot exist in the string, while length-prefixed strings can contain embedded nulls.

Several of these things have come to light more recently than C, so it would make sense for C not to have known of them. However, several were plainly established well before C came to be. Why would null-terminated strings have been chosen instead of the obviously superior length prefixing?

EDIT: Since some asked for facts (and didn't like the ones I already provided) about my efficiency point above, here is where it stems from:

  • Concat using null-terminated strings requires O(n + m) time complexity. Length prefixing often requires only O(m).
  • Length using null-terminated strings requires O(n) time complexity. Length prefixing is O(1).
  • Length and concat are by far the most common string operations. There are several cases where null-terminated strings can be more efficient, but these occur much less often.

From answers below, these are some cases where null terminated strings are more efficient:

  • When you need to cut off the start of a string and pass it to some method. You can't really do this in constant time with length prefixing, even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules.
  • In some cases where you're just looping through the string character by character, you might be able to save a CPU register. Note that this works only when you haven't dynamically allocated the string (because then you'd have to free it, which requires using the saved CPU register to hold the pointer you originally got from malloc and friends).

None of the above are nearly as common as length and concat.

There's one more asserted in the answers below:

  • You need to cut off the end of the string

but this one is incorrect -- it takes the same amount of time for null-terminated and length-prefixed strings. (Null-terminated strings just stick a null where you want the new end to be; length-prefixed strings just subtract from the prefix.)

Billy ONeal
    I always thought it was a rite of passage for all C++ programmers to write their own string library. – Juliet Dec 11 '10 at 20:22
  • 1
    @Juliet: Lol -- this is true. But that doesn't mean they should actually *use* their string library in production code. I'll stick with the standard bits TYVM :) – Billy ONeal Dec 11 '10 at 20:26
  • 3
    @Juliet: then you start wondering what your application would look like if you needed to care about a different string implementation for every library it depends on. – jweyrich Dec 11 '10 at 20:53
  • 36
    What's this about expecting rational explanations now. I suppose you'll want to hear a rationale for x86 or DOS next? As far as I'm concerned, the worst technology wins. Every time. And the worst string representation. – jalf Dec 11 '10 at 21:09
  • 2
    @jalf: 1. x86 won because it was cheaper, not because of any technical reason. (But that's another argument) 2. Length prefixing won everywhere but C. Don't see (ha!) how that's a win for null termination. – Billy ONeal Dec 11 '10 at 21:12
  • Even large systems built with C often create their own string data structure that stores the length next to the bytes, and builds a manipulation library around it. Win NT's UNICODE_STRING for example. – Ben Zotto Dec 11 '10 at 21:23
  • 5
    Why do you claim length prefixing strings are superior? After all, C became popular because it used null-terminated strings, which set it apart from the other languages. – Daniel C. Sobral Dec 12 '10 at 04:17
  • 52
    @Daniel: C became popular because it's a simple, efficient, and portable representation of programs executable on Von Neumann machines, and because it was used for Unix. It certainly isn't because it decided to use null terminated strings. If it was a good design decision, people would have copied it, and they haven't. They've certainly copied pretty much everything else from C. – Billy ONeal Dec 12 '10 at 04:33
  • 1
    @calavera: Haha -- holy war isn't too bad if people were actually trying to attack the points above. It's the "it must be the right answer because C did it" answers that are extremely annoying. No matter how good any given system is, there are going to be parts that suck. Just wish they would realize that it's entirely possible to rip on one of C's attributes and like C itself. Liking a language doesn't mean you like everything. (same goes for any "it must be the answer because X does it", replacing X with "C", "Linus", etc.) – Billy ONeal Dec 12 '10 at 16:59
  • 4
    `Why would null terminated strings have been chosen instead of the obviously superior length prefixing?` I honestly don't see how length prefixed strings are `obviously superior`. There are clear drawbacks and advantages to both choices, so a word like `superior` is meaningless. – Thomas Eding Dec 13 '10 at 04:38
  • @trinithis: it's kind of an example of begging the question in my opinion. – Robert S Ciaccio Dec 13 '10 at 04:52
  • 3
    @Billy Well, the "it must be wrong because others didn't do it" questions are annoying as hell too. You have provided NO FACTS on which is better, and, as a matter of fact, there are lots of things that are easier with null terminated strings. And you have provided no proof that C didn't win out because it uses null terminated strings. And that's the problem here: this question is pure flame and speculation, and never did I see one question on Stack Overflow that didn't deserve getting closed more than this one. – Daniel C. Sobral Dec 13 '10 at 10:58
  • 2
    @Daniel: No facts? I think I listed plenty in my question. – Billy ONeal Dec 13 '10 at 17:52
  • @Daniel: I've edited the question a bit. Better? – Billy ONeal Dec 13 '10 at 18:17
  • 2
    @Billy No facts. 1. Age != Better. 2. The reverse is also true. 3. Not true, one can have buffer overruns either way when managing one's own memory. 4. This is just a defense, not an advantage. 5. Popularity != Better. 6. Irrelevant -- C++ did not exist before C. 7. Irrelevant -- C can handle memory buffers with nulls fine, and C strings are used to display things on the screen, and null isn't a graphic character. So, no facts indicating that size-prefixed strings are better than null terminated ones. – Daniel C. Sobral Dec 13 '10 at 18:27
  • 1
    @Daniel: 1. I never meant to say that age meant better -- more meant to say that length prefixing doesn't postdate C and therefore could have been considered when it was designed. 2. I believe I have substantiated this better with my edit. 7. But C's standard library cannot. Nor can any C library expecting plain "C strings". So if you're reading some on-disk format that's supposed to contain a string, and somebody has put a null there, your program is brought to its knees. This "just works" in other languages without difficulty. – Billy ONeal Dec 13 '10 at 18:34
  • 6
    Concat is only O(m) with length-prefixing if you destroy one of the strings. Otherwise, same speed. The most common uses of C strings (historically) were printing and scanning. In both of these, null-termination is faster because it saves one register. – Daniel C. Sobral Dec 13 '10 at 18:39
  • 1
    @Daniel: `strcat` destroys one of the strings. – Billy ONeal Dec 13 '10 at 18:41
  • 4
    @Billy Excuse me, but my standard C library has a bunch of functions starting with "mem" which are ALL size-based. Nor are any of them new additions. – Daniel C. Sobral Dec 13 '10 at 18:41
  • 3
    @Billy You said "C's standard library cannot", but it can. – Daniel C. Sobral Dec 13 '10 at 19:08
  • @Daniel: But they don't. – Billy ONeal Dec 13 '10 at 19:27
  • 1
    @Billy In what way it is not possible to use mem* functions to handle stuff that have nulls in it? I have certainly used so, and if you have ever used a Unix, then you most surely have used code that has taken advantage of it as well. Of course, you can't *print* it, because null characters aren't printable. But you can manipulate it any way you want. Here's the title to a man page: "bcmp(3), bcopy(3), bzero(3), memccpy(3), memchr(3), memcmp(3), memcpy(3), memmove(3), memset(3) - byte string operations". None of these (and others) functions give a damn about nulls, so what's the problem? – Daniel C. Sobral Dec 13 '10 at 22:03
  • 1
    This is the stupidest flame war ever. I have no preference over either style, but Billy's stupid insistence is driving me toward null terminated strings. (Plus when is determining the length of a string one of the most common operations of a string? Concat, sure. Output, sure. Length... no. Even when considering concat, m+n -> 2max(m,n) -> 2n -> n -> no difference.) – Thomas Eding Dec 14 '10 at 02:26
  • 2
    Also, most strings I deal with are constant. Even when considering the non-constant ones, you'll never notice the difference between speeds. And if you do, then you are using the wrong data type anyway. You even mention `the efficiency argument doesn't sell me here`, so your arguments for efficiency of string algorithms wouldn't sell me (even if I considered it important). – Thomas Eding Dec 14 '10 at 02:34
  • 4
    If I could put myself anyplace in time it would be when K&R defined C. Nearly every security exploit ever has been a string or buffer overrun. If only they had included length-prefixed strings/arrays, and corresponding language constructs to manipulate them. (And then I'd visit Copernicus; tell him that it's ellipses, not circles. In both cases people would be spared decades of pain.) – Ian Boyd Jul 11 '11 at 13:58
  • @trinithis: I must point out though that I'm talking about large efficiency differences in terms of the algorithms, but only 3 bytes in terms of memory differences. The first is a big deal. The second is small cheese. As for "stupid insistence", if you would explain how I have been stupid rather than calling me stupid, perhaps you would be considered more seriously. – Billy ONeal Jul 18 '11 at 22:25
  • I [wrote about this in 2003](http://www.tbray.org/ongoing/When/200x/2003/04/13/Strings) and stand by what I said then. – Tim Bray Dec 12 '10 at 03:24
  • I don't see why using a length prefix instead of null termination results in "cluttered up semantics". In both cases you've got a chunk of bytes. If you wanted to talk about C#/Java, which do things like string interning automatically, then you might have an argument.... – Billy ONeal Dec 12 '10 at 04:36
  • 3
    A length prefix is not part of a "chunk of bytes" unless your code handles it as such (which would be extremely slow under constant usage). It's a machine-specific (size, endian, alignment requirement, etc.) data object which makes strings require significant serialization to store in files, transmit over a network, etc. Look how many newbies you see sending (machine-specific) binary data over the wire on SO, and imagine how much worse it would be if strings contained binary data... – R.. GitHub STOP HELPING ICE Dec 12 '10 at 16:15
  • If you remember Microsoft's Assembler (MASM), it used `$` terminated strings. So the terminator (or lack thereof) is an arbitrary choice by language authors. – jww Feb 07 '14 at 08:31
  • @TimBray: IMHO, Java should have included a `string` primitive, which would be a 32-bit opaque type that might hold a reference to a Java object, but would not necessarily have to do so. I've been toying with implementing in C++ a garbage-collected string/array pool for RAM-limited systems, where each string reference would take 2 bytes outside the pool and 2-3 bytes inside it; strings under 32 bytes would have one extra byte of overhead for the length, and longer strings would add a little more. Arrays of references stored in the pool would take only two bytes per string. – supercat Mar 05 '15 at 05:53
  • @TimBray: The only time I see zero-terminated strings as advantageous is when passing a string by value to a method which will not need to use the string after it returns. In all other contexts, code that needs to work with strings whose length isn't known in advance will need to keep track of the lengths of allocated memory blocks somehow, and if it's having to do that the null terminator doesn't really buy much. – supercat Mar 05 '15 at 05:55
  • 2
    "I wrote about this in 2003 and stand by what I said then." -- That is perhaps the worst article ever written on strings. strncpy is "best practice"? Good grief. – Jim Balter Mar 05 '15 at 20:33
  • Remember that C avoids automatically performed calculations for performance reasons. So you had to store the length yourself, in the same way as you have to maintain the `\0`. And in my view maintaining a `\0` at the end of each string myself is much easier than maintaining a number at the beginning and keeping track of the amount of data following it; if the amount changes I also have to change the number. – dhein May 26 '15 at 12:57
  • @Zaibis: You need to change where the null terminator is, which is functionally the same transformation. – Billy ONeal May 26 '15 at 18:21
  • "Length and concat are by far the most common string operations." [citation needed] I find code that stores the length in addition to the string (either as a prefix or in a descriptor block) tends to use the length a lot. But code that uses zero-terminated strings often doesn't care about the length at all. And concat is still O(n+m) because, except in special cases, you first have to copy the original string to a buffer that's large enough to contain both. I'm not against pre-counted strings, but the question makes a lot of assumptions that bias the answers. – Adrian McCarthy Jun 26 '17 at 21:00
  • "a length[-]prefixed string is only three bytes wider than a null terminated string" plus some padding for alignment, because you probably want the count to be aligned. Zero-terminated strings that are allocated on the heap will also be aligned, but string literals can be packed at compile and link time and thus have no alignment overhead. – Adrian McCarthy Jun 26 '17 at 21:14
  • "...every other language..." is written in C – purec Sep 30 '18 at 09:58
  • @purec: That doesn't mean they use NTCTS. (And that isn't true anyway) – Billy ONeal Oct 01 '18 at 17:54
  • Sorry for beating a dead horse, but to add another twist to the story, @BillyONeal think about how `O(1)` strlen is achieved for prefixed strings. Where did the length come from to start with? Somebody still had to compute it. Whether it was the compiler or some other function that read it from a file or stdin or whatever. Somebody still had to perform an `O(n)` complexity operation to compute that length. So, no free lunch here either. – Super-intelligent Shade Dec 28 '21 at 05:24
  • @InnocentBystander We have `std::string_view` in large part because `strlen` was consuming ~1% of fleet wide CPU for Google. In general, nobody has to pay `O(n)` to calculate anything here since essentially all hardware I/O is length prefixed. – Billy ONeal Dec 29 '21 at 21:05
  • @DanielC.Sobral With length-prefixing, you can use CPU SIMD registers to copy large amounts of data at once, whereas doing that with null-terminated strings requires assuming that you can access some amount of data after the end of the string, rather than having to scan byte-by-byte. – Solomon Ucko Feb 25 '22 at 18:27
  • @SolomonUcko I'm not sure what that's in reply to, but SIMD did not exist when C and C++ were created. – Daniel C. Sobral Feb 28 '22 at 15:14
  • 1
    What I like about null-terminated strings is that you can jump into the middle of one and "ride it out" to save memory. For example, if you want your code to conditionally print "Not Integer" or "Integer" you only need to encode the string "Not Integer" and just conditionally add 4 bytes to the string's base pointer. You can't do that with Pascal-style. But let's be honest: nobody takes advantage of that in the modern era, so the point is moot. – puppydrum64 Dec 21 '22 at 15:33
  • 1
    @Super-intelligentShade While this is certainly true, if the programmer or compiler is counting the string length, that's a cost that the end user doesn't have to pay. It's kind of like having an array of pointers to strings to represent your program's error messages, using the error code as an index. Yes, you have to reserve `sizeof(int)` bytes per error message just for the pointers, but if multiple error codes have the same message, the overall savings add up over time. – puppydrum64 Dec 21 '22 at 15:36

20 Answers

216

From the horse's mouth

None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.

Dennis M Ritchie, Development of the C Language

AShelly
Hans Passant
  • 19
    Another relevant quote: "...the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe..." – AShelly Nov 13 '15 at 17:41
157

C doesn't have a string as part of the language. A 'string' in C is just a pointer to char. So maybe you're asking the wrong question.

"What's the rationale for leaving out a string type" might be more relevant. To that I would point out that C is not an object-oriented language and only has basic value types. A string is a higher-level concept that has to be implemented by combining values of other types in some way. C is at a lower level of abstraction.

in light of the raging squall below:

I just want to point out that I'm not trying to say this is a stupid or bad question, or that the C way of representing strings is the best choice. I'm trying to clarify that the question would be more succinctly put if you take into account the fact that C has no mechanism for differentiating a string as a datatype from a byte array. Is this the best choice in light of the processing and memory power of today's computers? Probably not. But hindsight is always 20/20 and all that :)

Robert S Ciaccio
  • @calavera this is not a wrong question. consider it `asciiz` type, or null-terminated char arrays. – khachik Dec 11 '10 at 20:22
  • 34
    `char *temp = "foo bar";` is a valid statement in C... hey! isn't that a string? isn't it null terminated? – Yanick Rochon Dec 11 '10 at 20:22
  • 62
    @Yanick: that's just a convenient way to tell the compiler to create an array of char with a null at the end. it's not a 'string' – Robert S Ciaccio Dec 11 '10 at 20:24
  • 31
    @calavera: But it could have just as simply meant "Create a memory buffer with this string content and a two byte length prefix", – Billy ONeal Dec 11 '10 at 20:28
  • 3
    @calavera, a "string" is by definition "a linear sequence of symbols", and not necessarily a data type. It was made a type by higher language levels for convenience. In C, that is a string, in C# it's something else. The question is about C strings, which is just that; a pointer to a linear sequence of characters, followed by a `\0` char. – Yanick Rochon Dec 11 '10 at 20:29
  • 6
    It may not be a string object as one would think of it in C++, but it is *by definition* a C string. Quit trying to deny it. – Mark Ransom Dec 11 '10 at 20:30
  • 4
    1) C has strings. 2) C strings are not types, they're defined as char or wchar_t arrays that contain only one null character at the end. 3) What you're saying makes no sense. Why is "str" null-terminated but not size prefixed? – tiftik Dec 11 '10 at 20:33
  • 16
    @Billy: well since a 'string' is really just a pointer to char, which is equivalent to a pointer to byte, how would you know that the buffer you're dealing with is really intended to be a 'string'? you would need a new type other than char/byte* to denote this. maybe a struct? – Robert S Ciaccio Dec 11 '10 at 20:33
  • 3
    @calavera: You wouldn't. But you really don't know that with a C string either. Someone can pass you a non null terminated buffer anytime they want. – Billy ONeal Dec 11 '10 at 20:34
  • 1
    chill y'all :) I'm merely pointing out that in order for a string to be meaningful to the machine language itself, there has to be some way to determine whether a type is actually a string or not. the language authors obviously chose to go with one of the simplest and most memory efficient ways of doing so. Admittedly not ideal and still ambiguous, but simply prepending the length to the front of a byte array doesn't solve the problem unless you know that all byte arrays are prepended by their length. – Robert S Ciaccio Dec 11 '10 at 20:44
  • 1
    @Billy: see my comment above. I admit that it's not unambiguous, but it's less ambiguous than simply prepending the length to a byte array. – Robert S Ciaccio Dec 11 '10 at 20:45
  • 28
    I think @calavera is right, C doesn't have a data type for strings. Ok, you can consider an array of chars like a string, but this doesn't mean it's always a string (for string I mean a sequence of characters with a definite meaning). A binary file is an array of chars, but those chars don't mean anything for a human. – BlackBear Dec 11 '10 at 20:48
  • 2
    @Yanick: absolutely untrue. the definition of the word 'string' changes with the context. I'm speaking in the context of data types. – Robert S Ciaccio Dec 11 '10 at 20:53
  • 4
    A string doesn't need to be human-readable - "In computer science a string is any finite sequence of characters (i.e., letters, numerals, symbols and punctuation marks)." – jweyrich Dec 11 '10 at 21:05
  • @Billy: even if we combined the two approaches (prepending length and appending null), the resulting char/byte array would still have to be traversed in order to determine whether or not the bytes actually represented a string or not. So that approach would be less efficient initially, but more efficient once we've kinda-sorta validated that we're dealing with a string. the only unambiguous way to deal with strings using the existing types would be to create a struct that represents strings, not allow statements like `char* myStr = "Hello World";`, but only `strStruct* str = "hello world";` – Robert S Ciaccio Dec 11 '10 at 21:09
  • 2
    @jweyrich: You're right, I mean, no one would read a binary file and put its contents in a string variable, right? Well, I wouldn't do but I don't know c++. – BlackBear Dec 11 '10 at 21:10
  • 1
    @calavera: Why would you need to traverse the string? Most C I've seen doesn't bother checking if the buffer it gets is null terminated; if it's not null terminated a crash results. (Because in general, it's not possible to detect this kind of failure) – Billy ONeal Dec 11 '10 at 21:10
  • @Billy: yes, and that's exactly why it sucks to not have a string type integrated into the language. and also why functions like strcpy_s and the like have replaced the old versions. – Robert S Ciaccio Dec 11 '10 at 21:12
  • 4
    @Mark Ransom: You're wrong and don't tell me what to stop doing. If you're so sure of your definition, try this: `int* str = "this is just bytes, i have no idea what a string is";` the part on the right side is a **string literal**. C has no idea what a string type is, it only knows how to assign a string literal to a pointer. – Robert S Ciaccio Dec 11 '10 at 21:41
  • 4
    @Yanick Rochon: `char a[4] = "toto";` is also a valid C statement, but in this case "toto" may be a string but is not zero terminated (one of the mostly ignored small differences between C and C++). – kriss Dec 12 '10 at 00:29
  • 1
    @calavera: Sorry, meant to tell you +1 earlier. @BlackBear: I don't see why length prefixing precludes storing binary data in a string variable. Nor do I see why null termination does the same. People use `char *` to point to plain bytes all the time, and it's completely reasonable. – Billy ONeal Dec 12 '10 at 04:31
  • 4
    Just because C doesn't have a string type doesn't mean it doesn't have strings. It has a well defined convention, dating back to the beginning of the language, which is supported by the language via string literals. Any attempt to claim otherwise is just being excessively pedantic. – Mark Ransom Dec 12 '10 at 05:01
  • @Billy: thanks, I appreciate that we can discuss this and disagree without resorting to telling each other what to think or do. :) – Robert S Ciaccio Dec 12 '10 at 07:06
  • 1
    @Billy: I didn't want to say that. I just said I agree with @Calavera: "well since a 'string' is really just a pointer to char, which is equivalent to a pointer to byte, how would you know that the buffer you're dealing with is really intended to be a 'string'?". – BlackBear Dec 13 '10 at 10:26
  • 1
    @BlackBear: You don't. At least, not in C. – Billy ONeal Dec 13 '10 at 17:52
  • 4
    +1 from me. Realizing that an array of `char` is not the same as a string of characters (because there's no notion of encoding, for instance) is a key insight. – Frerich Raabe Dec 22 '10 at 09:04
  • @BillyONeal: with regard to your `char *` statement; people may do that, but isn't that bad practice? It assumes an `unsigned char`, by default. If they're really dealing with binary data and not ASCII strings on a system that has `signed char`, there's a problem. – mrduclaw Sep 29 '11 at 20:10
  • 1
    @mrduclaw: No, it really doesn't assume anything about signed or unsigned char. If you never actually access the data through the char type, it doesn't matter what the type of your buffer actually is. (Which is a common pattern when dealing with opaque data loaded from a file or elsewhere) – Billy ONeal Sep 30 '11 at 01:30
  • ' But it could have just as simply meant "Create a memory buffer with this string content and a two byte length prefix"' -- No, it could not have, because that would put the nth char of temp at temp[n+2], which is an atrocious thing to enshrine in a programming language. The other obvious reason to have NULL-terminated strings is so that you can have pointers into strings ... which is how string processing in C was always done until machines and compilers with efficient index operations came along. – Jim Balter Jun 30 '12 at 03:53
  • 4
    "String" is well-defined by the C standard to be (basically) a null-terminated sequence of characters. – Miles Rout Jun 21 '14 at 14:34
  • 1
    There are lots of other non-object-oriented languages that still have string support – phuclv Jul 15 '14 at 01:14
  • @JimBalter: How about having a pointer to the first character, and specify that if the *preceding* character is no greater than UCHAR_MAX/2, it represents length; otherwise if the character before that is no greater than UCHAR_MAX/2, the length is p[-2]*(CHAR_MAX/2+1)+p[-1], etc. up to as many preceding bytes as are required? – supercat Mar 05 '15 at 14:35
  • @supercat Your lunatic idea has numerous problems, such as not being able to append to a string ... even copying such a thing to an existing buffer wouldn't be possible. And since your cockamamie scheme requires at least 2 bytes of length, it would have been inferior to simply using a fixed 16 bit preceding length on the original C machines. And of course your stupendously bad idea still wouldn't allow a pointer into a string for traversing it -- the second point of the comment you're responding to. – Jim Balter Mar 05 '15 at 18:45
  • @JimBalter: It would require one byte of length for strings up to 127 bytes, two for strings up to 16383 bytes, three for strings up to 2,097,151 bytes, etc. When allocating a string buffer of a certain size, leave a suitable amount of space for the length. If a method is told that a buffer has space for 32768 bytes, it would be entitled to assume that the three bytes preceding the pointer were available. – supercat Mar 05 '15 at 19:01
  • @JimBalter: Slight correction: specify that all code changing a string's length must write the new length in the same format as the old. To allocate a 32768-byte string, `char *s=malloc(32768+3)+3; s[-1]=0x80; s[-2]=0x80; s[-3]=0x00;`. Given a pointer to a string, one could find the allocation base via `char *p=s; do {--p;} while(*p & 0x80);`. Actually, one could enhance this idea by requiring that all *writable* strings be preceded by two variable-length numbers--the nearer one being the present length and the further one the allocated length. In that way... – supercat Mar 05 '15 at 19:15
  • ...one could effectively guard against buffer overruns without requiring code to manually keep track of buffer lengths. – supercat Mar 05 '15 at 19:24
  • @supercat " It would require one byte of length" -- sorry, I misread your design. But I'm not going to spend any more time on this silly and moot idea. If you want to design a new old-fashioned PL that uses it, go for it. – Jim Balter Mar 05 '15 at 19:40
  • 2
    "*A 'string' in C is just a pointer to char ...*" it's not (a *pointer*); it's a `0`-terminated ***array*** of `char`. – alk Jun 29 '16 at 17:03
  • @alk: An array in C _is just a pointer_! As far as I know, `array[3]` is actually doing `*(array + 3)` behind the scenes. (Ignoring things like [ASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization), of course.) I've actually seen people iterating through strings using pointer manipulation. – SilverWolf Apr 30 '18 at 18:45
  • @SilverWolf Arrays and pointers are closely related, but they are not the same thing. – klutt Dec 15 '21 at 19:08
119

The question is asked as a length-prefixed strings (LPS) vs zero-terminated strings (SZ) thing, but it mostly exposes the benefits of length-prefixed strings. That may seem overwhelming, but to be honest we should also consider the drawbacks of LPS and the advantages of SZ.

As I understand it, the question may even be understood as a biased way of asking "what are the advantages of zero-terminated strings?".

Advantages (I see) of Zero Terminated Strings:

  • very simple, no need to introduce new concepts in language, char arrays/char pointers can do.
  • the core language includes only minimal syntactic sugar to convert something between double quotes to a bunch of chars (really a bunch of bytes). In some cases it can be used to initialize things completely unrelated to text. For instance the xpm image file format is valid C source that contains image data encoded as a string.
  • by the way, you can put a zero in a string literal, the compiler will just add another one at the end of the literal: "this\0is\0valid\0C". Is it one string? Or four strings? Or just a bunch of bytes...
  • flat implementation, no hidden indirection, no hidden integer.
  • no hidden memory allocation involved (well, some infamous non-standard functions like strdup perform allocation, but that's mostly a source of problems).
  • no specific issue for small or large hardware (imagine the burden of managing a 32 bit prefix length on 8 bit microcontrollers, or the restriction of limiting string size to less than 256 bytes; that was a problem I actually had with Turbo Pascal eons ago).
  • implementation of string manipulation is just a handful of very simple library functions.
  • efficient for the main use of strings: constant text read sequentially from a known start (mostly messages to the user).
  • the terminating zero is not even mandatory; all the necessary tools to manipulate chars like a bunch of bytes are available. When performing array initialization in C, you can even avoid the NUL terminator. Just set the right size: `char a[3] = "foo";` is valid C (not C++) and won't put a final zero in `a`.
  • coherent with the unix point of view "everything is a file", including "files" that have no intrinsic length like stdin and stdout. You should remember that the open, read and write primitives are implemented at a very low level. They are not library calls, but system calls. And the same API is used for binary or text files. File reading primitives get a buffer address and a size and return the new size. And you can use strings as the buffer to write. Using another kind of string representation would imply that you can't easily use a literal string as the buffer to output, or you would have to give it a very strange behavior when casting it to char*: namely, not returning the address of the string, but the actual data.
  • very easy to manipulate text data read from a file in place, without useless copies of buffers: just insert zeroes at the right places (well, not really with modern C, as double quoted strings are const char arrays, nowadays usually kept in a non-modifiable data segment).
  • prepending an int value of whatever size would imply alignment issues. The initial length should be aligned, but there is no reason to do that for the character data (and again, forcing alignment of strings would imply problems when treating them as a bunch of bytes).
  • length is known at compile time for constant literal strings (sizeof). So why would anyone want to store it in memory, prepending it to the actual data?
  • in a way C is doing as (nearly) everyone else: strings are viewed as arrays of char. As array length is not managed by C, it is logical that length is not managed for strings either. The only surprising thing is the 0 item added at the end, but that's just at the core language level, when typing a string between double quotes. Users can perfectly well call string manipulation functions passing a length, or even use plain memcpy instead. SZ is just a facility. In most other languages array length is managed; it's logical that the same holds for strings.
  • in modern times anyway, 1 byte character sets are not enough and you often have to deal with encoded Unicode strings where the number of characters is very different from the number of bytes. It implies that users will probably want more than "just the size", but also other information. Keeping the length gives us nothing (in particular, no natural place to store them) regarding these other useful pieces of information.

That said, no need to complain in the rare cases where standard C strings are indeed inefficient. Libs are available. If I followed that trend, I should complain that standard C does not include any regex support functions... but really everybody knows it's not a real problem, as there are libraries available for that purpose. So when string manipulation efficiency is wanted, why not use a library like bstring? Or even C++ strings?

EDIT: I recently had a look at D strings. It is interesting enough to see that the solution chosen is neither a size prefix nor zero termination. As in C, literal strings enclosed in double quotes are just shorthand for immutable char arrays, and the language also has a string keyword meaning that (immutable char array).

But D arrays are much richer than C arrays. In the case of static arrays, the length is known at compile time, so there is no need to store it: the compiler has it. In the case of dynamic arrays, the length is available, but the D documentation does not state where it is kept. For all we know, the compiler could choose to keep it in some register, or in some variable stored far away from the character data.

On normal char arrays or non-literal strings there is no final zero, hence the programmer has to put it there himself if he wants to call some C function from D. In the particular case of literal strings, however, the D compiler still puts a zero at the end of each string (to allow easy casting to C strings, making it easier to call C functions?), but this zero is not part of the string (D does not count it in the string size).

The only thing that disappointed me somewhat is that strings are supposed to be UTF-8, but length apparently still returns a number of bytes (at least it's true with my compiler, gdc) even when using multi-byte chars. It is unclear to me whether it's a compiler bug or on purpose. (OK, I probably found out what happened. To tell the D compiler your source uses UTF-8 you have to put some stupid byte order mark at the beginning. I write stupid because I know of no editor doing that, especially for UTF-8, which is supposed to be ASCII compatible.)

Marco A.
  • 43,032
  • 26
  • 132
  • 246
kriss
  • 23,497
  • 17
  • 97
  • 116
  • 1
    @kriss: very good answer. I appreciate that someone else acknowledges that the original question has some editorializing and is not quite what it appears to be. – Robert S Ciaccio Dec 12 '10 at 01:14
  • 1
    @kriss: My question is "why were null terminated strings chosen". I know there are better ways around dealing with things using libraries. But whenever you turn to a library solution to a problem like this, most of what you gain is lost in having to glue your library using code to existing code. Given that the standard uses null terminated strings, that's what you're stuck with. (And sometimes I still have to write this kind of glue because existing code won't support i18n GRRR). Also, I think several of your points apply equally to length prefixing (i.e. library functions). – Billy ONeal Dec 12 '10 at 04:24
  • 7
    ... Continued... Several of your points I think are just plain wrong, i.e. the "everything is a file" argument. Files are sequential access, C strings are not. Length prefixing can also be done with minimal syntactic sugar. The only reasonable argument here is the trying to manage 32 bit prefixes on small (i.e. 8 bit) hardware; I think that could be simply solved by saying the size of the length is determined by the implementation. After all, that's what `std::basic_string` does. – Billy ONeal Dec 12 '10 at 04:26
  • 3
    @Billy ONeal: really there are two different parts in my answer. One is about what is part of the 'core C language', the other one is about what standard libraries should deliver. Regarding string support, there is only **one** item from the core language: the meaning of a double quote enclosed bunch of bytes. I am not really happier than you with C's behavior. I feel magically adding that zero at the end of every double quote enclosed bunch of bytes is bad enough. I would prefer an explicit `\0` at the end when the programmer wants it instead of the implicit one. Prepending a length is much worse. – kriss Dec 12 '10 at 07:34
  • @kriss: The user of the language doesn't care about what the core language defines versus what the standard library defines. (Generally speaking) All joe C programmer cares about is "I have string here, and I want it on the console"... and the functions that do that accept null terminated strings. It is my assertion that null terminated strings were a design error. I back that up by pointing at the fact that C (and C++) is the only (popular) place where they are used. I don't see why tacking on a length prefix to the character data inside `""` s is any more invasive than the null. – Billy ONeal Dec 12 '10 at 07:39
  • 3
    @Billy ONeal: that is just not true, the user cares about what is core and what is libraries. The biggest point is when C is used to implement an OS. At that level no libraries are available. C is also often used in embedded contexts or for programming devices where you often have the same kind of restrictions. In many cases Joes should probably not use C at all nowadays: "OK, you want it on the console? Do you have a console? No? Too bad..." – kriss Dec 12 '10 at 07:52
  • 1
    @kriss: Well, for the .01% of C programmers implementing operating systems, fine. I'll stick with the other 99.9%. Namely, because C without the standard library isn't C. When I'm talking about C, I'm talking about standard C, not some limited version used for bootstrapping an OS. – Billy ONeal Dec 12 '10 at 07:57
  • @Billy ONeal: you should also consider that C and C++ are the only languages used to implement OS kernels. You just need to have a means to initialize a bunch of bytes. C strings are an easy one. If you change the meaning of double quotes to some prepended-length string thing (the only part not in libraries)... you have to find another means, as easy and flexible, for initializing constant byte literals. By the way, you can use sizeof on constant string literals. It is known at compile time; why would you want to store it anywhere? – kriss Dec 12 '10 at 07:58
  • 1
    @kriss: I'm not suggesting that C be changed. I asked why C made the decision it initially made. There's a difference there. Just because a limited form of C is used to implement OSs, doesn't mean that 99% of users of the language are doing that. C is a general purpose programming language, and being general purpose means it doesn't make compromises in order to make a specific task (i.e. OSs) easier. – Billy ONeal Dec 12 '10 at 08:01
  • 2
    @Billy ONeal: The C choice allows one to easily implement prepended-length behavior (hey, the libraries of other languages are mostly written in C). The other way around is just impossible. If the language does not include **any** way to define a bunch of bytes you are just doomed; there are things that can't be done. – kriss Dec 12 '10 at 08:08
  • 1
    @kriss: It does include that, `char myBunchOfBytes[] = {'a', 'b', 'c'};` – Billy ONeal Dec 12 '10 at 08:09
  • @Billy ONeal: another aspect of C that is very similar is the length of arrays. Really I believe the design choice here is the same. Most languages store array length somewhere. C made another choice. Is it a design error? – kriss Dec 12 '10 at 08:11
  • 1
    @Billy ONeal: are you kidding ? Length matters! – kriss Dec 12 '10 at 08:12
  • @kriss: You said "if the language does not provide any way" -- I said there is a way. And there is. The `""` syntax was not designed, nor is it intended to be used, as a random way to insert bytes into your program. It is intended to be used for *human readable strings*, which you can plainly see if you grab yourself a copy of the original (or the ANSI edition) K&R C book, where that's the **only thing it's ever used for**. – Billy ONeal Dec 12 '10 at 08:14
  • 1
    @Billy ONeal: `f = open({'m', 'y', ' ', 'p', 'a', 't', 'h'}, flags);` – kriss Dec 12 '10 at 08:15
  • @kriss: And there's no reason that `open` doesn't accept a length prefixed string, in which case you'd just use the plain `""` syntax you already use. Moreover, `open` isn't a C function, it is a POSIX syscall. – Billy ONeal Dec 12 '10 at 08:16
  • 1
    @Billy ONeal: and now we should enforce use of size-prepending string at kernel level ? Because that's what open is. And as far as I know kernels don't call much of C string manipulation libraries... – kriss Dec 12 '10 at 08:17
  • 1
    @kriss: The same way you enforce null termination now. You don't. – Billy ONeal Dec 12 '10 at 08:21
  • @Billy ONeal: there are syscalls where size is used, and the convention chosen (and as a matter of history it's a C convention) is to pass both string data and length as separate parameters, because it is more flexible. Size-prepended strings just glue together these two pieces of information, and it is not necessary. Pass length when it is needed; do not pass it when it is not. – kriss Dec 12 '10 at 08:24
  • 1
    @kriss: I know there's a history as a C convention. My question was, "Why was that convention there in the first place?". Because if C had been defined differently, then the syscalls would have been defined differently too. As far as "necessary", you do need some kind of way for a function to tell where the end of the string is. There are several ways of doing that. One is length prefix, one is null termination, one is passing a pointer to both the begin and end of the range. You really did pass two pieces of information to `open` in your above code -- the null tells it where the string ends. – Billy ONeal Dec 12 '10 at 08:28
  • @Billy ONeal: I don't understand your previous comment. Maybe it's the word enforce I used that is ambiguous. What I mean is that you would have to change very low-level kernel APIs, even APIs for functions like open where string length is unnecessary and won't give any performance gain. And choose instead an API where you give it heterogeneous data (an int and an array of char). It does not look like a good design choice. – kriss Dec 12 '10 at 08:28
  • 1
    @Billy ONeal: another simple choice would indeed be to pass in two parameters to `open()`, say data and length. Now you would have to use a register to store length, but you *still* need to read chars to use them, maybe comparing them to entries in file system. That's less efficient because you use two registers instead of one. Putting the terminator inside the string is more economical in that case. I'm quite sure if you examine closely syscalls the same will be true for all of them involving zero terminated strings as input. – kriss Dec 12 '10 at 08:44
  • @kriss: You're implementing what the API should be in terms of C. If it was the "standard thing everybody does" to use length prefixing, then that's what you'd be using. It wouldn't be thinking "I'm passing two values", it'd be "I'm passing a string". Don't forget that C predates POSIX, the standard that defines most of the syscalls you're talking about. And you're not going to convince me much on the speed/register allocation argument, because other languages, despite being slower than C in the average case, are much faster than C w.r.t. string manipulation. – Billy ONeal Dec 12 '10 at 09:11
  • @kriss: I'm done arguing about this. If it was a good design decision, then there would be other programming languages that would have copied the behavior. (They copied most every other behavior from C -- there had to be a damn good reason to leave this bit out) – Billy ONeal Dec 12 '10 at 09:12
  • 7
    @Billy "Well, for the .01% of C programmers implementing operating systems, fine." The other programmers can take a hike. C was created to write an operating system. – Daniel C. Sobral Dec 13 '10 at 11:06
  • 1
    @Daniel: My K&R C book disagrees with you. – Billy ONeal Dec 13 '10 at 17:59
  • 7
    Why? Because it says it is a general purpose language? Does it say what the people who wrote it was doing when it created? What was it used for for the first few years of its life? So, what is it that it says that disagrees with me? It is a general purpose language _created to write an operating system_. Does it deny it? – Daniel C. Sobral Dec 13 '10 at 18:15
  • +1 from me; I don't quite agree with all your points but I appreciate that you actually put the effort into it and listed a few opinions in favor of null-terminated strings. – Frerich Raabe Dec 22 '10 at 09:06
  • 1
    [`strdup`](http://pubs.opengroup.org/onlinepubs/009695399/functions/strdup.html) is a standardised function. It is not in the C specification, but it is in the POSIX specification. – dreamlax Jan 04 '11 at 10:14
  • 1
    @dreamlax: yes. True, but POSIX is not C, and besides that is not the point. I was just pointing out that all functions hiding mallocs besides explicit alloc ones are likely to introduce hard to find bugs (and the problem is usually much much worse when using C++ libraries than C ones). As a personal experience I lost a few weeks pointing out a memory leak coming from an strdup... and I probably became overcautious of such things. – kriss Jan 04 '11 at 13:19
  • 1
    @Daniel: No, it does not deny it. However, it does define a standard library, and assumes that a user of that language will have access to that standard library. It says absolutely nothing about operating systems. – Billy ONeal Jul 18 '11 at 22:27
  • 2
    @Billy I'm still waiting to hear what K&R says that _contradict_ that C was created to write an operating system. Which, actually, you won't find, because C was created to write an operating system. K&R's C Programming Language is simply a book to teach people how to program in it, written years after the language was created. It is absolutely laughable that you even bother to argue whether C was created to write an operating system -- a well known fact -- and downright stupid to try to disregard that as having design consequences. – Daniel C. Sobral Jul 21 '11 at 20:08
  • 2
    @Daniel: It certainly was not created to write an operating system at the expense of all other possible uses of the language. It was created to be a systems programming language, which could be used to write an operating system. It was not created for the sole purpose of writing an operating system, because if that was true it would not be a systems programming language. – Billy ONeal Jul 21 '11 at 20:12
  • 2
    @BillyONeal "The C programming language was devised in the early 1970s as a system implementation language for the nascent Unix operating system. " -- so says Dennis Ritchie at http://cm.bell-labs.com/who/dmr/chist.html And here was Daniel's original claim that you say K&R disagree with: "C was created to write an operating system." The plain fact is that Daniel is right and you are wrong. – Jim Balter Jul 08 '13 at 05:22
  • 2
    @BillyONeal As for why strings are NUL-terminated: primarily a) so that the first char of the string is at str[0], not str[1] or str[2] or str[4], depending on the length of the length and b) so that strings can be traversed by pointers. These reasons are related to other aspects of the design of C. – Jim Balter Jul 08 '13 at 05:33
  • @Jim: I already described a design of length prefixed strings which would allow `str[0]` to remain the first character of the string. – Billy ONeal Jul 08 '13 at 17:20
  • 1
    @BillyONeal Your "design" is equivalent to adding a string type to the language ... strings could not simply be an array of chars with one of the chars meaning "terminator". It would have made the language more complicated and, as has been noted, would have required more space and registers, precious on pdp-7's and pdp-11's. The whole thing is moot because the design was chosen long ago and cannot change. If you care for some reason to believe that they made a mistake, go ahead and do so, but it's not really a valid SO question. – Jim Balter Jul 08 '13 at 18:46
  • @Jim: No, I don't propose adding a separate type for this. It wouldn't use more registers, I've already gone over that. As for the validity of the question, at least 160 people disagree with you. – Billy ONeal Jul 08 '13 at 20:25
  • 1
    I think you're incorrect on all three points, and it's unfortunate that so many people wasted so much time on something that cannot possibly come to anything, myself included. Over and out. – Jim Balter Jul 08 '13 at 20:31
  • 1
    "I don't propose adding a separate type for this" -- I will note again that this is impossible; a hidden length prefix such as used by MSFT's CString requires a new type (a struct and operator overloading), and two pointers, one to the beginning and one to the end, also requires a struct ... and these would have to be primitive to the language to handle string literals. – Jim Balter Mar 05 '15 at 19:21
  • 2
    "If it was a good design decision, then there would be other programming languages that would have copied the behavior. (They copied most every other behavior from C -- there had to be a damn good reason to leave this bit out)" -- this is so foolish and intellectually dishonest. All these other languages have *a string type* that is a primitive in the language, and they don't have the memory constraints of the PDP-7 that C was originally targeted to. – Jim Balter Mar 05 '15 at 19:23
  • @JimBalter: If string literals yielded pointers to prefixed strings, and many methods expected to receive pointers to either a prefixed string or a `struct {char kind; char *data; int length; int avail;}` (a function receiving a pointer could look at the first byte to see whether it was a prefixed string or a string-info structure), then programmers could keep track of which pointers were suitable for passing to such methods and which weren't. Proper string types would be better, but not strictly necessary. – supercat Mar 08 '15 at 15:41
  • 1
    @supercat I've already explained why C literals cannot be represented as prefixed strings without a type and operator overloading ... str[n] yields the wrong char. And having the compiler allocate space at negative address offsets for the prefix is a nightmare. And only NUL-terminated strings provide that pointers into strings point to strings themselves ... an essential feature of early C string processing. In any case, NUL-terminated strings was *not* a bad design decision for the C language. And that's the last I'll say about this silliness. – Jim Balter Mar 08 '15 at 19:48
  • Ok, just one more thing: pointers to the beginning and end of a string would also provide that pointers into strings point to strings themselves, but they mean 3 more bytes per string, more time passing around two pointers instead of one, *and* a built in string type, as early C didn't even have struct copying built into the language. – Jim Balter Mar 08 '15 at 20:03
  • @JimBalter: Few people complain that nearly all sane implementations of `malloc` returns a pointer to data which is prefixed by the allocated size (the exact format of the memory-block prefix varies, but the requirement that `free` be able to release a memory block without being told its size means the size must be findable from a pointer to the block itself). The general way C uses pointers was a sensible programmer-effort-versus-memory tradeoff for the 1970s; that does not imply it's a sensible trade-off on processors with base+index addressing modes and register-based parameter passing. – supercat Mar 08 '15 at 20:38
  • 2
    malloc manages the heap! That's not exposed to the programmer nor requires compiler support the way that it would be if every string had bytes before its address. Your reference to malloc is idiotic and intellectually dishonest. It's not about what people "complain about", it's about what it would require to use it ... malloc's stored length (actually, a pointer to the next block) doesn't require anything of anybody other than malloc. " was a sensible programmer-effort-versus-memory tradeoff for the 1970s" -- which is the topic here! Goodbye. – Jim Balter Mar 08 '15 at 20:41
  • 1
    @kriss: Using the number of bytes for UTF-8 lengths is the only thing that makes sense. They are, at their core, `uint8[]`. You could have had length in terms of code points, but ***that does not help you*** --- after all, multiple code points must be combined in some cases into a single glyph (the algorithm for this can depend on things like the Unicode version, so ...). In most cases (e.g. concatenation, writing to console/stream/file, ...), you ***want*** the byte size, not the code point one. The only place where you want code points is when dealing with high-level characters directly. – Tim Čas Mar 02 '18 at 12:43
  • @kriss: (cont) ... and if you're dealing with high-level characters, you need to start considering the Unicode version, and cultural differences. You see, there is not a single spec that deals with the latter --- it requires a ***lot*** of domain-specific knowledge, and is prone to changes should grammar change [yes, it happens even artificially; e.g. Germany, 1996], or should some bug in the handling be discoevered. You need a lib like HarfBuzz then. *(On a side-note: Indeed, this negates the vast majority of supposed advantages of UCS-4/UTF-32)* – Tim Čas Mar 02 '18 at 12:46
  • @Tim Čas: well, number of bytes is often what matters, but not exactly the only thing that makes sense. In an app of mine I perform glyph rendering. In that case I have to find the glyph definition from my string, and I find the right glyph using the codepoint (and of course, I have to know the current font). For such use cases the character length makes perfect sense. Combining code points won't become a single glyph, I just have to draw both. – kriss Mar 02 '18 at 17:20
  • @Tim Čas: another use case, which I also have. I'm transcoding data from one system to another and sometimes have to convert UTF-8 to UTF-16. Knowing the character length helps in sizing the target buffer memory (OK, to be honest in that use case I don't want character length, I merely want the UTF-16 equivalent length of some UTF-8 string. As in that case I'm really working with a subset of UTF-16 without extended characters, character length is good enough for me). – kriss Mar 02 '18 at 17:32
  • @kriss: In the first case, code points are still not enough --- you *need* to handle combining characters. Where you don't, you're already iterating through it anyhow (not doing random access), so why does it matter? As for transcoding, sure, but that's a very specific use case, while the ***vast*** majority will need the byte length, not code point. – Tim Čas Mar 03 '18 at 00:19
  • 1
    @Tim Čas: combining characters are nothing special, they are taken into account by glyph metrics, at least the way I'm doing it. In other use cases (spoken text?) it would be different. I don't have to know if the result will be understood as one or several chars by the reader. I agree I won't do random access in that case. The character length by itself is not very useful indeed. But easily iterating by character, not by bytes, is. Also, actually knowing the byte length will only be useful to copy the full string, hence I just want to get it all. I don't really care about the length. – kriss Mar 03 '18 at 00:51
  • @kriss: For a very, very broad definition of "glyph metrics", yes. Check out OpenType character combination stuff --- just the base metrics are ***not*** enough. And you can easily iterate by character without having a `.lengthInCharacters`. In fact, high-level languages already do iteration that way. As for the usefulness of a byte length compared to character, you're forgetting the fact that offsets and such work just fine for string slicing (same issues as with code points). – Tim Čas Mar 03 '18 at 13:45
  • @kriss (cont) Basically, you want the *byte* length, not character, for (and the list is incomplete): string concatenation (this includes: printing to console, writing to file, string manipulation), string searches (it's faster than per-character [which will convert down to per-byte anyhow], and with the same limitations), string copies (be it slice or not; again, faster than per-character), et cetera. Granted, searches can be done better, but for that improvement, you need to move *beyond* per-code-point access, and deal with combining characters. – Tim Čas Mar 03 '18 at 13:49
66

I think it has historical reasons; I found this in Wikipedia:

At the time C (and the languages that it was derived from) were developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (though also used by early versions of BASIC), used a leading byte to store the length of the string. This allows the string to contain NUL and made finding the length need only one memory access (O(1) (constant) time). But one byte limits the length to 255. This length limitation was far more restrictive than the problems with the C string, so the C string in general won out.

NotDan
  • 31,709
  • 36
  • 116
  • 156
khachik
  • 28,112
  • 9
  • 59
  • 94
  • But that was a long time ago! Why doesn't the standard change so that a string has a 4-byte "Pascal header"? – Mateen Ulhaq Dec 11 '10 at 20:32
  • 2
    @muntoo Hmm... compatibility? – khachik Dec 11 '10 at 20:34
  • 19
    @muntoo: Because that would break monumential amounts of existing C and C++ code. – Billy ONeal Dec 11 '10 at 20:34
  • BASIC had a 4-byte header for its strings; a 2-byte data type and a 2-byte (unsigned int) data length... but we're not talking about either Pascal or BASIC strings, so stop trying to change the world :) – Yanick Rochon Dec 11 '10 at 20:38
  • 11
    @muntoo: Paradigms come and go, but legacy code is forever. Any future version of C would have to continue to support 0-terminated strings, otherwise 30+ years' worth of legacy code would have to be rewritten (which isn't going to happen). And as long as the old way is available, that's what people will continue to use, since that's what they're familiar with. – John Bode Dec 11 '10 at 23:48
  • 8
    @muntoo: Believe me, sometimes I wish I could. But I'd still prefer 0-terminated strings over Pascal strings. – John Bode Dec 12 '10 at 00:29
  • 2
    Talk about legacy ... C++ strings are now mandated to be NUL-terminated. – Jim Balter Mar 05 '15 at 19:25
34

Calavera is right, but as people don't seem to get his point, I'll provide some code examples.

First, let's consider what C is: a simple language, where all code has a pretty direct translation into machine language. All types fit into registers and on the stack, and it doesn't require an operating system or a big run-time library to run, since it was meant to write these things (a task for which it is superbly well suited, considering there isn't even a likely competitor to this day).

If C had a string type, like int or char, it would be a type which didn't fit in a register or on the stack, and would require memory allocation (with all its supporting infrastructure) to be handled in any way. All of which goes against the basic tenets of C.

So, a string in C is:

char *s;

So, let's assume then that this were length-prefixed. Let's write the code to concatenate two strings:

#include <stdlib.h> /* malloc */
#include <string.h> /* memcpy */

char* concat(char* s1, char* s2)
{
    /* What? What is the type of the length of the string? */
    int l1 = *(int*) s1;
    /* How much? How much must I skip? */
    char *s1s = s1 + sizeof(int);
    int l2 = *(int*) s2;
    char *s2s = s2 + sizeof(int);
    int l3 = l1 + l2;
    char *s3 = (char*) malloc(l3 + sizeof(int));
    char *s3s = s3 + sizeof(int);
    memcpy(s3s, s1s, l1);
    memcpy(s3s + l1, s2s, l2);
    *(int*) s3 = l3;
    return s3;
}

Another alternative would be using a struct to define a string:

struct string {
  int len; /* cannot be left implementation-defined */
  char* buf;
};

At this point, all string manipulation would require two allocations to be made, which, in practice, means you'd go through a library to do any handling of it.

The funny thing is... structs like that do exist in C! They are just not used for day-to-day handling, like displaying messages to the user.

So, here is the point Calavera is making: there is no string type in C. To do anything with it, you'd have to take a pointer and decode it as a pointer to two different types, and then the size of the length field becomes very relevant, and cannot just be left "implementation-defined".

Now, C can handle memory in any way, and the mem functions in the library (in <string.h>, even!) provide all the tooling you need to handle memory as a pair of pointer and size. The so-called "strings" in C were created for just one purpose: showing messages in the context of writing an operating system intended for text terminals. And, for that, null termination is enough.

Daniel C. Sobral
  • 295,120
  • 86
  • 501
  • 681
  • 2
    1. +1. 2. Obviously if the default behavior of the language would have been made using length prefixes, there would have been other things to make that easier. For example, all your casts there would have been hidden by calls to `strlen` and friends instead. As for the problem with "leaving it up to the implementation", you could say that the prefix is whatever a `short` is on the target box. Then all your casting would still work. 3. I can come up with contrived scenarios all day long that make one or the other system look bad. – Billy ONeal Dec 13 '10 at 17:56
  • 6
    @Billy The library thing is true enough, aside from the fact that C was designed for minimal or no library usage. The use of prototypes, for instance, was not common early on. Saying the prefix is `short` effectively limits the size of the string, which seems to be one thing they weren't keen on. Myself, having worked with 8-bits BASIC and Pascal strings, fixed-size COBOL strings and similar things, became a huge fan of unlimited-size C strings quickly. Nowadays, a 32-bits size will handle any practical string, but adding those bytes early on was problematic. – Daniel C. Sobral Dec 13 '10 at 18:12
  • 1
    @Billy: First, thank you Daniel... you seem to understand what I'm getting at. Second, Billy, I think you're still missing the point that is being made here. I for one am not arguing the pros and cons of prefixing string **data-types** with their length. What I am saying, and what Daniel very clearly emphasized, is that there was a decision made in the implementation of C to not handle that argument **at all**. Strings don't exist as far as the basic language is concerned. The decision on how to handle strings is left to the programmer... and null termination became popular. – Robert S Ciaccio Dec 20 '10 at 22:40
  • 1
    +1 by me. One further thing I'd like to add; a struct as you propose it misses an important step towards a real `string` type: it is not aware of characters. It's an array of "char" (a "char" in machine lingo is as much a character as a "word" is what humans would call a word in a sentence). A string of characters is a higher-level concept which could be implemented *on top of* an array of `char` if you introduced the notion of encoding. – Frerich Raabe Dec 22 '10 at 09:02
  • @Frerich While that might be true nowadays, a `char` at the time C was created _was_ very much a character. It is only recently that i18n efforts have changed what a "character" means. – Daniel C. Sobral Dec 22 '10 at 13:53
  • 3
    @DanielC.Sobral: Also, the struct you mention wouldn't require two allocations. Either use it as you have it on the stack (so only `buf` requires an allocation), or use `struct string {int len; char buf[]};` and allocate the whole thing with one allocation as a flexible array member, and pass it around as a `string*`. (Or Arguably, `struct string {int capacity; int len; char buf[]};` for obvious performance reasons) – Mooing Duck Jun 18 '14 at 22:03
21

Obviously for performance and safety, you'll want to keep the length of a string while you're working with it rather than repeatedly performing strlen or the equivalent on it. However, storing the length in a fixed location just before the string contents is an incredibly bad design. As Jörgen pointed out in the comments on Sanjit's answer, it precludes treating the tail of a string as a string, which for example makes a lot of common operations like path_to_filename or filename_to_extension impossible without allocating new memory (and incurring the possibility of failure and error handling). And then of course there's the issue that nobody can agree how many bytes the string length field should occupy (plenty of bad "Pascal string" languages used 16-bit fields or even 24-bit fields which preclude processing of long strings).

C's design of letting the programmer choose if/where/how to store the length is much more flexible and powerful. But of course the programmer has to be smart. C punishes stupidity with programs that crash, grind to a halt, or give your enemies root.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • +1. It would be nice to have a standard place to store the length though so that those of us who want something like length prefixing didn't have to write tons of "glue code" everywhere. – Billy ONeal Dec 12 '10 at 04:28
  • 3
    There's no possible standard place relative to the string data, but you can of course use a separate local variable (recomputing it rather than passing it when the latter isn't convenient and the former isn't too wasteful) or a structure with a pointer to the string (and even better, a flag indicating whether the structure "owns" the pointer for allocation purposes or whether it's a reference to a string owned elsewhere. And of course you can include a flexible array member in structure for the flexibility to allocate the string with the structure when it suits you. – R.. GitHub STOP HELPING ICE Dec 12 '10 at 16:09
14

Laziness, register frugality and portability, considering the assembly guts of any language, especially C, which is one step above assembly (and thus inherits a lot of assembly legacy code). You would agree that a null char would otherwise be useless in those ASCII days (and it was probably as good as an EOF control char).

Let's see it in pseudocode:

function readString(string) // 1 parameter: 1 register or 1 stack entry
    pointer=addressOf(string) 
    while(string[pointer]!=CONTROL_CHAR) do
        read(string[pointer])
        increment pointer

total: 1 register used

case 2

 function readString(length,string) // 2 parameters: 2 registers used or 2 stack entries
     pointer=addressOf(string) 
     while(length>0) do 
         read(string[pointer])
         increment pointer
         decrement length

total: 2 registers used

That might seem shortsighted now, but consider the frugality in code and registers (which were at a premium at that time, the era of punch cards). Being faster as well (back when processor speed could be counted in kHz), this "hack" was pretty darn good, and ported with ease to register-starved processors.

For argument's sake, I will implement two common string operations.

stringLength(string)
     pointer=addressOf(string)
     while(string[pointer]!=CONTROL_CHAR) do
         increment pointer
     return pointer-addressOf(string)

complexity O(n), where in most cases a PASCAL string is O(1) because the length of the string is prepended to the string structure (that also means the length has to be computed at an earlier stage).

concatString(string1,string2)
     length1=stringLength(string1)
     length2=stringLength(string2)
     string3=allocate(length1+length2)
     pointer1=addressOf(string1)
     pointer3=addressOf(string3)
     while(string1[pointer1]!=CONTROL_CHAR) do
         string3[pointer3]=string1[pointer1]
         increment pointer3
         increment pointer1
     pointer2=addressOf(string2)
     while(string2[pointer2]!=CONTROL_CHAR) do
         string3[pointer3]=string2[pointer2]
         increment pointer3
         increment pointer2
     return string3

complexity O(n); prepending the string length wouldn't change the complexity of the operation, though I admit it would take roughly a third of the time.

On the other hand, if you used PASCAL strings you would have to redesign your API to take register length and byte endianness into account. PASCAL strings have the well-known limitation of 255 chars (0xFF) because the length was stored in 1 byte (8 bits), and if you wanted a longer string (16 bits or more) you would have to take the architecture into account in one layer of your code, which in most cases would mean incompatible string APIs if you wanted longer strings.

Example:

A file written with your prefixed-string API on an 8-bit computer would then have to be read on, say, a 32-bit computer; what would a lazy program do but assume that your 4 bytes are the length of the string, allocate that much memory, and then attempt to read that many bytes? Another case would be a string written on a 32-bit PPC (big endian) and read on an x86 (little endian); of course, if one side doesn't know the file was written by the other, there will be trouble. A 1-byte length (0x00000001) would become 16777216 (0x01000000), which is 16 MB for reading a 1-byte string. Of course you would say that people should agree on one standard, but even 16-bit Unicode got little and big endianness.

Of course C would have its issues too, but it would be very little affected by the ones raised here.

dvhh
  • 4,724
  • 27
  • 33
  • Then why is C's string manipulation less efficient than everywhere else? – Billy ONeal Dec 12 '10 at 05:19
  • @Billy ONeal: Could you define what you mean by less "efficient than everywhere else"? – thing2k Dec 12 '10 at 09:22
  • Efficiency is very subjective – dvhh Dec 12 '10 at 10:29
  • @Billy ONeal, what is efficiency here? C string manipulation is as efficient (in terms of memory, compiled code complexity) as it can get. What makes you think C string manipulations are less efficient? – Andrei Sosnin Dec 12 '10 at 16:40
  • 2
    @deemoowoor: Concat: `O(m+n)` with nullterm strings, `O(n)` typical everywhere else. Length `O(n)` with nullterm strings, `O(1)` everywhere else. Join: `O(n^2)` with nullterm strings, `O(n)` everywhere else. There are some cases where null terminated strings are more efficient (i.e. the just add one to pointer case), but concat and length are by far the most common operations (length at least is required for formatting, file output, console display, etc). If you cache the length to amortize the `O(n)` you've merely made my point that the length should be stored with the string. – Billy ONeal Dec 12 '10 at 16:50
  • 1
    I agree that in today's code this type of string is inefficient and prone to error, but for example Console display don't really have to know the length of the string to display it efficiently, file output didn't really need to know about string length (only allocating cluster on the go), And string formatting at this time was done on a fixed string length in most of the case. Anyway you must be writing bad code if you concat in C has an O(n^2) complexity, I am pretty sure I can write one in O(n) complexity – dvhh Dec 13 '10 at 03:00
  • 1
    @dvhh: I didn't say n^2 -- I said m + n -- it's still linear, but you need to seek to the end of the original string in order to do the concatenation, whereas with a length prefix no seeking is required. (This is really just another consequence of length requiring linear time) – Billy ONeal Dec 13 '10 at 18:25
  • 1
    @Billy ONeal: from mere curiosity I did a grep on my current C project (about 50000 lines of code) for string manipulation function calls. strlen 101, strcpy and variants (strncpy, strlcpy) : 85 (I also have several hundreds of literal strings used for message, implied copies), strcmp: 56, strcat : 13 (and 6 are concatenations to zero length string to call strncat). I agree a length prefixed will speedup calls to strlen, but not to strcpy or strcmp (maybe if strcmp API does not use common prefix). The most interesting thing regarding the above comments is that strcat is very rare. – kriss Dec 13 '10 at 21:30
  • @Billy: The point of this posting is the hidden constant in your `O` notation. Iterating through the characters of a length-prefixed string requires an extra register that isn't required when iterating through the characters of a null-terminated string, which means whatever you're trying to do *with* the string must be implemented with one fewer register, which can impact performance. To help you understand, when I first learned an assembly language, it was for a microprocessor where you essentially had only *three* registers available. –  Mar 06 '15 at 04:59
  • 1
    @Hurkyl: That's not true. In the null terminated case, at each comparison step you need to have the pointer to the string (1 register) load the character it points to (2 registers) and compare with 0 (3 registers). In the length prefixed case you need to compare the pointer to the string (1 register) with the pointer to the end of the string (2 registers) and load the character it points to (3 registers again). – Billy ONeal Mar 06 '15 at 17:05
  • @Billy: Some architectures have an immediate addressing mode so you don't need to load 0 into a register. Some architectures have special registers that are always zero. Some architectures will set a zero flag when you load the character so you don't even have to do a test. Some architectures have "branch if zero" instructions. And even if you aren't on any of those architectures, you can free up the register after the test, unlike the length prefixed version which requires you to keep the end-of-string pointer in register (or reload from memory, I suppose). –  Mar 06 '15 at 19:34
  • @kriss: I would guess `strcat` is rare because it's a poorly-designed method. If it accepted pointers to the start of each string and the end of the allocated space, and returned a pointer to the written zero byte, it could be used safely and efficiently without having to find string lengths in advance. As it is, however, using `strcat` safely and efficiently generally requires that one know the lengths of both strings and the buffer, and in cases where one knows those things, `memcpy` will generally be more efficient. – supercat May 26 '15 at 21:26
  • @supercat: a fun fact about strcat (indeed poorly designed) is that some modern compilers can now correctly optimize it and won't compute some hidden strlen again and again. – kriss May 27 '15 at 07:22
  • @kriss: In thinking about strings, I find myself thinking that knowing that C will continue to be used as memory sizes push beyond four gigs, I could design a string type which would have been very practical in the days of K&R (perhaps even moreso than z-strings, since z-strings often require a separate integer or two to keep track of string length and/or buffer length) but remain practical today. On the other hand, it's not hard to believe that someone trying to design prefixed strings without such foreknowledge might have implemented them... – supercat May 27 '15 at 14:46
  • ...in such a way that would have made writing portable code difficult or would have made it possible to grow strings beyond 255 or 65535 bytes. I'm not sure what mix of prefix styles would have been optimal then or now, but allowing e.g. fixed-sized string 0-127 bytes with a one-byte prefix, variable-length buffers up to 4095 bytes with a two-byte prefix, or bigger with a ((sizeof size_t)+1)-byte prefix, etc. as well as a few "indirect pointer" types would have seemed like a practical mix. The key to making code compact and portable would be to use... – supercat May 27 '15 at 14:53
  • ...standard library methods to convert string pointers into structures identifying the buffer location, buffer length, and string length, and to update the lengths of stored strings. User code could then mostly pass around string pointers and implicitly in so doing pass around the string and buffer lengths as well. Further, if the destination for a method like `strcat` or `sprintf` were a heap-allocated string, the method could automatically adjust the allocation as needed, something that's now impossible. – supercat May 27 '15 at 14:56
  • @supercat: what you describe looks very much like C++ String class. I don't see anything which forbids to code some C equivalent through standard library calls (preferably intrisics, because speed is often an issue), including for instance printf and scanf perimeter. Of course this is something entirely different from the C view of strings like byte arrays, where double quotes are mere syntaxic sugar and would raise many issues: such strings couldn't be used with neutral memory manipulation primitives. – kriss May 27 '15 at 15:36
  • @kriss: The library methods' behaviors are strictly defined in terms of zero-terminated sequences of `char`, so one would have to use methods with other names. The biggest difficulties to working with alternate types are with string literals and library methods that use strings but whose main purpose lies elsewhere (e.g. `fopen`). It's possible to write a macro to allow e.g. `ShortString(fred, "My name is Fred");` to yield `union { struct { char header; char dat[15];} STRINGREF stringref; } fred = {{30, "My name is Fred"}};` and likewise with `MedString`, `LongString`, etc... – supercat May 27 '15 at 16:09
  • ...or to write a method which can turn a length-prefixed string to a `char*` if it's a zero-terminated (it might make sense to have all but the shortest ones be zero-terminated by default, or maybe have all strings zero-terminated and put up with the wasted space) but that's nowhere near as nice as being able to use in-line string literals. – supercat May 27 '15 at 16:11
  • @kriss: Also, I see my approach as a bit different from the C++ String class, since most string-buffer variables would be declared using `MedString(george,255);` in the declaration area [to declare a string buffer with two-byte prefix and room for 255 characters]. For really early C, a separate `InitMedString(george);` step would be required in the executable area [present C compilers would allow `char george[256]; strcpy(george,whatever);` without pre-initializing George, but under my design the normal string-copy method would check the destination length before proceeding. – supercat May 27 '15 at 16:29
  • @kriss: Most implementations of the C++ `String` class are designed to primarily use heap-stored variable-length strings, which are for most purposes more expensive than pre-allocated fixed-length string buffers. What makes my approach different from C is that I would include the current and maximum length with the string data, in a fashion that's efficient for short, medium, and long strings. – supercat May 27 '15 at 16:31
  • 1
    @supercat: not really, look at some implementations. Short strings are using a short stack based buffer (no heap allocation) an only use heap when they get larger. But feel free to provide an actual implementation of your idea as a library. Usually the troubles show up only when we get to the details, not in overall design. – kriss May 28 '15 at 06:01
  • I'm aware of short-string optimization, but it requires that all string instances globally have the same amount of "in-place" allocation; there's no way to declare a variable as reserving 127 characters "in place". As for posting an implementation, where should I best do that? – supercat May 28 '15 at 15:12
10

In many ways, C was primitive. And I loved it.

It was a step above assembly language, giving you nearly the same performance with a language that was much easier to write and maintain.

The null terminator is simple and requires no special support by the language.

Looking back, it doesn't seem that convenient. But I used assembly language back in the 80s and it seemed very convenient at the time. I just think software is continually evolving, and the platforms and tools continually get more and more sophisticated.

Jonathan Wood
  • 65,341
  • 71
  • 269
  • 466
  • I don't see what's anymore primitive about null terminated strings than anything else. Pascal predates C and it uses length prefixing. Sure, it was limited to 256 characters per string, but simply using a 16 bit field would have solved the problem in the vast majority of cases. – Billy ONeal Dec 12 '10 at 04:29
  • The fact that it limited the number of characters is exactly the type of issues you need to think about when doing something like that. Yes, you could make it longer, but back then bytes mattered. And is a 16-bit field going to be long enough for all cases? C'mon, you must admit that a null-terminate is conceptually primitive. – Jonathan Wood Dec 12 '10 at 04:47
  • 14
    Either you limit the length of the string or you limit the content (no null characters), or you accept the extra overhead of a 4 to 8 byte count. There's no free lunch. At the time of inception the null terminated string made perfect sense. In assembly I sometimes used the top bit of a character to mark the end of a string, saving even one more byte! – Mark Ransom Dec 12 '10 at 05:14
  • Exactly, Mark: There's no free lunch. It's always a compromise. These days, we don't need to make the same sort of compromises. But back then, this approach seemed as good as any other. – Jonathan Wood Dec 12 '10 at 05:34
8

Assuming for a moment that C implemented strings the Pascal way, by prefixing them by length: is a 7 char long string the same DATA TYPE as a 3-char string? If the answer is yes, then what kind of code should the compiler generate when I assign the former to the latter? Should the string be truncated, or automatically resized? If resized, should that operation be protected by a lock so as to make it thread safe? The C approach sidestepped all these issues, like it or not :)

Cristian
  • 81
  • 1
  • 3
    Err.. no it didn't. The C approach doesn't allow assigning the 7 char long string to the 3 char long string at all. – Billy ONeal Dec 12 '10 at 04:41
  • @Billy ONeal: why not? As far as I understand it in this case, all strings are the same data type (char*), so the length doesn't matter. Unlike Pascal. But that was a limitation of Pascal, rather than a problem with length-prefixed strings. – Oliver Mason Dec 12 '10 at 09:43
  • 5
    @Billy: I think you just restated Cristian's point. C deals with these issues by not dealing with them at all. You're still thinking in terms of C actually containing a notion of a string. It's just a pointer, so you can assign it to whatever you want. – Robert S Ciaccio Dec 12 '10 at 12:48
  • 2
It's like **the matrix**: "there is no string". – Robert S Ciaccio Dec 12 '10 at 12:50
  • 1
    @calavera: I don't see how that proves anything. You can solve it the same way with length prefixing... i.e. don't allow the assignment at all. – Billy ONeal Dec 12 '10 at 16:41
8

Somehow I understood the question to imply there's no compiler support for length-prefixed strings in C. The following example shows that, at the least, you can start your own C string library, where string lengths are counted at compile time, with a construct like this:

#include <stdio.h>

#define PREFIX_STR(s) ((prefix_str_t){ sizeof(s)-1, (s) })

typedef struct { int n; char * p; } prefix_str_t;

int main() {
    prefix_str_t string1, string2;

    string1 = PREFIX_STR("Hello!");
    string2 = PREFIX_STR("Allows \0 chars (even if printf directly doesn't)");

    printf("%d %s\n", string1.n, string1.p); /* prints: "6 Hello!" */
    printf("%d %s\n", string2.n, string2.p); /* prints: "48 Allows " */

    return 0;
}

This isn't without issues, however, as you need to be careful about when to actually free that string pointer and when it is statically allocated (a literal char array).

Edit: As a more direct answer to the question, my view is this was the way C could support both having string length available (as a compile time constant), should you need it, but still with no memory overhead if you want to use only pointers and zero termination.

Of course it seems like working with zero-terminated strings was the recommended practice, since the standard library in general doesn't take string lengths as arguments, and since extracting the length isn't as straightforward as writing char * s = "abc", as my example shows.

Pyry Jahkola
  • 613
  • 3
  • 5
  • Problem is that libraries don't know the existence of your struct, and still handle things like embedded nulls incorrectly. Also, this doesn't really answer the question I asked. – Billy ONeal Dec 12 '10 at 07:31
  • 1
    That's true. So the bigger problem is there's no better standard way to provide interfaces with string parameters than plain old zero-terminated strings. I'd still claim, there are libraries which support feeding in pointer-length pairs (well, at least you can construct a C++ std::string with them). – Pyry Jahkola Dec 12 '10 at 08:08
  • 2
    Even if you store a length, you should never allow strings with embedded nulls. This is basic common sense. If your data might have nulls in it, you should never use it with functions which expect strings. – R.. GitHub STOP HELPING ICE Dec 12 '10 at 16:36
  • @R..: Many applications need to pass around arbitrary-length byte sequences. Data-agnostic string types often work well for that purpose. While some might call that "abuse", language designers would regard the inclusion of both data-agnostic string type and a variable-sized binary-data type as redundant. – supercat Mar 05 '15 at 14:43
  • 1
    @supercat: From a standpoint of security I'd welcome that redundancy. Otherwise ignorant (or sleep-deprived) programmers end up concatenating binary data and strings and passing them into things that expect [null-terminated] strings... – R.. GitHub STOP HELPING ICE Mar 05 '15 at 18:21
  • 1
    @R..: While methods that expect null-terminated strings generally expect a `char*`, many methods which don't expect null termination also expect a `char*`. A more significant benefit of separating the types would relate to Unicode behavior. It may be worthwhile for a string implementation to maintain flags for whether strings are known to contain certain kinds of characters, or are known not to contain them [e.g. finding the 999,990th code point in a million-character string which is known not to contain any characters beyond the basic multilingual plane will be orders of magnitude faster... – supercat Mar 05 '15 at 18:54
  • ...than finding the 999,990th code point of a string that might contain such characters]. Such flags would be useless, however, for strings that were used to hold packed binary data. Further, it's often necessary to serialize strings using an encoding different from their internal storage, but binary data should generally be serialized in content-agnostic fashion. Too bad neither Java nor .NET includes a "blob" type. – supercat Mar 05 '15 at 18:58
6

"Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string."

First, an extra 3 bytes may be considerable overhead for short strings. In particular, a zero-length string now takes 4 times as much memory. Some of us are using 64-bit machines, so we either need 8 bytes to store a zero-length string, or the string format can't cope with the longest strings the platform supports.

There may also be alignment issues to deal with. Suppose I have a block of memory containing 7 strings, like "solo\0second\0\0four\0five\0\0seventh". The second string starts at offset 5. The hardware may require that 32-bit integers be aligned at an address that is a multiple of 4, so you have to add padding, increasing the overhead even further. The C representation is very memory-efficient in comparison. (Memory-efficiency is good; it helps cache performance, for example.)

Brangdon
  • 627
  • 5
  • 7
  • I believe I addressed all of this in the question. Yes, on x64 platforms a 32 bit prefix can't fit all possible strings. On the other hand, you never want a string that big as a null terminated string, because to do anything you have to examine all 4 billion bytes to find the end for almost every operation you could want to do to it. Also, I'm not saying that null terminated strings are always evil -- if you're building one of these block structures and your specific application is sped up by that kind of construction, go for it. I just wish the default behavior of the language didn't do that. – Billy ONeal Jul 23 '12 at 16:52
  • 2
    I quoted that part of your question because in my view it underrated the efficiency issue. Doubling or quadrupling memory requirements (on 16-bit and 32-bit respectively) can be a big performance cost. Long strings may be slow, but at least they are supported and still work. My other point, about alignment, you don't mention at all. – Brangdon Aug 12 '12 at 15:13
  • Alignment may be dealt with by specifying that values beyond UCHAR_MAX should behave as though packed and unpacked using byte accesses and bit-shifting. A suitably-designed string type could offer storage efficiency essentially comparable to zero-terminated strings, while also allowing bounds-checking on buffers for no additional memory overhead (use one bit in the prefix to say whether a buffer is "full"; if it isn't and the last byte is non-zero, that byte would represent the remaining space. If the buffer isn't full and the last byte is zero, then the last 256 bytes would be unused, so... – supercat Mar 06 '15 at 04:47
  • ...one could store within that space the exact number of unused bytes, with zero additional memory cost). The cost of working with the prefixes would be offset by the ability to use methods like fgets() without having to pass the string length (since buffers would know how big they were). – supercat Mar 06 '15 at 04:50
5

One point not yet mentioned: when C was designed, there were many machines where a 'char' was not eight bits (even today there are DSP platforms where it isn't). If one decides that strings are to be length-prefixed, how many 'char's worth of length prefix should one use? Using two would impose an artificial limit on string length for machines with 8-bit char and 32-bit addressing space, while wasting space on machines with 16-bit char and 16-bit addressing space.

If one wanted to allow arbitrary-length strings to be stored efficiently, and if 'char' were always 8-bits, one could--for some expense in speed and code size--define a scheme where a string prefixed by an even number N would be N/2 bytes long, a string prefixed by an odd value N and an even value M (reading backward) could be ((N-1) + M*char_max)/2, etc., and require that any buffer which claims to offer a certain amount of space to hold a string must allow enough bytes preceding that space to handle the maximum length. The fact that 'char' isn't always 8 bits, however, would complicate such a scheme, since the number of 'char' required to hold a string's length would vary depending upon the CPU architecture.

supercat
  • 77,689
  • 9
  • 166
  • 211
  • The prefix could easily be of implementation-defined size, just as is `sizeof(char)`. – Billy ONeal Jan 25 '12 at 17:31
  • @BillyONeal: `sizeof(char)` is one. Always. One could have the prefix be an implementation-defined size, but it would be awkward. Further, there's no real way of knowing what the "right" size should be. If one is holding lots of 4-character strings, zero-padding would impose 25% overhead, while a four-byte length prefix would impose 100% overhead. Further, the time spent packing and unpacking four-byte length prefixes could exceed the cost of scanning 4-byte strings for the zero byte. – supercat Jan 25 '12 at 17:42
  • 1
    Ah, yes. You're right. The prefix could easily be something other than char though. Anything that would make alignment requirements on the target platform work out would be fine. I'm not going to go there though -- I've already argued this to death. – Billy ONeal Jan 25 '12 at 17:57
  • Assuming strings were length-prefixed, probably the sanest thing to do would be a `size_t` prefix (memory waste be damned, it *would* be the sanest --- allowing strings of any possible length that could possibly fit into memory). In fact, that's *kind of* what D does; arrays are `struct { size_t length; T* ptr; }`, and strings are just arrays of `immutable(char)`. – Tim Čas Feb 10 '15 at 22:44
  • @TimČas: Unless strings are required to be word-aligned, the cost of working with short strings would on many platforms be dominated by the requirement to pack and unpack the length; I really don't see that as being practical. If one wants strings to be content-agnostic arbitrarily-sized byte arrays, I think it would be better to keep the length separate from the pointer to the character data, and have a language allow both pieces of information to be obtained for a literal strings. – supercat Feb 10 '15 at 22:53
  • @supercat: I'm confused by what you mean; the D implementation above has a separate pointer, and you can access it via `str.ptr`. The structure itself is passed by-value (in C terms: it's `ARRAY(char) foo;` and *not* `ARRAY(char)* foo;`). – Tim Čas Feb 10 '15 at 22:58
  • 1
    @TimČas: Sorry--I read your use of "prefix" as referring to a length stored in memory *immediately preceding the characters themselves*; since you said "kind of" what D does, I thought you were expecting strings to be something like `struct {size_t length; char text[]; }` – supercat Feb 10 '15 at 23:03
  • @supercat: Ah, no; it's a char*, for various reasons, including slicing. For example, you can do this: `str[5..10]`, which computes a new array `{ .length = 10 - 5; .ptr = old_ptr + 5; }`. This also works with "plain" `T*` pointers and is used to convert back from pointers to arrays: `ptr[0..len]` – Tim Čas Feb 10 '15 at 23:14
  • @TimČas: That's a fine format, but I wouldn't refer to it as a "length prefix". The systems I would describe as using a "length prefix" (e.g. the classic Macintosh OS, or Turbo Pascal on the PC) store the length immediately preceding the text of the string. – supercat Feb 10 '15 at 23:18
  • @supercat: Indeed. Hence the "kind of" --- not completely the same, but achieving (mostly; actually a bit more) the same effect. – Tim Čas Feb 10 '15 at 23:20
  • @supercat "sizeof(char) is one. Always", is it? I'm sure I used a 16-bit DSP in the 2000s which had sizeof(char)==2. You may argue that it's non-conforming to some standard, but there are non-standard compilers out there in the real world. – Neil Apr 16 '19 at 10:34
  • @Neil: I've used a 16-bit TI DSP where `sizeof (int)` was one, but any compiler where `sizeof (char)` is any value other than one is just plain broken. If `p` and `q` are pointers to consecutive elements of a `T[]`, `q-p` will be one, but `(char*)q - (char*)p` will equal `sizeof (T)`. The only way those can both hold is if `sizeof (char)` is one. – supercat Apr 16 '19 at 13:18
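The length-alongside-pointer representation discussed in these comments (D's "fat" arrays, as opposed to a prefix stored before the data) can be sketched in plain C; the names here are hypothetical, not any standard API:

```c
#include <stddef.h>

/* Sketch of a D-style fat pointer: the length travels with the
 * pointer rather than sitting in front of the character data.
 * Slicing is O(1) and copy-free, like D's s[lo..hi]. */
struct str_slice {
    size_t len;
    const char *ptr;
};

static struct str_slice slice(struct str_slice s, size_t lo, size_t hi)
{
    struct str_slice r = { hi - lo, s.ptr + lo };
    return r;
}
```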
4

Null termination allows for fast pointer-based operations.
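For example, with NUL termination any pointer into the middle of a string is itself a valid string, so taking a suffix is O(1) and copy-free. A minimal sketch (`basename_of` is an invented name):

```c
#include <string.h>

/* With NUL termination, the suffix after the last '/' is just a
 * pointer offset--no allocation, no copy. With a length prefix
 * stored before the data, the suffix would need a fresh header
 * (or a copy). */
static const char *basename_of(const char *path)
{
    const char *slash = strrchr(path, '/');  /* find last '/' */
    return slash ? slash + 1 : path;         /* still a valid string */
}
```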

Sonny Saluja
  • 7,193
  • 2
  • 25
  • 39
  • 6
    Huh? What "fast pointer operations" don't work with length prefixing? More importantly, other languages which use length prefixing are faster than C w.r.t. string manipulation. – Billy ONeal Dec 11 '10 at 20:23
  • 13
    @billy: With length prefixed strings, you can't just take a string pointer and add 4 to it and expect it to still be a valid string, because it doesn't have a length prefix (not a valid one, anyway). – Jörgen Sigvardsson Dec 11 '10 at 20:30
  • 1
    @Jorgen: Okay, so you can't cut off the beginning quickly. But you can do everything else (i.e. swapping, shifting, memcpy, memmove, etc) without difficulty. – Billy ONeal Dec 11 '10 at 20:32
  • 1
    @Billy ONeal: All those operations (swapping, memcpy, memmove) have the same time complexity for ASCIIZ strings. Not sure what you mean by "shifting". The only string *modification* with worse time complexity for ASCIIZ strings is removing a suffix, which is O(1) for length-prefixed strings. – j_random_hacker Dec 11 '10 at 20:57
  • 3
    @j_random_hacker: Concatenation is much worse for asciiz strings (O(m+n) instead of potentially O(n)), and concat is much more common than any of the other operations listed here. – Billy ONeal Dec 11 '10 at 21:00
  • 1
    @Billy: to make concat possible in O(n) time you need to store the size of the reserved memory too. Also it's not an asymptotic complexity, since for large n you must reallocate the whole string. – Yakov Galka Dec 11 '10 at 21:10
  • 3
    there's one tiiny little operation that becomes more expensive with null-terminated strings: `strlen`. I'd say that's a bit of a drawback. – jalf Dec 11 '10 at 21:10
  • @ybungalobill: My point is that **everyone else** uses length prefixing, and **everyone else** is faster than C at string operations, despite being slower at most everything else. I don't care much about theoretical complexity; I care more about typical program use. – Billy ONeal Dec 11 '10 at 21:15
  • 12
    @Billy ONeal: **everyone else** also supports regex. So what? Use libraries; that's what they are made for. C is about maximal efficiency and minimalism, not batteries included. C also lets you implement length-prefixed strings using structs very easily, and nothing forbids you from implementing string manipulation by managing your own lengths and char buffers. That's usually what I do when I want efficiency in C; not calling the handful of functions that expect a zero at the end of a char buffer is not a problem. – kriss Dec 12 '10 at 00:24
  • 2
    @kriss: There's a lot to be said for what's standard behavior though. Libraries are going to want "standard" interfaces for strings, so if you write your own structs/libraries you end up writing tons of glue. – Billy ONeal Dec 12 '10 at 04:39
  • @Jorgen: In C, if you take a string pointer and add 4 to it, you get an invalid string if the original string's length is less than 4. So, no, pointer math is not a guaranteed correct operation. – Mike DeSimone Dec 13 '10 at 03:47
  • Who cares about standard behavior? If you are afraid you are going to mess up, just write a wrapper struct over it (mystring). – Thomas Eding Dec 13 '10 at 04:30
  • @Mike: of course the figure 4 was an arbitrary figure. Suppose you found the last \ at position 4 inside the string, and you just want the filename, and not the entire path. In that case, `filename = path + 4`. You don't have to create a new string before passing it to another function that expects a string. That was my point. :) – Jörgen Sigvardsson Dec 17 '10 at 19:33
  • Just as long as your algorithm doesn't assume it always finds a \. That said, several other path operations (extract path to a file, extract filename sans type suffix) still require copies. I don't think it's enough of a gain to counteract that you can push an entire 4 MiB program through a filename via buffer-overflow exploit, whereas with a single-length-byte system, no matter what there would be a 255-byte limit. – Mike DeSimone Dec 18 '10 at 00:25
4

Not a Rationale necessarily but a counterpoint to length-encoded

  1. Certain forms of dynamic length encoding are superior to static length encoding as far as memory is concerned; it all depends on usage. Just look at UTF-8 for proof. It's essentially an extensible character array for encoding a single character, using a single bit for each extended byte. NUL termination uses 8 bits. A 64-bit length prefix can reasonably be called unlimited-length as well. How often you hit the case of your extra bits is the deciding factor. Only one extremely large string? Who cares whether you're using 8 or 64 bits? Many small strings (e.g. strings of English words)? Then your prefix costs are a large percentage.

  2. Length-prefixed strings allowing time savings is not a real thing. Whether your supplied data arrives with a length attached, you count it at compile time, or you're handed truly dynamic data that you must encode as a string, the size is computed at some point in the algorithm. A separate variable can store the size of a null-terminated string, which makes the comparison on time savings moot. One format just has an extra NUL at the end--but if the length encoding doesn't count that NUL, there's literally no difference between the two. No algorithmic change is required at all, just a pre-pass you have to design yourself instead of having a compiler/runtime do it for you. C is mostly about doing things manually.

  3. Length-prefix being optional is a selling point. I don't always need that extra info for an algorithm, so being required to pay for it on every string means my precompute+compute time can never drop below O(n). (E.g. a hardware random number generator, 1-128: I can pull from an "infinite string". Say it only generates characters so fast, so our string length changes all the time. But my usage of the data probably doesn't care how many random bytes I have; it just wants the next available unused byte as soon as it can get it after a request. I could be waiting on the device, but I could also have a buffer of characters pre-read. A length comparison is a needless waste of computation; a null check is more efficient.)

  4. Length-prefix is a good guard against buffer overflow? So is sane usage of library functions and implementation. What if I pass in malformed data? My buffer is 2 bytes long but I tell the function it's 7! Ex: if gets() was intended to be used on known data, it could have had an internal buffer check that tested compiled buffers and malloc() calls and still followed spec. If it was meant to be used as a pipe for unknown STDIN to arrive at an unknown buffer, then clearly one can't know about the buffer size, which means a length arg is pointless; you need something else here, like a canary check. For that matter, you can't length-prefix some streams and inputs; you just can't. Which means the length check has to be built into the algorithm and not a magic part of the typing system. TL;DR: NUL-terminated never had to be unsafe; it just ended up that way via misuse.

  5. counter-counter point: NUL-termination is annoying on binary. You either need to do length-prefix here or transform NUL bytes in some way: escape-codes, range remapping, etc... which of course means more-memory-usage/reduced-information/more-operations-per-byte. Length-prefix mostly wins the war here. The only upside to a transform is that no additional functions have to be written to cover the length-prefix strings. Which means on your more optimized sub-O(n) routines you can have them automatically act as their O(n) equivalents without adding more code. Downside is, of course, time/memory/compression waste when used on NUL heavy strings. Depending on how much of your library you end up duplicating to operate on binary data, it may make sense to work solely with length-prefix strings. That said one could also do the same with length-prefix strings... -1 length could mean NUL-terminated and you could use NUL-terminated strings inside length-terminated.

  6. Concat: "O(n+m) vs O(m)": I'm assuming you're referring to m as the total length of the string after concatenation, because both have to perform at least that many operations (you can't just tack on to string 1--what if you have to realloc?). And I'm assuming n is a mythical amount of operations you no longer have to do because of a precompute. If so, then the answer is simple: precompute. If you're insisting you'll always have enough memory to not need to realloc, and that's the basis of the big-O notation, then the answer is even simpler: do binary search on the allocated memory for the end of string 1; clearly there's a large swath of zeros after string 1, so we needn't worry about realloc. There, I easily got n down to log(n), and I barely tried. Which, if you recall, log(n) is essentially only ever as large as 64 on a real computer, which is essentially like saying O(64+m), which is essentially O(m). (And yes, that logic has been used in run-time analysis of real data structures in use today. It's not bullshit off the top of my head.)

  7. Concat()/Len() again: Memoize results. Easy. Turns all computes into pre-computes if possible/necessary. This is an algorithmic decision. It's not an enforced constraint of the language.

  8. String suffix passing is easier/possible with NUL termination. Depending on how length-prefix is implemented it can be destructive on original string and can sometimes not even be possible. Requiring a copy and pass O(n) instead of O(1).

  9. Argument-passing/de-referencing is less for NUL-terminated versus length-prefix. Obviously because you're passing less information. If you don't need length, then this saves a lot of footprint and allows optimizations.

  10. You can cheat. It's really just a pointer. Who says you have to read it as a string? What if you want to read it as a single character or a float? What if you want to do the opposite and read a float as a string? If you're careful, you can do this with NUL-termination. You can't do this with length-prefix; it's typically a data type distinctly different from a pointer. You'd most likely have to build a string byte-by-byte and get the length. Of course, if you wanted something like an entire float (which probably has a NUL inside it), you'd have to read byte-by-byte anyway, but the details are left to you to decide.
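Point 2's "separate variable" approach can be sketched as a thin wrapper (names invented): the data stays NUL-terminated, so every libc function still works on it, while the cached length makes O(1) lookups possible.

```c
#include <stddef.h>
#include <string.h>

/* Carry the length of a NUL-terminated string in a separate field,
 * computed once up front, so later operations need not rescan.
 * The underlying bytes remain a plain C string. */
struct counted_str {
    size_t len;        /* cached strlen(s) */
    const char *s;     /* still NUL-terminated */
};

static struct counted_str make_counted(const char *s)
{
    struct counted_str c = { strlen(s), s };  /* one O(n) pass, up front */
    return c;
}
```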

TL;DR Are you using binary data? If no, then NUL-termination allows more algorithmic freedom. If yes, then code quantity vs speed/memory/compression is your main concern. A blend of the two approaches or memoization might be best.

Black
  • 185
  • 9
  • **9** was kinda off-base/misrepresented. Length prefix doesn't have this problem; length *passing* as a separate variable does. We were talking about prefix but I got carried away. Still a good thing to think about so I'll leave it there. :d – Black Oct 07 '18 at 23:10
3

Many design decisions surrounding C stem from the fact that when it was originally implemented, parameter passing was somewhat expensive. Given a choice between e.g.

void add_element_to_next(arr, offset)
  char arr[];
  int offset;
{
  arr[offset] += arr[offset+1];
}

char array[40];

void test()
{
  int i;
  for (i=0; i<39; i++)
    add_element_to_next(array, i);
}

versus

void add_element_to_next(p)
  char *p;
{
  p[0]+=p[1];
}

char array[40];

void test()
{
  int i;
  for (i=0; i<39; i++)
    add_element_to_next(array+i);
}

the latter would have been slightly cheaper (and thus preferred) since it only required passing one parameter rather than two. If the method being called didn't need to know the base address of the array nor the index within it, passing a single pointer combining the two would be cheaper than passing the values separately.

While there are many reasonable ways in which C could have encoded string lengths, the approaches that had been invented up to that time would have all required functions that should be able to work with part of a string to accept the base address of the string and the desired index as two separate parameters. Using zero-byte termination made it possible to avoid that requirement. Although other approaches would be better with today's machines (modern compilers often pass parameters in registers, and memcpy can be optimized in ways strcpy()-equivalents cannot) enough production code uses zero-byte terminated strings that it's hard to change to anything else.

PS--In exchange for a slight speed penalty on some operations, and a tiny bit of extra overhead on longer strings, it would have been possible to have methods that work with strings accept pointers directly to strings, bounds-checked string buffers, or data structures identifying substrings of another string. A function like "strcat" would have looked something like [modern syntax]

void strcat(unsigned char *dest, unsigned char *src)
{
  struct STRING_INFO d,s;
  str_size_t copy_length;

  get_string_info(&d, dest);
  get_string_info(&s, src);
  if (d.si_buff_size > d.si_length) // Destination is resizable buffer
  {
    copy_length = d.si_buff_size - d.si_length;
    if (s.si_length < copy_length)
      copy_length = s.si_length;
    memcpy(d.buff + d.si_length, s.buff, copy_length);
    d.si_length += copy_length;
    update_string_length(&d);
  }
}

A little bigger than the K&R strcat method, but it would support bounds-checking, which the K&R method doesn't. Further, unlike the current method, it would be possible to easily concatenate an arbitrary substring, e.g.

/* Concatenate 10th through 24th characters from src to dest */

void catpart(unsigned char *dest, unsigned char *src)
{
  struct SUBSTRING_INFO *inf;
  src = temp_substring(&inf, src, 10, 24);
  strcat(dest, src);
}

Note that the lifetime of the string returned by temp_substring would be limited by those of inf and src, whichever was shorter (which is why the method requires inf to be passed in--if it were local, it would die when the method returned).

In terms of memory cost, strings and buffers up to 64 bytes would have one byte of overhead (same as zero-terminated strings); longer strings would have slightly more (whether one allowed amounts of overhead between two bytes and the maximum required would be a time/space tradeoff). A special value of the length/mode byte would be used to indicate that a string function was given a structure containing a flag byte, a pointer, and a buffer length (which could then index arbitrarily into any other string).
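One possible decoding of such a length/mode byte might look like the sketch below; the exact flag values are not specified above, so these are invented purely for illustration:

```c
#include <stddef.h>

/* Invented encoding for illustration only: a flag byte below 64 IS
 * the string's length (0-63 byte strings pay one byte of overhead,
 * like a terminator); the value 255 means a 4-byte little-endian
 * length follows. Other modes (buffers, substring structs) are
 * not sketched here. */
#define LONG_STRING_FLAG 255u

static size_t header_length(const unsigned char *hdr, size_t *hdr_bytes)
{
    if (hdr[0] < 64) {                 /* short string: length in flag byte */
        *hdr_bytes = 1;
        return hdr[0];
    }
    if (hdr[0] == LONG_STRING_FLAG) {  /* long string: 4-byte length follows */
        *hdr_bytes = 5;
        return (size_t)hdr[1] | ((size_t)hdr[2] << 8)
             | ((size_t)hdr[3] << 16) | ((size_t)hdr[4] << 24);
    }
    *hdr_bytes = 0;                    /* other modes not handled in sketch */
    return 0;
}
```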

Of course, K&R didn't implement any such thing, but that's most likely because they didn't want to spend much effort on string handling--an area where even today many languages seem rather anemic.

supercat
  • 77,689
  • 9
  • 166
  • 211
  • There's nothing that would have prevented `char* arr` from pointing to a structure of the form `struct { int length; char characters[ANYSIZE_ARRAY] };` or similar which would still be passable as a single parameter. – Billy ONeal Mar 05 '15 at 21:30
  • @BillyONeal: Two problems with that approach: (1) It would only allow passing the string as a whole, whereas the present approach also allows passing the tail of a string; (2) it will waste significant space when used with small strings. If K&R wanted to spend some time on strings they could have made things much more robust, but I don't think they intended that their new language would be in use ten years later, much less forty. – supercat Mar 06 '15 at 03:50
  • @BillyONeal: I sketched out what I think would probably have been the best design for strings in C (and would probably still be a good design for embedded systems, if one didn't mind running code through a preprocessor to convert string literals into suitably-prefixed character arrays). – supercat Mar 06 '15 at 04:26
  • The string info structure looks somewhat magical in the example. Would it rely on some global hidden array, or would `char *` have a complete change of meaning (i.e. only a pointer to an array, no longer meaningful as a pointer to an individual char)? It's still unclear what is suggested for the new strcat implementation. Could you elaborate a bit more? – kriss Mar 06 '15 at 08:23
  • @kriss: No magic, implementation-dependent behavior, global variables, etc. required. The first byte of each string header would either identify the target as a zero-length string, a 1-63-byte string, a 1-63 byte buffer with at least one byte empty, a longer string or buffer that would use more bytes to store its length, or `struct SUBSTRING_INFO { unsigned char mflag; struct SUBSTRING_INFO inf; }` [if the latter, `mflag` would set to the byte value that identifies `SUBSTRING_INFO`]. – supercat Mar 06 '15 at 14:07
  • @kriss: The last byte used to store the size would use one bit to indicate whether the target was a string or a non-full buffer. For a non-full buffer, the last byte would indicate if there were 1-255 bytes free, or that the immediately-preceding bytes would indicate how much space was free. The `STRING_INFO` structure would be `char *dat; stringsize_t length,buffsize;`. The `get_string_info` function would look at the target and load the values of `STRING_INFO` appropriately (if the target was a `SUBSTRING_INFO` it would copy the values from that). – supercat Mar 06 '15 at 14:10
  • @kriss: There would be some trade-offs of portability vs speed, but even a fully-portable version would probably outperform zero-terminated strings in most usage scenarios; while the fact that strcpy() doesn't need to know strings' lengths in advance gives it an advantage on short strings, it needs to spend longer on each byte than memcpy--sometimes *much* longer. I would guess that a version of my library which was limited to strings of INT_MAX bytes [whatever that happened to be on a given platform], and was customized for that size, would outperform... – supercat Mar 06 '15 at 14:22
  • ...zero-terminated C strings on almost any platform, for almost any operation involving more than a dozen characters or so [the break-even point would depend upon the operation]. Any idea if there'd be a good place to post it? It's getting a little off-topic here. – supercat Mar 06 '15 at 14:23
  • @supercat: I see, but I'm not sure I agree. This could indeed be in the language, but not as `char *` but as a completely different type. That could be some kind of predefined struct (let's call it "string"). Really it could indeed be the type of what is put between double quotes instead of char *. What you suggests strongly reminds me of Pascal Strings. If Pascal was still there it shouldn't have been hard to make them evolve this way. – kriss Mar 06 '15 at 14:29
  • @kriss: I was assuming `unsigned char*`, rather than `char*`, to point to string headers. From a language perspective, one could say that string literals will have two representations, depending upon length. For length up to `UCHAR_MAX/4`, store the length followed by the text; return a pointer to the length. For longer length, allocate an aligned `unsigned`, followed by a `UCHAR_MAX` byte, followed by the text; return a pointer to the `UCHAR_MAX` byte. Thus, any string literal would yield a pointer to a value `0-UCHAR_MAX/4-1` or else `UCHAR_MAX`. – supercat Mar 06 '15 at 15:11
  • @kriss: I don't think "the char buffer begins with an int containing the length" is any more magical than "the end of the char buffer is terminated with this special character" – Billy ONeal Mar 06 '15 at 17:06
  • @BillyONeal: What do you think of the idea of using either a byte or an int+byte at the start of a string buffer to indicate its length as well as whether it's a fixed string, a resizable string buffer that's full, or a resizable string buffer that's less than full? One could use different flag byte values to enable multiple sizes of length indicator, but adding four or even eight bytes to 64-character strings would be mild overhead compared with adding even two extra bytes to a four-character string. – supercat Mar 06 '15 at 19:09
  • @supercat: I think such a design would have been inpractical in 1975 for code size reasons. Now? No idea. Would need to benchmark to be sure. – Billy ONeal Mar 06 '15 at 20:46
  • @BillyONeal: Deciding on optimal tradeoffs between speed, data size, and code size would have been tricky (and it's likely that someone trying to do so in 1975 would have put in place hard-to-fix architectural limits), but I would think that a good tightly-written string library could have *reduced* overall application code and data size, even in 1975, by avoiding the need for applications to keep track of buffer length, string length, and string content separately, and by allowing efficient operations on substrings (and not just tails). – supercat Mar 06 '15 at 21:06
  • @supercat: the only trouble is that break the semantic of either `unsigned char *` or `char *`. Both of these can be pointers to either one char or several consecutive char, what we got in the habit to call "strings". That's the real reason why a new dedicated string type is needed. – kriss Mar 06 '15 at 23:43
  • @BillyONeal: of course either adding a prefix (one byte or slightly more complicated as supercat suggest) or adding a special ending byte when we use the double quote syntax are both magical behavior. We all know that the first behavior was chosen by Pascal while the second one was chosen by C. The only real point is that if we choose the first option with prefix it's not a `char *` any more but a slightly more complex object best described by a struct in C language. Of course this struct could be predefined as "string" and library functions added. No need to break C type system for that. – kriss Mar 06 '15 at 23:56
  • @kriss: The primary benefit of choosing the second option is that if one has a pointer to a string *which is known to contain at least n characters* one can easily get a pointer to a string containing the portion past the nth character. On the other hand, if one uses a prefix *and leaves some values reserved*, and is willing to use a subroutine call before accessing characters from a string, one can gain lots of abilities including the ability to pass a reference to an *arbitrary* portion of a string [not just the tail], bounds checking, etc. Having such a thing as a language feature... – supercat Mar 07 '15 at 17:53
  • ...would be helpful, since a declaration `string[23] foo;` could allow the compiler to not only allocate 24 bytes for `foo` but initialize the first word so as to identify it as an empty 23-byte buffer. Using bounds-checked buffers would otherwise have required that user code either use separate methods for "store string into unitialized buffer which is known to be large enough" and "store string into bounds-checked buffer"--a little bit of a nuisance--or else a macro to initialize buffers prior to use. Still, I consider it very unfortunate that efforts to save a few bytes... – supercat Mar 07 '15 at 17:58
  • ...on PDP-series hardware have persisted decades later, on platforms where they no longer really save much [and in fact, end up imposing extra costs in any code which wants to be safe]. – supercat Mar 07 '15 at 17:59
  • 1
    This bit about the calling convention is a just-so story with no relation to reality ... it wasn't a consideration in the design. And register-based calling conventions had already been "invented". Also, approaches such as two pointers weren't an option because structs weren't first class ... *only primitives* were assignable or passable; struct copying didn't arrive until UNIX V7. Needing memcpy (which also didn't exist) just to copy a string pointer is a joke. Try writing a full program, not just isolated functions, if you're making a pretense of language design. – Jim Balter Mar 08 '15 at 20:28
  • 1
    "that's most likely because they didn't want to spend much effort on string handling" -- nonsense; the entire application domain of early UNIX was string handling. If it hadn't been for that, we never would have heard of it. – Jim Balter Mar 08 '15 at 20:29
  • 1
    'I don't think "the char buffer begins with an int containing the length" is any more magical' -- it is if you're going to make `str[n]` refer to the right char. These are the sorts of things that the folks discussing this *don't think* about. – Jim Balter Mar 08 '15 at 20:36
  • @JimBalter: What C would really need to make string handling work as I describe would be a syntax to request a struct allocation followed by an extra `n` elements of the last type or array declared therein. Then one could declare `struct TINYSTR { unsigned char head; char dat[0]; }`, `struct MEDSTR { unsigned int head; char dat[0]; }`, `struct LONGSTR { unsigned long head; char dat[0]; }`, and `struct ISTRING { char *ptr; unsigned int length; unsigned int alloc; unsigned head; char dat[0]; }`, and given an initialized variable `v` of any of those types, pass `v.dat` to string methods. – supercat Mar 09 '15 at 00:12
  • @JimBalter: Supporting VLAs would have been easy [don't bother rejecting arrays of size zero, and allow a syntax for requesting larger-than-normal allocations] and would have saved the trouble of code having to fudge around the lack of support. Initializing a buffer's header once would eliminate any further need to pass the buffer's size to string-handling methods. Code that wants to pass an arbitrary part of a string buffer (not just the tail) could create an ISTRING and pass a pointer to its `dat[]` field. In any case, the most important observation... – supercat Mar 09 '15 at 00:17
  • ...is that K&R based the design of C on instruction sets like the PDP series, where accessing a pointer was cheaper than indexing into an array, and passing a pointer into an array was cheaper than passing base and index separately. On many platforms *neither* assumption still holds. Passing base+index means one can use bounds-checking or not as one sees fit, while passing a pointer alone eliminates that possibility. Personally, I'd much rather have the option to bounds-check arrays than decide that any errant array accesses will simply have untrappable UB. – supercat Mar 09 '15 at 00:25
  • K&R did not base C on the PDP instruction set; Ritchie has refuted this rumor in print. Anyway it's irrelevant because this question was about why the C design used NUL-terminated strings, and the repeated harebrained claim by the OP that this was an "inferior" design. The rest of the comments above are also irrelevant, especially the one about VLAs. Now, SO is wisely telling us to avoid extended discussions ... – Jim Balter Mar 09 '15 at 20:20
  • @JimBalter: Would you deny that the design of C and its libraries are largely predicated on the idea that `*dest++ = *src++;` will be faster than `dest[i]=src[i];`? The primary thrust of the first part of my answer--and if you can point me to historical references to help me fix any inaccuracies that would be great--is that C is designed around the concept of passing pointers into the middles of arrays, without any means for the recipient to know anything about the arrays in which they appear, and that is in turn motivated by pointer access being faster than indexed access. You disagree? – supercat Mar 09 '15 at 20:41
2

According to Joel Spolsky in this blog post,

It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."

After seeing all the other answers here, I'm convinced that even if this is true, it's only part of the reason for C having null-terminated "strings". That post is quite illuminating as to how simple things like strings can actually be quite hard.

BenK
  • 333
  • 1
  • 6
  • 14
  • 2
    Look, I respect Joel for a lot of things; but this is something where he's speculating. Hans Passant's answer comes directly from C's inventors. – Billy ONeal Jun 24 '16 at 06:12
  • 1
    Yes, but if what Spolsky says is true at all, then it would have been part of the "convenience" they were referring to. That's partly why I included this answer. – BenK Jun 24 '16 at 06:30
  • AFAIK `.ASCIZ` was just an assembler statement to build a sequence of bytes, followed by `0`. It just means that _zero terminated string_ was a well established concept at that time. It does _not_ mean that zero terminated strings were something related to the architecture of a PDP-*, except that you could write tight loops consisting of `MOVB` (copy a byte) and `BNE` (branch if the last byte copied was not zero). – Adrian W Jul 04 '18 at 22:22
  • It's supposed to show that C is an old, flabby, decrepit language. – purec Sep 30 '18 at 08:43
2

I don't buy the "C has no string" answer. True, C does not support built-in higher-level types, but you can still represent data structures in C, and that's what a string is. The fact that a string is just a pointer in C does not mean the first N bytes cannot take on special meaning as the length.

Windows/COM developers will be very familiar with the BSTR type, which is exactly this: a length-prefixed C string where the actual character data does not start at byte 0.

So it seems that the decision to use null-termination is simply what people preferred, not a necessity of the language.
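As a sketch of the idea (function names and layout here are hypothetical, loosely modeled on BSTR): store the length just before the character data and hand out a pointer to the data itself, so the same pointer still works with code expecting a plain char*.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical BSTR-style layout: a size_t length prefix stored
 * immediately before the character data; the returned pointer points
 * at the data, not the prefix, so it is still usable as a char*. */
char *lpstr_new(const char *src, size_t len)
{
    /* prefix + characters + a trailing NUL for C interop */
    char *block = malloc(sizeof(size_t) + len + 1);
    if (block == NULL)
        return NULL;
    memcpy(block, &len, sizeof(size_t));          /* write the prefix */
    memcpy(block + sizeof(size_t), src, len);     /* may contain embedded NULs */
    block[sizeof(size_t) + len] = '\0';
    return block + sizeof(size_t);
}

size_t lpstr_len(const char *s)  /* O(1) length lookup */
{
    size_t len;
    memcpy(&len, s - sizeof(size_t), sizeof(size_t));
    return len;
}

void lpstr_free(char *s)
{
    free(s - sizeof(size_t));
}
```

Real BSTRs use a 4-byte byte-count prefix and UTF-16 data; the sketch above only illustrates the layout trick, including the ability to carry embedded NULs.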

Mr. Boy
2

One advantage of NUL-termination over length-prefixing, which I have not seen anyone mention, is the simplicity of string comparison. Consider the standard comparison convention, which returns a signed result for less-than, equal, or greater-than. For length-prefixing the algorithm has to be something along the following lines:

  1. Compare the two lengths; record the smaller, and note if they are equal (this last step might be deferred to step 3).
  2. Scan the two character sequences, subtracting characters at matching indices (or use a dual pointer scan). Stop either when the difference is nonzero, returning the difference, or when the number of characters scanned is equal to the smaller length.
  3. When the smaller length is reached, one string is a prefix of the other. Return negative or positive value according to which is shorter, or zero if of equal length.

Contrast this with the NUL-termination algorithm:

  1. Scan the two character sequences, subtracting characters at matching indices [note that this is handled better with moving pointers]. Stop when the difference is nonzero, returning the difference. NOTE: If one string is a PROPER prefix of the other, one of the characters in the subtraction will be NUL, i.e. zero, and the comparison will naturally stop there.
  2. If the difference is zero (and only then), check whether either character is NUL. If so, return zero; otherwise continue to the next character.

The NUL-terminated case is simpler, and very easy to implement efficiently with a dual-pointer scan. The length-prefixed case does at least as much work, nearly always more. If your algorithm has to do a lot of string comparisons [e.g. a compiler!], the NUL-terminated case wins out. Nowadays that might not be as important, but back in the day, heck yeah.
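For illustration, both algorithms can be sketched in C (the function and struct names here are made up, not from any standard library):

```c
#include <stddef.h>

/* NUL-terminated comparison: the classic dual-pointer scan. A NUL in
 * either string either makes the subtraction nonzero (proper prefix)
 * or triggers the equal-and-NUL exit (equal strings). */
int ntstr_cmp(const char *a, const char *b)
{
    while (*a == *b) {
        if (*a == '\0')
            return 0;          /* hit the terminator together: equal */
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
}

/* Length-prefixed comparison: extra bookkeeping for the smaller
 * length and the prefix case, as in steps 1-3 above. */
struct lpstr { size_t len; const char *data; };

int lpstr_cmp(struct lpstr a, struct lpstr b)
{
    size_t n = a.len < b.len ? a.len : b.len;
    for (size_t i = 0; i < n; i++) {
        int d = (unsigned char)a.data[i] - (unsigned char)b.data[i];
        if (d != 0)
            return d;
    }
    /* one string is a prefix of the other (or they are equal) */
    if (a.len < b.len) return -1;
    if (a.len > b.len) return 1;
    return 0;
}
```

Note how the NUL-terminated version needs no length bookkeeping at all: the terminator folds the prefix case into the ordinary character comparison.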

PMar
    This comparison advantage only applies to relational comparisons. For equality comparisons, length-prefixed strings will generally win out, since strings of unequal length can be recognized as unequal without examining any of the content, and content comparison can be done in multi-byte chunks without having to stop as soon as a zero byte is found. Further, the advantage isn't really as great as you state, since the difference will be zero on all but the last loop iteration. – supercat Sep 14 '22 at 15:53
-2

gcc accepts the code below:

char s[4] = "abcd";

and it's OK if we treat it as an array of chars, but not as a string. That is, we can access it with s[0], s[1], s[2], and s[3], or even with memcpy(dest, s, 4). But we'll get garbage characters when we try puts(s), or worse with strcpy(dest, s).

kkaaii
  • @Adrian W. This is valid C. Exact-length strings are special-cased and the NUL is omitted for them. This is generally an unwise practice but can be useful in cases like populating header structs that use FourCC "strings". – Kevin Thibedeau Sep 05 '19 at 16:47
  • You are right. This is valid C, will compile and behaves as kkaaii described. The reason for the downvotes (not mine...) is probably rather that this answer does not answer OP's question in any way. – Adrian W Sep 06 '19 at 15:44
-6

I think the better question is why you think C owes you anything. C was designed to give you what you need, nothing more. You need to lose the mentality that the language must provide you with everything. Or just continue to use your higher-level languages that will give you the luxury of String, Calendar, Containers; and in the case of Java you get one thing in tonnes of variety: multiple types of String, multiple types of unordered_map.

Too bad for you, this was not the purpose of C. C was not designed to be a bloated language that offers everything from a pin to an anchor. Instead you must rely on third-party libraries or your own. And there is nothing easier than creating a simple struct that will contain a string and its size.

struct String
{
    const char *s;
    size_t len;
};
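For illustration, a hypothetical constructor for such a struct (the definition is repeated here so the sketch is self-contained) might look like:

```c
#include <stddef.h>
#include <string.h>

struct String
{
    const char *s;
    size_t len;
};

/* Hypothetical helper: wrap an ordinary NUL-terminated string,
 * computing the length once up front so later lookups are O(1). */
struct String string_from(const char *s)
{
    struct String out = { s, strlen(s) };
    return out;
}
```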

You know what the problem is with this, though: it is not standard. Another language might decide to place the len before the string data. Another might decide to use a pointer to the end instead. Another might decide to use six pointers to make the String more efficient. However, a null-terminated string is the most standard format for a string, which you can use to interface with any language. Even Java's JNI uses null-terminated strings.

Lastly, there is a common saying: use the right data structure for the task. If you find that you need to know the size of a string more than anything else, use a string structure that lets you do that optimally. But don't claim that that operation matters more than anything else for everybody. Why is knowing the size of a string more important than reading its contents? I find that reading the contents of a string is what I mostly do, so I use null-terminated strings instead of std::string, which saves me 5 pointers on a GCC compiler. If I can even save 2 pointers, that is good.

user13947194