17

In C, strings are terminated with a null character ( \0 ), which causes problems when you want to put a null in a string. Why not use a special escaped character such as \$ or something?

I am fully aware of how dumb this question is, but I was curious.

akway
  • 5
    What happens when you want to put \$ in a string? – Nick Presta Jul 19 '09 at 22:55
  • 1
    Then you escape the escape character, of course! – Bryan Oakley Jul 19 '09 at 22:58
  • 4
    @Bryan: You can't escape a character, you can only escape the source code representation of a character. Whichever character you use as termination can't be used inside a string. – Guffa Jul 19 '09 at 23:37
  • 1
    You could always use a pascal string instead of a null terminated string. typedef struct _pstr{ int length; char*bits; } pstr; The downside of this approach is that you have to manually manage it and the string functions won't work in it so you have to roll your own (but i'm pretty sure there is a library somewhere for dealing with this). http://en.wikipedia.org/wiki/Pascal_string#Representations – dsm Jul 30 '09 at 13:25
  • 1
    Even C++ uses Pascal strings basically. If C libraries had used Pascal strings, too, we wouldn't have a stupid O(N) strlen. Additionally, we could have avoided tons of security bugs. – Blaisorblade Nov 19 '10 at 12:11
  • @Guffa: of course you _can_ escape a character in data - ever seen UTF-8? Of course, using escaping breaks the 1 byte-1 char relationship. – Blaisorblade Nov 20 '10 at 12:55
  • @Blaisorblade: UTF-8 doesn't use escaping, it's an encoding so the encoded data is not a string at all. – Guffa Nov 20 '10 at 22:52
  • @Blaisorblade: If C had used Pascal strings, we would instead have a ton of failure-to-handle-allocation-failure bugs, since basically everything would require allocating space to copy a string. The only correct way to do strings is with the `char` array and string length stored *separately*, not bound together, so that you can treat substrings of strings as strings *in-place* without copying. C strings (null terminated) allow that for tails. Pascal strings don't allow it whatsoever. Independent length/pointer pairs allow arbitrary substring referencing. – R.. GitHub STOP HELPING ICE Sep 10 '11 at 03:21
  • @R..: I ignored (on purpose) the differences between Pascal and e.g. C++/Java strings - I want the latter. Sharing is not so easy: Java strings share storage, but doing it in C++ is harder - you basically need some equivalent of garbage collection (e.g. thread-safe reference counting). The same issues would apply in C if you want to make the sharing transparent. This link describes some early further issues with storage sharing in C++ - I believe they have later been solved, but my point is that those issues are nontrivial: http://www.sgi.com/tech/stl/string_discussion.html – Blaisorblade Sep 13 '11 at 18:35
  • @R..: actually, I just used dsm's definition of Pascal strings (which is equivalent to C++/Java's one). – Blaisorblade Sep 13 '11 at 18:41
  • @Guffa: however you call it, you could choose an encoding to use when the termination character appears in the string. – Blaisorblade Sep 13 '11 at 18:43
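The "independent length/pointer pairs" idea from the comments above can be sketched roughly as follows. The `str_view` type and both function names are hypothetical, invented here for illustration; nothing like them exists in the C standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning view: a pointer into someone else's bytes plus a length. */
typedef struct {
    const char *ptr;
    size_t len;
} str_view;

/* Returns the view [start, start + count) of v, clamped to v's bounds.
 * No bytes are copied and no terminator is needed. */
str_view view_sub(str_view v, size_t start, size_t count)
{
    assert(v.ptr != NULL);
    if (start > v.len)
        start = v.len;
    if (count > v.len - start)
        count = v.len - start;
    str_view out = { v.ptr + start, count };
    return out;
}

/* Compares two views byte for byte; embedded '\0' bytes are ordinary data. */
int view_equal(str_view a, str_view b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.ptr, b.ptr, a.len) == 0);
}
```

With this layout, `view_sub(v, 6, 5)` on a view of "Hello World" refers to "World" in place, and a view of the first five bytes refers to "Hello", which a null-terminated string cannot express without copying.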

8 Answers

39

Terminating with a 0 has many performance niceties, which were very much relevant back in the late 60s.

CPUs have instructions for conditional jump on test for 0. In fact, some CPUs even have instructions which will iterate/copy a sequence of bytes up to the 0.

If you used an escaped character instead, you would have to test TWO different bytes to detect the end of the string. Not only is that slower, but you also lose the ability to iterate one byte at a time, since you need a look-ahead or the ability to backtrack.
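To make the difference concrete, here is a minimal sketch of the two scans. The escaped scheme is purely hypothetical: it assumes a backslash byte introduces a two-byte sequence, with backslash-dollar meaning "end of string" and backslash followed by any other byte standing for that byte. Both function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Conventional scan: one comparison against zero per byte. */
size_t len_nul(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Hypothetical escaped scheme: a backslash followed by '$' ends the string;
 * a backslash followed by any other byte is an escaped pair.
 * Returns the number of raw bytes before the terminator. */
size_t len_escaped(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (;;) {
        if (s[n] == '\\') {
            if (s[n + 1] == '$')   /* look-ahead needed on every backslash */
                return n;
            n += 2;                /* skip the escaped pair as a unit */
        } else {
            n++;
        }
    }
}
```

The second loop can hold a zero byte as ordinary data, but it pays for that with an extra branch and a second load whenever it meets the escape byte.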

Now, other languages (cough, Pascal, cough) use strings in a count/value style. For them, any character is valid, but they always keep a counter with the size of the string. The advantage is clear, but there are disadvantages to this technique too.

For one thing, the string size is limited by the number of bytes the count takes. One byte gives you 255 characters, two bytes gives you 65535, etc. It might be almost irrelevant today, but adding two bytes to every string once was quite expensive.
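As a rough illustration of the count/value layout and its size limit, here is a sketch of a string with a one-byte count. The type `pstr255` and the function `pstr255_set` are hypothetical names, not taken from any Pascal runtime or C library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: one length byte, then up to 255 data bytes. */
typedef struct {
    unsigned char len;
    char data[255];
} pstr255;

/* Copies n bytes from src into dst. Any byte value is allowed, including 0.
 * Returns 0 on success, -1 if n does not fit in the one-byte count. */
int pstr255_set(pstr255 *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n > 255)
        return -1;
    dst->len = (unsigned char)n;
    memcpy(dst->data, src, n);
    return 0;
}
```

Reading the length is a single load of `len`, but a 256-byte input is rejected outright; widening the count to two bytes raises the limit to 65535 at the cost of one more byte per string.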

Edit:

I do not think the question is dumb. In these days of high level languages with memory management, incredible CPU power and obscene amounts of memory, such decisions from the past can well seem senseless. And, indeed, they MIGHT be senseless nowadays, so it's a fine thing to question them.

Daniel C. Sobral
  • 5
    +1 for mentioning the CPU. Your "some CPUs" includes Intel's x86 instruction set (though maybe those instructions aren't used much anymore). – ChrisW Jul 19 '09 at 23:08
  • 2
    If you define your own string structure, You can make the 255 value of the size byte, indicate that another size byte follows. – Liran Orevi Jul 19 '09 at 23:09
  • 2
    The performance characteristics are still relevant today in many situations. It's important in embedded systems, and in kernel/driver development where you still want to scrape and save every CPU cycle you can. Which is why C is still king in these areas. – Gerald Jul 19 '09 at 23:28
  • It's senseless to not know Wirth's Law. Especially now, where the hardware trends are to push the envelope in how SMALL a computer can be. – NoMoreZealots Jul 29 '09 at 01:25
  • 1
    @Pete Eddy: that has nothing to do with the issue. I'm talking about decisions made when total available RAM memory was smaller than the memory used by today's CPU registers. The hardware trends are going nowhere close to that. – Daniel C. Sobral Jul 29 '09 at 03:56
  • @Gerald: C++ is fast and uses Pascal strings, which make strlen() O(1) rather than O(N). @Liran: decoding such an encoding is too expensive in most cases. – Blaisorblade Nov 19 '10 at 12:16
  • @Blaisorblade... mainly because std::string and std::wstring share a common base class, so they can't make use of the CPU instructions that act on 0 termination. But std::string is still slower than C strings in almost every circumstance, except for strlen/string::length. – Gerald Nov 19 '10 at 19:44
  • Well, also because they are required to allow 0 characters within the string, which seems a bit silly since it can result in some unintuitive (though documented) behavior. – Gerald Nov 19 '10 at 19:48
  • @ChrisW: comparisons with zero are still faster (see jz), and that's why for (i = N - 1; i >= 0; i--) {...} is slightly faster (I've read that this applies even to Java). On the other hand, "REPNE SCASB", which can implement strlen, and is maybe less used, has no direct special case for 0, it looks for the value in AL (but it's maybe faster to XOR AL, AL than to load other immediates). – Blaisorblade Nov 20 '10 at 13:09
  • @Blaisorblade: This specifically does not apply to Java; you can look it up. However, this whole discussion is nonsensical when speaking of modern hardware, where what matters is the cache hit/cache miss ratio, locality, and non-contention of data between threads. – Daniel C. Sobral Nov 20 '10 at 20:35
  • Pascal-style strings generally require one to know in advance the maximum length. One could get around this limitation by using a variable-length field (one byte for 0-127 characters; two for 0-16383; three for 0-2M; four for 0-256M; five for 0-32G). Growing a string may require moving the whole thing by one byte, but that wouldn't be the most horrible thing in the world. Unfortunately, one would lose the ability to have a pointer to the tail end of a string. Not used too terribly often, but very handy in some cases. – supercat Nov 26 '10 at 22:56
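The variable-length count suggested in the last comment (one byte for 0-127, two for 0-16383, and so on) matches a scheme with seven payload bits per byte and a "more bytes follow" flag in the high bit. A minimal sketch under that assumption follows; the function names are hypothetical, and the decoder trusts that its input was produced by the encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes len as 7-bit groups, lowest group first; a set high bit means
 * another byte follows. out must have room for 10 bytes.
 * Returns the number of bytes written. */
size_t varlen_encode(uint64_t len, unsigned char *out)
{
    assert(out != NULL);
    size_t i = 0;
    while (len >= 0x80) {
        out[i++] = (unsigned char)((len & 0x7F) | 0x80);
        len >>= 7;
    }
    out[i++] = (unsigned char)len;
    return i;
}

/* Reads a length written by varlen_encode (well-formed input assumed).
 * Returns the number of bytes consumed. */
size_t varlen_decode(const unsigned char *in, uint64_t *len)
{
    assert(in != NULL && len != NULL);
    uint64_t value = 0;
    unsigned shift = 0;
    size_t i = 0;
    for (;;) {
        unsigned char b = in[i++];
        value |= (uint64_t)(b & 0x7F) << shift;
        if ((b & 0x80) == 0)
            break;
        shift += 7;
    }
    *len = value;
    return i;
}
```

Short strings pay only one byte of overhead, but the data no longer starts at a fixed offset, and growing a string past a boundary such as 127 means shifting the payload by one byte.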
13

You need to have some actual byte value to terminate a string - how you represent it in code isn't really relevant.

If you used \$ to terminate strings, what byte value would it have in memory? How would you include that byte value in a string?

You're going to hit this problem whatever you do, if you use a special character to terminate strings. The alternative is to use counted strings, whereby the representation of a string includes its length (eg. BSTR).

RichieHindle
  • Okay, so \$ would point to some value that is currently unused. – akway Jul 19 '09 at 22:54
  • 4
    But there are no "unused" byte values. Any byte can occur in a C string - you might as well say that \0 was chosen because it was unused. – RichieHindle Jul 19 '09 at 22:56
  • Like what? If you are using UTF-8, then the entire range is used. – Michael Aaron Safyan Jul 19 '09 at 22:56
  • C strings do not support UTF-8, traditionally. UTF-8 did not exist when C was invented, and did not exist for a few decades afterwards. – Daniel C. Sobral Jul 19 '09 at 22:59
  • Except for U+0000, UTF-8 never encodes to a \0. You're probably thinking of UCS-2/UTF-16. – staticsan Jul 19 '09 at 23:10
  • 1
    C libraries NEED a terminal value, Pascal style strings use a length parameter. It was merely a design choice, not the only way to do it. And most people would argue that C's string handling sucks because of it. – NoMoreZealots Jul 19 '09 at 23:11
  • 1
    I don't think "most people" would agree to that. Handling strings in C may not be as elegant as with some other solutions, but from a speed standpoint it's no contest, and C was designed for speed. See Daniel's answer for why. – Gerald Jul 19 '09 at 23:19
  • A C string is just an array of bytes, so its encoding depends on how you interpret it. There is no reason that you can't interpret a const char* sequence as a UTF-8 encoded string. Additionally, a lot of UNIX implementations now will interpret const char* parameters as UTF-8 encoded sequences. Java's non-standard "UTF-8" (actually a variant of CESU-8) encodes embedded nulls as something other than '\0', for the standard UTF-8, NUL is '\0' and will terminate the string. – Michael Aaron Safyan Jul 20 '09 at 16:24
2

I guess because it's faster to check, and totally improbable to occur in a reasonable string. Also, remember that C has no concept of strings. A string in C is not something by itself. It's just an array of characters. The fact that it's called and used as a string is purely incidental and conventional.

Stefano Borini
1

It causes problems but you can embed a \0 ...

const char* hello = "Hello\0World\0\0";

It causes a problem if you pass this to a standard library function like strlen, but not otherwise.
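A small sketch of what "but not otherwise" can look like in practice. It uses a shorter literal than the one above and declares it as an array so that sizeof reports the stored size; the helper `count_pieces` is a hypothetical name introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Array form: sizeof hello is 12 (11 written bytes plus the implicit '\0'),
 * while strlen(hello) stops at the first '\0' and reports 5. */
static const char hello[] = "Hello\0World";

/* Counts the '\0'-separated pieces in the first size bytes of buf.
 * The size is passed explicitly, so embedded '\0' bytes do not end the scan. */
size_t count_pieces(const char *buf, size_t size)
{
    assert(buf != NULL);
    size_t count = 0;
    size_t pos = 0;
    while (pos < size) {
        const char *end = memchr(buf + pos, '\0', size - pos);
        count++;
        if (end == NULL)
            break;                       /* last piece has no terminator */
        pos = (size_t)(end - buf) + 1;   /* continue after the '\0' */
    }
    return count;
}
```

Calling `count_pieces(hello, sizeof hello - 1)` sees both "Hello" and "World", because the caller supplies the length instead of relying on the terminator.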

A better solution than any string-terminating character might be to prepend the length of the string like ...

const char* hello = "\x0BHello World";

... which is the way some other languages do it.

ChrisW
  • 1
    Nice examples, but you may want the prefixed string-length in your example to actually reflect the length of the string? (I think you forgot to count the space) – jerryjvl Jul 19 '09 at 23:00
  • Thanks for noting that. I counted to C, re-counted and decided that C was one too many, and then erroneously wrote down A as if C minus 1 was A. I've corrected it now. – ChrisW Jul 19 '09 at 23:05
  • 1
    Reminds me of the old days with Hollerith constants in FORTRAN, so you'd have a string like 16HTHIS IS A STRING. Woe be unto you if you miscounted! The newfangled quoted strings that showed up later were much nicer. – David Thornley Oct 30 '09 at 21:50
0

If standard library functions like strlen or printf could (option-wise) look for an end-of-string marker \377 (as an alternative to \000), you could have a constant character string containing \0s:

const char* hello = "Hello\0World\0\0\377";
printf("%s\n", hello); 

By the way, if you want to send a \0 to stdout (aka -print0) you may use:

putchar(0); 
0

Ditto on the historical reasons.

The creators of std::string in C++ recognized this shortcoming, so std::string can include the null character. (But be careful constructing a std::string with a null character!)

If you want to have a C-string (or rather, a quasi-C-string) with a null character, you will have to make your own struct.

typedef struct {
    size_t length;
    char data[]; /* C99 introduced the flexible array member */
} my_string;

Or you'll have to keep track of the string length in some other way and pass it to every string function that you write.
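A minimal sketch of how such a struct might be allocated and compared, assuming C99. The helper names `my_string_new` and `my_string_equal` are hypothetical and only illustrate passing the length alongside the bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t length;
    char data[];          /* flexible array member (C99) */
} my_string;

/* Allocates a my_string holding a copy of n bytes from src.
 * Returns NULL on overflow or allocation failure; release with free(). */
my_string *my_string_new(const char *src, size_t n)
{
    assert(src != NULL);
    if (n > SIZE_MAX - sizeof(my_string))
        return NULL;
    my_string *s = malloc(sizeof *s + n);
    if (s == NULL)
        return NULL;
    s->length = n;
    memcpy(s->data, src, n);
    return s;
}

/* Compares two strings byte for byte, embedded '\0' bytes included. */
int my_string_equal(const my_string *a, const my_string *b)
{
    assert(a != NULL && b != NULL);
    return a->length == b->length &&
           memcmp(a->data, b->data, a->length) == 0;
}
```

Header and bytes live in one allocation, so a single free() releases both. The data is not null-terminated, so it must not be handed to strlen or printf("%s") as is.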

Paul Draper
0

Not to necro-post deliberately, but this is still highly relevant for embedded SQL.

If you are dealing with binary data in C, you should be creating a binary object in a data structure. If you can afford it, an array of char will suffice. It probably isn't a string anyway, is it?

For hash / digest values, it is common to "HEX" them out into members of {'0',..,'F'}. These can then be "UNHEXED" during the database operation.
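The "HEX" step can be sketched in C as below. This is an illustrative helper with a hypothetical name, `hex_encode`, not a function from any database client library; the output contains only the characters '0' to '9' and 'A' to 'F', so it is safe to treat as an ordinary null-terminated string.

```c
#include <assert.h>
#include <stddef.h>

/* Writes 2*n uppercase hex digits followed by '\0' into out.
 * out must have room for 2*n + 1 bytes. */
void hex_encode(const unsigned char *in, size_t n, char *out)
{
    static const char digits[] = "0123456789ABCDEF";
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = digits[in[i] >> 4];     /* high nibble */
        out[2 * i + 1] = digits[in[i] & 0x0F];   /* low nibble */
    }
    out[2 * n] = '\0';
}
```

The price is a doubling in size: a 16-byte digest becomes 32 characters plus the terminator.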

For file operations, consider a binary stream, with a logical record length.

Escaping them yourself is only really safe if you can guarantee the encoding. In fact this can be seen in a MYSQLDUMP (SQL) unload where the binaries are properly escaped for UTF-8 say, and the installation scheme is 'pushed' for the load and 'popped' afterwards.

I don't advocate using a dbms call for what should be a library function either, but I have seen it done. (select of real_escape_string ($string)).

And there's base64, which is another can of worms. Google UUENCODE.

So yeah, mem* functions if your characters are fixed width.

mckenzm
-1

There is no reason for a nul character to be part of a string except as a terminator; it has no graphical representation, so you wouldn't see it, nor does it act as a control character. As far as text is concerned, it's as out-of-band a value as you can get without using a different representation (e.g., a multibyte value like 0xFFFF).

To slightly rephrase Michael's question, how would you expect "Hello\0World\0" to be handled?

John Bode
  • How do you represent in memory a bag of binary data, which might contain a NUL? The C answer, basically, is "use mem* routines". And if you need to store the length, it goes on with "then invent your own way to store lengths if you need so, and write wrappers for the mem* functions you need". – Blaisorblade Nov 19 '10 at 19:16
  • There are plenty of reasons you might have a zero byte in an array of bytes, or - since C uses 'char' instead of 'byte' - an array of chars. Just remember not to treat this as a string and you'll be fine. A "C string" is a null-terminated char array, though it is in reality not its own data type. That can be the source of confusion. – Paul Draper Nov 09 '12 at 05:30