In C, strings are terminated with a null character ( \0 ), which causes problems when you want to put a null in a string. Why not use a special escaped character such as \$ or something?
I am fully aware of how dumb this question is, but I was curious.
Terminating with a 0 has many performance niceties, which were very much relevant back in the late 60s.
CPUs have instructions for conditional jump on test for 0. In fact, some CPUs even have instructions which will iterate/copy a sequence of bytes up to the 0.
If you used an escaped character instead, you would have to test TWO different bytes to assert the end of the string. Not only is that slower, but you lose the ability to iterate one byte at a time, as you need a look-ahead or the ability to backtrack.
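To make the single-byte advantage concrete, here is a minimal sketch of the classic one-byte-at-a-time scan that a 0 sentinel enables (the function name is illustrative, not a standard one):

```c
#include <stddef.h>

/* Count characters up to (not including) the terminating 0.
   One test per byte, no look-ahead, no backtracking. */
size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p)     /* single test: is this byte 0? */
        p++;
    return (size_t)(p - s);
}
```

With a two-byte escape sequence as the terminator, the loop body would need to inspect the next byte as well before deciding whether to stop.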
Now, other languages (cough, Pascal, cough) use strings in a count/value style. For them, any character is valid, but they always keep a counter with the size of the string. The advantage is clear, but there are disadvantages to this technique too.
For one thing, the string size is limited by the number of bytes the count takes. One byte gives you 255 characters, two bytes gives you 65535, etc. It might be almost irrelevant today, but adding two bytes to every string once was quite expensive.
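A rough sketch of the Pascal-style count/value layout in C terms (the type and helper names here are made up for illustration); note how the one-byte count directly imposes the 255-character limit:

```c
#include <string.h>

/* Pascal-style "counted" string: one length byte, then the data.
   Any byte value, including 0, is valid inside the data. */
typedef struct {
    unsigned char len;     /* one-byte count => at most 255 characters */
    char data[255];
} pstring;

pstring make_pstring(const char *src, size_t n)
{
    pstring p;
    p.len = (unsigned char)(n > 255 ? 255 : n);
    memcpy(p.data, src, p.len);   /* length-driven: embedded 0s survive */
    return p;
}
```

Widening the count to two or four bytes lifts the limit, at the cost the paragraph above describes: extra bytes on every string.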
Edit:
I do not think the question is dumb. In these days of high level languages with memory management, incredible CPU power and obscene amounts of memory, such decisions from the past can well seem senseless. And, indeed, they MIGHT be senseless nowadays, so it's a fine thing to question them.
You need to have some actual byte value to terminate a string - how you represent it in code isn't really relevant.
If you used \$ to terminate strings, what byte value would it have in memory? How would you include that byte value in a string?
You're going to hit this problem whatever you do, if you use a special character to terminate strings. The alternative is to use counted strings, whereby the representation of a string includes its length (e.g. BSTR).
I guess because it's faster to check, and totally improbable to occur in a reasonable string. Also, remember that C has no concept of strings. A string in C is not something by itself. It's just an array of characters. The fact that it's called and used as a string is purely incidental and conventional.
It causes problems but you can embed a \0 ...
const char* hello = "Hello\0World\0\0";
It causes a problem if you pass this to a standard library function like strlen, but not otherwise.
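To make that point concrete, here is a small sketch (using an array rather than a pointer so that sizeof reports the full extent; the helper names are illustrative):

```c
#include <string.h>

/* 13 bytes of data plus the implicit terminating 0 => 14 bytes total. */
static const char hello[] = "Hello\0World\0\0";

/* strlen stops at the first embedded 0 ... */
size_t visible_length(void) { return strlen(hello); }   /* yields 5 */

/* ... but sizeof sees the whole array, embedded zeros included,
   so length-aware functions such as memcpy can still handle it. */
size_t stored_length(void) { return sizeof hello; }     /* yields 14 */
```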
A better solution than any string-terminating character might be to prepend the length of the string like ...
const char* hello = "\x0BHello World";
... which is the way some other languages do it.
If standard library functions like strlen or printf could (option-wise) look for an end-of-string marker \777 (as an alternative to \000), you could have a constant character string containing \0s (note this is hypothetical: \777 exceeds the range of a single byte, as the largest octal escape that fits in a char is \377):
const char* hello = "Hello\0World\0\0\777";
printf("%s\n", hello);
By the way, if you want to send a \0 to stdout (aka -print0) you may use:
putchar(0);
Ditto on the historical reasons.
The creators of std::string in C++ recognized this shortcoming, so std::string can include the null character. (But be careful constructing a std::string with a null character!)
If you want to have a C-string (or rather, a quasi-C-string) with a null character, you will have to make your own struct.
typedef struct {
    size_t length;
    char data[]; /* C99 introduced the flexible array member */
} my_string;
Or you'll have to keep track of the string length in some other way and pass it to every string function that you write.
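A hedged sketch of how such a struct might be allocated and used (self-contained, with an illustrative constructor name, assuming C99 for the flexible array member):

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t length;
    char data[];   /* C99 flexible array member */
} my_string;

/* Allocate a my_string holding n arbitrary bytes (0s included). */
my_string *my_string_new(const char *bytes, size_t n)
{
    my_string *s = malloc(sizeof *s + n);   /* header plus n data bytes */
    if (!s)
        return NULL;
    s->length = n;
    memcpy(s->data, bytes, n);              /* length-driven copy */
    return s;
}
```

Every function that consumes a my_string then reads s->length instead of scanning for a terminator.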
Not to necro-post deliberately, but this is still highly relevant for embedded SQL.
If you are dealing with binary data in C, you should be creating a binary object in a data structure. If you can afford it, an array of char will suffice. It probably isn't a string anyway, is it?
For hash / digest values, it is common to "HEX" them out into members of {'0',..,'F'}. These can then be "UNHEXED" during the database operation.
For file operations, consider a binary stream, with a logical record length.
Escaping them yourself is only really safe if you can guarantee the encoding. In fact, this can be seen in a MYSQLDUMP (SQL) unload, where the binaries are properly escaped for, say, UTF-8, and the installation scheme is 'pushed' for the load and 'popped' afterwards.
I don't advocate using a DBMS call for what should be a library function either, but I have seen it done (a SELECT of real_escape_string ($string)).
And there's base64, which is another can of worms. Google UUENCODE.
So yeah, mem* functions if your characters are fixed width.
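The mem* family takes explicit lengths, so embedded zeros are just data, not terminators; a minimal sketch (helper names are illustrative):

```c
#include <string.h>

/* Copy n bytes regardless of content -- 0 bytes included. */
void copy_blob(unsigned char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Offset of the first 0 byte in the first n bytes, or -1 if none.
   With memchr, 0 is just another value to search for. */
long first_zero(const unsigned char *buf, size_t n)
{
    const unsigned char *z = memchr(buf, 0, n);
    return z ? (long)(z - buf) : -1;
}
```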
There is no reason for a nul character to be part of a string except as a terminator; it has no graphical representation, so you wouldn't see it, nor does it act as a control character. As far as text is concerned, it's as out-of-band a value as you can get without using a different representation (e.g., a multibyte value like 0xFFFF).
To slightly rephrase Michael's question, how would you expect "Hello\0World\0" to be handled?