0

I know that strlen() does not count the terminating NUL character. I really do know that this is a fact. So this question is NOT about why strlen() "presumably" fails to return the right string length; that has already been asked and answered well here on StackOverflow, e.g. in this thread, or this one.

So let's go ahead to my question:

ISO/IEC 9899:1990 (E), 7.1.1, states:

A string is a contiguous sequence of characters terminated by and including the first null character.

Why does strlen() deviate from this definition in the standard and not "want" to count the terminating NUL character as part of the string?

Why?

Hatted Rooster
  • 3
    The terminating NUL byte is metadata (and an implementation detail), the other characters are the data. You could easily make a `strlen_w_terminator` function. – Eljay Oct 19 '19 at 12:47
  • Because changing it would break so much code that it might as well be adopted by the standard. Remember the 1st `C` standard came along and basically codified the existing working practices of the language, – Richard Critten Oct 19 '19 at 12:47
  • The typical use-case is to get the length of the actual characters in the string, which excludes the terminator. It also follows the zero-based indexing semantics, so that `some_string[strlen(some_string)]` will always be the terminator. – Some programmer dude Oct 19 '19 at 12:50
  • @Eljay Who says it's metadata? And even if it should be classified as "metadata", it is defined to be part of the string, and in fact, it is. – RobertS supports Monica Cellio Oct 19 '19 at 12:50
  • 1
    When you concatenate two strings together, the terminator is not retained as part of the first string. Because it is metadata. It is used to mark the termination of the string. It's not part of the data itself. `std::string` does not treat `'\0'` as metadata, it is part of the string itself. – Eljay Oct 19 '19 at 12:53
  • If someone asks you what the length of the string `"ABC"` is, you would probably answer "Three". `strlen()` is specified to do the same, despite that string being represented by four characters. Practically, it is more important that the behaviour of `strlen()` is consistent across implementations than whether the nul terminator is counted. – Peter Oct 19 '19 at 13:15
  • 2
    It is consistent with other languages. For example, in MS BASIC a string is a structure composed of a field that holds the length of the string and an array holding the characters. When you use the BASIC `Len` operator, do you expect to get the number of effective characters in the string or the size of the whole structure? As many have already said, the terminating `\0` null character is **not part of the string**, but a functional part of the type representation. Add to that the fact that **C** has no native string type: consider it a **composite type** whose `strlen` method returns the number of effective user characters. – Frankie_C Oct 19 '19 at 13:27

4 Answers

7

Because you would expect this pseudocode's assertion to hold true:

str1 = "foo"
str2 = "bar"
str3 = concatenate(str1, str2)

Assert strlen(str1) + strlen(str2) == strlen(str3)

If strlen counted the terminating '\0', the above assertion would not hold (the two inputs would count 4 + 4 = 8, but the concatenated result only 7), which would be much more of an overall headache than the current C string behavior is. More importantly, it would in my opinion be quite unintuitive and illogical.
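As a rough illustration in real C, the pseudocode's `concatenate` could be sketched as below. The name `concat_strings` is hypothetical and not a standard library function; the sketch assumes the caller frees the returned buffer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the C library): returns a newly
   allocated concatenation of a and b, or NULL if allocation fails.
   The caller is responsible for calling free() on the result. */
char *concat_strings(const char *a, const char *b)
{
    size_t len_a = strlen(a);               /* text only, no terminator */
    size_t len_b = strlen(b);
    char *out = malloc(len_a + len_b + 1);  /* one terminator for the result */

    if (out == NULL)
        return NULL;

    memcpy(out, a, len_a);                  /* a's terminator is dropped */
    memcpy(out + len_a, b, len_b + 1);      /* b's terminator ends the result */
    return out;
}
```

With this helper, `strlen("foo") + strlen("bar")` and `strlen(concat_strings("foo", "bar"))` should both be 6; only the allocation needs the extra `+ 1`.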

hyde
  • Also allows `char* strcat( char* dest, const char* src) { strcpy(dest + strlen(dest), src); return dest; }` ... some things are more convenient, other things have to `+1` or `-1` to accommodate the physical representation. – Eljay Oct 19 '19 at 13:20
  • @Eljay My point isn't really about convenience, but about what is logical. Edited the question to emphasize this. – hyde Oct 19 '19 at 13:39
3

Taking your doubt as a reasonable point, we can say that a C string consists of two parts:

  1. the string's useful content ("the text");
  2. the null terminating character;

The null terminating character is purely a technical measure that lets the C library functions determine where the string ends. Still, if one writes a declaration:

char * str = "some string";

they would logically expect its length to be 11, which is the number of characters they can see in this statement. Hence strlen() yields only the length of part 1 of the string.

bloody
  • @Yunnosch I don't want to judge if *unrewarded* but it's good to hear a nice word at the end of the work-week. Thank you very much :) – bloody Jan 08 '21 at 18:37
3

Not really an answer to your question, but consider this example:

char string[] = "string";
printf("sizeof: %zu\n", sizeof(string));
printf("strlen: %zu\n", strlen(string));

This prints

sizeof: 7
strlen: 6

So sizeof counts the \0, but strlen doesn't.

Questions like this, that ask why a certain age-old decision was made one way and not another way, are hard to answer. I can say that it's perfectly obvious to me, anyway, that strlen should count just the real, "interesting" characters that are in the string, and ignore the \0 at the end that merely terminates it. I'm used to accounting for the \0 separately. I imagine it would have been considerably more of a nuisance overall if strlen had been defined the other way. But I can't prove this with convincing arguments, and I've been using strlen with its current definition for so long that I'm probably hopelessly biased; I might be saying "it's perfectly obvious to me that..." even if strlen's definition were quite wrong.

Steve Summit
3

There is a difference between the physical, stored representation of a C style string and the logical representation of a C style string.

The physical representation, how the string is actually stored in memory or on other media, includes the null character. The null character is included when discussing the physical representation because it takes up an additional piece of storage. To be a C style string, the null character must be stored.

However, the logical representation of a string does not include the null character. The logical representation of a string includes only the text characters that the programmer wants to manipulate.

I suspect that the null character, a value of binary zero, was chosen because the original ASCII character set defined the character value of zero as the NUL character. Being among the low-valued teletype control codes, it seems to be the ASCII character least likely to appear in text. See ASCII Character Codes.

Another nice quality of using binary zero as the string terminator is that it is the value that represents logical false. Iterating over a string is therefore often a matter of incrementing an array index or a pointer while the current character is logically true, since every character other than the end-of-string indicator has a non-zero, logically true value.
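A minimal sketch of that idiom, written as a hand-rolled length function. The name `my_strlen` is made up for illustration; real library implementations of strlen() are usually more heavily optimized.

```c
#include <stddef.h>

/* Hypothetical re-implementation of strlen(), for illustration only. */
size_t my_strlen(const char *s)
{
    const char *p = s;

    while (*p)      /* every non-NUL character tests as true */
        p++;        /* the loop stops ON the terminator, without counting it */

    return (size_t)(p - s);
}
```

Because the loop exits as soon as it sees the zero byte, the terminator naturally falls outside the count.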

Because the C programming language is so close to the hardware, the programmer needs to be concerned with both representations: the physical representation, which includes the null character, when allocating memory to store a string, and the logical representation, which is the string without the null character.
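The two representations meet whenever storage is allocated for a copy. The following is only a sketch; `duplicate_string` is a hypothetical name, and the caller is assumed to free the result.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of src,
   or NULL if allocation fails. The caller frees the result. */
char *duplicate_string(const char *src)
{
    size_t text_len = strlen(src);   /* logical length: text only */
    size_t storage  = text_len + 1;  /* physical size: text plus terminator */
    char *copy = malloc(storage);

    if (copy != NULL)
        memcpy(copy, src, storage);  /* copies the terminator as well */

    return copy;
}
```

For "some string", the logical length is 11 while the storage requested is 12 bytes.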

The various C style string manipulation functions in the Standard Library (strlen(), strcpy(), etc.) are all designed around the logical representation of a C style string. They perform their actions by treating the null character not as part of the text but as a special indicator character that marks the end of the string. However, as part of their operations they need to be aware of the null character and its use as a special symbol. For instance, when strcpy() or strcat() are used to copy strings, they must also copy the null character that indicates the end of the string even though it is not part of the actual text of the logical representation.
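To make the copying case concrete, here is one way a strcpy()-like function might be written. `my_strcpy` is a hypothetical name and this is a sketch, not how any particular library implements it; as with strcpy(), the destination is assumed to be large enough.

```c
/* Hypothetical re-implementation of strcpy(), for illustration only.
   dest must have room for the text of src plus the terminator. */
char *my_strcpy(char *dest, const char *src)
{
    char *d = dest;

    /* The assignment copies each character, INCLUDING the terminator;
       the loop ends only after the '\0' itself has been written. */
    while ((*d++ = *src++) != '\0')
        ;

    return dest;
}
```

The terminator is used as the stop signal and is also written to the destination, so the result is again a valid C style string.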

This choice allows text strings to be stored as arrays of characters, as befits the hardware orientation and efficiency characteristics of C. There is no need to create an additional built in type for text strings and it fits well with the lean character of the C programming language.

C++ is able to provide std::string because it is object oriented and has additional language facilities that allow objects to be created and managed. The C programming language, due to its simple syntax and lack of object oriented facilities, does not have this convenience.

The problem with this approach is that the programmer needs to be aware of both the physical representation and the logical representation of text strings and be able to accommodate the needs of both when writing programs.

Richard Chambers