8

I encountered a somewhat annoying bug today where a string (stored as a char[]) would be printed with junk at the end. The string that was supposed to be printed (using Arduino print/write functions) was itself correct (it included \r and \n as expected), yet junk followed it in the output.

I then allocated an extra element to store a '\0' after '\r' and '\n' (which were the last 2 characters in the string to be printed). Then, print() printed the string correctly. It seems '\0' was used to indicate to the print() function that the string had terminated (I remember reading this in Kernighan's C).

This bug appeared in my code which reads from a text file. It occurred to me that I did not encounter '\0' at all when I designed my code. This leads me to believe that '\0' has no practical use in text editors and is merely used by print functions. Is this correct?

Minh Tran
  • 2
    In my opinion, it is perfectly valid for a text file to contain a null (or `\0`). I suggest you start by [finding the size of the file](http://stackoverflow.com/questions/8236/how-do-you-determine-the-size-of-a-file-in-c). – Elliott Frisch Jun 14 '15 at 02:44
  • 5
    If a text file contains null bytes then it is either 1) not really a text file to begin with, but rather a binary file with textual elements in it, or 2) a text file that is encoded using a multi-byte encoding that happens to use null bytes, such as UTF-16. An actual null *character* should never appear in a proper text-only file. – Remy Lebeau Jun 14 '15 at 02:51
  • For some reason, when I sent a note from an old Nokia and opened it in Notepad++, it had a NUL at the end. – markoj Sep 24 '22 at 08:59

4 Answers

11

C strings are terminated by the NUL byte ('\0'). It is implicitly appended to any string literal in double quotes, and it is used as the terminator by all standard library functions operating on strings. From this it follows that C strings cannot contain '\0' in between other characters, since there would be no way to tell whether it is the actual end of the string or not.
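
As a small illustration of both points (the function names here are made up for the example), the sketch below shows the implicit terminator in a literal and how strlen() stops at the first '\0':

```c
#include <stddef.h>
#include <string.h>

/* A string literal gets an implicit '\0' appended, so "abc" occupies
   4 bytes of storage even though strlen("abc") is 3. */
size_t literal_storage_size(void) {
    return sizeof "abc";
}

/* strlen() stops at the first '\0', so the bytes after an embedded
   NUL are invisible to C string functions. */
size_t visible_length(void) {
    const char data[] = "ab\0cd"; /* 6 bytes: a b \0 c d \0 */
    return strlen(data);
}
```

Here literal_storage_size() returns 4 and visible_length() returns 2, even though the array holds six bytes.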

(Of course you could handle strings in the C language other than as C strings - e.g., simply adding an integer to record the length of the string would make the terminator unnecessary, but such strings would not be fully interoperable with functions expecting C strings.)
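
A minimal sketch of such a length-tracked string follows. The struct and function names are hypothetical, not from any standard library; the only point is that the output routine is told how many bytes to write instead of searching for a terminator.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical counted-string type: the length is stored explicitly,
   so the bytes may contain '\0' anywhere and need no terminator. */
struct counted_str {
    size_t len;
    const char *data;
};

/* Writes exactly s.len bytes to out; returns the number of bytes written. */
size_t counted_write(struct counted_str s, FILE *out) {
    return fwrite(s.data, 1, s.len, out);
}
```

Passing such a value to a function like strlen() or puts() would still go wrong, which is the interoperability cost mentioned above.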

A "text file" in general is not governed by the C standard, and a user of a C program could conceivably give a file containing a NUL byte as input to a C program (which would be unable to handle it "correctly" for the above reasons if it read the file into C strings). However, the NUL byte has no valid reason for existing in a plain text file, and it may be considered at least a de facto standard for text files that they do not contain the NUL byte (or certain other control characters, which might break transmission of that text through some terminals or serial protocols).

I would argue that it is an acceptable (though not necessary!) limitation for a program working on plain text input to not guarantee correct output if there are NUL bytes in the input. However, the programmer should be aware of this possibility regardless of whether it will be treated correctly, and not allow it to cause undefined behaviour in their program. Like all user input, it should be considered "unsafe" in the sense that it can contain anything (e.g., it could be maliciously formed on purpose).
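
One way to stay on the safe side, assuming the input was read into a buffer whose length you know (for example from the return value of fread()), is to check for NUL bytes explicitly before treating the buffer as a C string. The helper name below is made up for this sketch:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if the first len bytes of buf contain a NUL byte.
   The length must come from the read call, not from strlen(). */
bool has_embedded_nul(const char *buf, size_t len) {
    return memchr(buf, '\0', len) != NULL;
}
```

What to do when the check fires (reject the input, replace the byte, or process it as binary) is a design decision for the program.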

Arkku
  • 41,011
  • 10
  • 62
  • 84
6

This leads me to believe that '\0' has no practical use in text editors and is merely used by print functions. Is this correct?

This is wrong. In C, the end of a character string is designated by the \0 character. This is commonly known as the null terminator. Almost all string functions declared in the C library under <string.h> use this criterion to check or find the end of a string.

A text file, on the other hand, will not typically have any \0 characters in it. So, when reading text from a file, you have to null-terminate your character buffer before you then print it.
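
A minimal sketch of that pattern, assuming the file is read with fread() into a caller-supplied buffer (the function name is invented for the example): reserve one byte of the buffer and write the terminator right after the bytes that were actually read.

```c
#include <stddef.h>
#include <stdio.h>

/* Reads up to cap - 1 bytes from in and always '\0'-terminates buf.
   Returns the number of bytes read, not counting the terminator. */
size_t read_terminated(FILE *in, char *buf, size_t cap) {
    if (cap == 0) {
        return 0; /* no room even for the terminator */
    }
    size_t n = fread(buf, 1, cap - 1, in);
    buf[n] = '\0';
    return n;
}
```

On Arduino, an alternative is to skip the terminator and pass the length explicitly, as in `Serial.write(buf, n)`, whereas `Serial.print(buf)` on a char array treats it as a C string and needs the '\0'.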

Remy Lebeau
Steephen
1

\0 is the C escape sequence for the null character (ASCII code 0) and is widely used to mark the end of a string in memory. The character normally doesn't appear explicitly in a text file; by convention, however, C strings in memory end with a null terminator. Functions that read a string into memory will generally append a \0 to denote the end of the string, and functions that output a string from memory will similarly expect a \0.

Note that there are other ways of representing strings in memory, for example as a (length, content) pair (Pascal notably used this representation), which do not require a null terminator since the length of the string is known ahead of time.

casablanca
1

Common Text Files

The null character '\0', even if rare, can appear in a text file. Code should be prepared to handle reading '\0'.

This also applies to other char values outside the typical ASCII range, which may be negative when char is signed.
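
As a sketch of handling both cases (the function name is hypothetical): iterate by explicit length so a '\0' does not end the scan early, and cast to unsigned char before calling the <ctype.h> functions, since passing a negative char value to them is undefined behaviour.

```c
#include <ctype.h>
#include <stddef.h>

/* Counts printable bytes in buf[0..len). The loop is driven by len,
   so an embedded '\0' does not stop it. The cast to unsigned char
   matters: a negative char passed to isprint() is undefined. */
size_t count_printable(const char *buf, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (isprint((unsigned char)buf[i])) {
            count++;
        }
    }
    return count;
}
```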

UTF-16

Some "text" files use UTF-16 encoding. Code that encounters such a file while expecting a typical single-byte "text" file will see many null bytes.
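
One cheap check, sketched below with invented names, is to look for a UTF-16 byte-order mark at the start of the data. This is only a heuristic under stated assumptions: a BOM is optional, so UTF-16 files without one are not detected, and a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE.

```c
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspects the first two bytes of buf for a UTF-16 byte-order mark
   (U+FEFF encoded as FF FE or FE FF). Files without a BOM yield
   BOM_NONE even if they are UTF-16. */
enum bom_kind detect_utf16_bom(const unsigned char *buf, size_t len) {
    if (len < 2) {
        return BOM_NONE;
    }
    if (buf[0] == 0xFF && buf[1] == 0xFE) {
        return BOM_UTF16_LE;
    }
    if (buf[0] == 0xFE && buf[1] == 0xFF) {
        return BOM_UTF16_BE;
    }
    return BOM_NONE;
}
```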

Line Length

Lines can be too long or too short (only "\n"), or other "text" problems may exist.


Robust code does not trust user/file input until it is qualified and meets expectations. It does not assume null characters are absent.
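
To make "qualified" concrete, here is one possible sketch. The status names, the function name, and the particular rules (non-empty, bounded length, no NUL bytes) are assumptions chosen for illustration; a real program would pick rules that match its own expectations.

```c
#include <stddef.h>
#include <string.h>

enum line_status { LINE_OK, LINE_EMPTY, LINE_TOO_LONG, LINE_HAS_NUL };

/* Qualifies one line of len bytes before the program uses it.
   max_len is the longest content (without line ending) accepted. */
enum line_status qualify_line(const char *line, size_t len, size_t max_len) {
    /* Strip a trailing '\n' and then a trailing '\r' before measuring. */
    if (len > 0 && line[len - 1] == '\n') {
        len--;
    }
    if (len > 0 && line[len - 1] == '\r') {
        len--;
    }
    if (len == 0) {
        return LINE_EMPTY;
    }
    if (len > max_len) {
        return LINE_TOO_LONG;
    }
    if (memchr(line, '\0', len) != NULL) {
        return LINE_HAS_NUL;
    }
    return LINE_OK;
}
```

The caller passes the length it got from the read operation, so the check does not itself depend on the data being a well-formed C string.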

chux - Reinstate Monica
  • Can you clarify what you mean by "robust code does not trust user/file input until it is qualified and meets expectations"? – Minh Tran Jun 14 '15 at 03:04
  • 3
    @MinhTran It means that you should not make any assumptions about the user's input, e.g., you can't _trust_ it to, say, not include `'\0'` if your program would invoke undefined behaviour as a result. – Arkku Jun 14 '15 at 03:17