Before answering anything, please note that, according to n.m. (see the comments on the OP), "a Byte is the smallest quantity available to write out to disk with the C standard library, non-standard libraries may well deal with bits or anything else." So what I said below about WORD sizes being the smallest quantity is probably not very true, but it still provides insight nonetheless.
NULL is always 0_decimal (practically)
dec: 0
hex: 0x00000000
bin: 00000000 00000000 00000000 00000000
although its actual value is defined by each programming language's specification, so use the defined constant NULL instead of hardcoding 0 everywhere (in case it changes, when hell freezes over).
ASCII encoding for character '0' is 48_decimal
dec: 48
hex: 0x00000030
bin: 00000000 00000000 00000000 00110000
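To see those values side by side, here's a minimal C sketch (illustrative only) printing the NUL character, the digit character, and the NULL pointer constant:

#include <stdio.h>
#include <stddef.h>  /* defines NULL */

int main(void) {
    /* Three distinct "zero-ish" things: */
    printf("'\\0' (NUL character) as decimal: %d\n", '\0');        /* prints 0  */
    printf("'0'  (digit zero)    as decimal: %d\n", '0');          /* prints 48 */
    printf("NULL (pointer constant):         %p\n", (void *)NULL); /* 0x0 or (nil) */
    return 0;
}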
The concept of NULL doesn't exist in a file, but within the generating app's programming language; just the numeric encoding/value of NULL exists in a file.
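As a quick illustration of that point, the sketch below (the file name data.bin is just a placeholder) writes an embedded 0 byte to disk with the standard C library; fwrite takes an explicit byte count and never scans the data for a terminator:

#include <stdio.h>

int main(void) {
    /* 14 bytes of payload, including the embedded 0x00 after "Hello" */
    const char data[] = "Hello\0, World!";
    FILE *f = fopen("data.bin", "wb");
    if (!f) return 1;
    /* fwrite copies exactly sizeof data - 1 bytes (dropping only the
       implicit trailing '\0' of the literal); the embedded NUL goes
       to disk like any other byte value. */
    fwrite(data, 1, sizeof data - 1, f);
    fclose(f);
    return 0;
}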
How is it possible that files can contain null bytes in operating systems written in a language with null-terminating strings (namely, C)?
With the above stated, this question becomes: how can a file contain 0? The answer is now trivial.
For example, if I run this shell code:
$ printf "Hello\00, World!" > test.txt
$ xxd test.txt
0000000: 4865 6c6c 6f00 2c20 576f 726c 6421  Hello., World!
I see a null byte in test.txt (at least in OS X). If C uses null-terminating strings, and OS X is written in C, then how come the file isn't terminated at the null byte, resulting in the file containing Hello instead of Hello\00, World!?
Is there a fundamental difference between files and strings?
Assuming an ASCII character encoding (1-byte/8-bit characters in the decimal range 0 to 127):
- Strings are buffers/char-arrays of 1-byte characters (where NULL = 0_decimal and '0' = 48_decimal).
- Files are sequences of either 32-bit or 64-bit "WORDS" (depends on OS and hardware, i.e., x86 or x64 respectively).
Therefore, a file on a 32-bit OS that contains only ASCII strings will be a sequence of 32-bit (4-byte) words whose decimal values range between 0 and 127, essentially using only the first byte of each 4-byte word (b2 = base-2/binary; decimal is base-10 and hex is base-16, fyi):
0_b2: 00000000 00000000 00000000 00000000
32_b2: 00000000 00000000 00000000 00100000
64_b2: 00000000 00000000 00000000 01000000
96_b2: 00000000 00000000 00000000 01100000
127_b2: 00000000 00000000 00000000 01111111
128_b2: 00000000 00000000 00000000 10000000
Whether this byte is left-most or right-most depends on the platform's endianness.
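A common C idiom for checking which one your machine uses, sketched here for illustration:

#include <stdio.h>

int main(void) {
    unsigned int word = 1;  /* 0x00000001 */
    unsigned char *first = (unsigned char *)&word;
    /* On little-endian hardware (e.g. x86) the low-order byte is
       stored at the lowest address, so *first is 1; on big-endian
       hardware it is 0. */
    printf("%s-endian\n", *first == 1 ? "little" : "big");
    return 0;
}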
But to answer your question about the "missing" NULL in Hello\00, World!: it isn't missing at all. It is the 00 byte right after 6f in the hex dump (4865 6c6c 6f00 ...), and xxd prints it as a . in the ASCII column because 0 is a non-printable character, which is why you're not seeing it in the output window.
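You can verify the byte is really in the file by reading it back: fgetc signals end-of-file with the out-of-band EOF return value rather than a byte stored in the file, so it reads straight past the 0x00. A minimal sketch, assuming the test.txt generated above:

#include <stdio.h>

int main(void) {
    FILE *f = fopen("test.txt", "rb");
    if (!f) return 1;
    int c;
    /* EOF is an int returned by fgetc when the file is exhausted,
       not a byte in the file; a 0x00 byte is read like any other. */
    while ((c = fgetc(f)) != EOF)
        printf("%02x ", (unsigned char)c);
    printf("\n");  /* output includes a lone 00 after 48 65 6c 6c 6f */
    fclose(f);
    return 0;
}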
Note: I'm sure modern OSes (and classic Unix-based systems) optimize the storage of ASCII characters, so that 1 word (4 bytes) can pack in 4 characters. Things change with UTF, however, since those encodings use more bits to store characters, because they have larger alphabets/character sets to represent (like the ~50k Kanji/Japanese characters). UTF-8 is backward-compatible with ASCII: the first 128 code points are encoded identically in a single byte, and multi-byte sequences are used for everything beyond them (unlike UTF-16 and UTF-32, which use wider code units).
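To illustrate the ASCII/UTF-8 relationship: strlen counts bytes, not characters, so it diverges from the character count as soon as you leave the ASCII range. A small sketch, assuming the second literal holds the UTF-8 bytes for U+65E5 (日):

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *ascii = "Hello";            /* 5 characters, 5 bytes */
    const char *kanji = "\xE6\x97\xA5";     /* U+65E5 (日): 1 character, 3 bytes in UTF-8 */
    printf("%zu\n", strlen(ascii));  /* 5 */
    printf("%zu\n", strlen(kanji));  /* 3: bytes, not characters */
    return 0;
}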
Note: C/C++ does in fact "pack" 4 characters into a single 4-byte word using character arrays (i.e., strings). Since each char is 1 byte, the compiler will allocate and treat it as 1 byte, arithmetically, on the stack or heap. So if you declare an array in a function (i.e., an auto variable), like so
char str1[7] = {'H','e','l','l','o','!','\0'};
where the function stack begins at address 1000_b10 (base-10/decimal), then you have:
addr char binary decimal
---- ----------- -------- -------
1000: str1[0] 'H' 01001000 (072)
1001: str1[1] 'e' 01100101 (101)
1002: str1[2] 'l' 01101100 (108)
1003: str1[3] 'l' 01101100 (108)
1004: str1[4] 'o' 01101111 (111)
1005: str1[5] '!' 00100001 (033)
1006: str1[6] '\0' 00000000 (000)
Since RAM is byte-addressable, every address references a single byte.
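You can watch that byte-addressing directly by printing each element's address, as in this small sketch of the table above:

#include <stdio.h>

int main(void) {
    char str1[7] = {'H','e','l','l','o','!','\0'};
    for (int i = 0; i < 7; i++)
        /* consecutive addresses, exactly one byte apart */
        printf("%p: str1[%d] = %3d\n", (void *)&str1[i], i, str1[i]);
    return 0;
}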