Before answering anything, please note that, according to n.m. (see the comments on the OP), "a Byte is the smallest quantity available to write out to disk with the C standard library, non-standard libraries may well deal with bits or anything else." So what I said below about WORD sizes being the smallest quantity is probably not very true, but it still provides insight nonetheless.
NULL is always 0_decimal (practically)
dec: 0
hex: 0x00000000
bin: 00000000 00000000 00000000 00000000
although its actual value is defined by each programming language's specification, so use the defined constant NULL instead of hardcoding 0 everywhere (in case it changes, when hell freezes over).
ASCII encoding for character '0' is 48_decimal
dec: 48
hex: 0x00000030
bin: 00000000 00000000 00000000 00110000
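To see those values side by side, here's a minimal C sketch (illustrative only) printing the NUL character, the digit character, and the NULL pointer constant:

#include <stdio.h>
#include <stddef.h>  /* defines NULL */

int main(void) {
    /* Three distinct "zero-ish" things: */
    printf("'\\0' (NUL character) as decimal: %d\n", '\0');        /* prints 0  */
    printf("'0'  (digit zero)    as decimal: %d\n", '0');          /* prints 48 */
    printf("NULL (pointer constant):         %p\n", (void *)NULL); /* 0x0 or (nil) */
    return 0;
}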
The concept of NULL doesn't exist in a file, but within the generating app's programming language; just the numeric encoding/value of NULL exists in a file.
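As a quick illustration of that point, the sketch below (the file name data.bin is just a placeholder) writes an embedded 0 byte to disk with the standard C library; fwrite takes an explicit byte count and never scans the data for a terminator:

#include <stdio.h>

int main(void) {
    /* 14 bytes of payload, including the embedded 0x00 after "Hello" */
    const char data[] = "Hello\0, World!";
    FILE *f = fopen("data.bin", "wb");
    if (!f) return 1;
    /* fwrite copies exactly sizeof data - 1 bytes (dropping only the
       implicit trailing '\0' of the literal); the embedded NUL goes
       to disk like any other byte value. */
    fwrite(data, 1, sizeof data - 1, f);
    fclose(f);
    return 0;
}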
How is it possible that files can contain null bytes in operating systems written in a language with null-terminating strings (namely, C)?
With the above stated, this question becomes: how can a file contain 0? The answer is now trivial.
For example, if I run this shell code:
$ printf "Hello\00, World!" > test.txt
$ xxd test.txt
0000000: 4865 6c6c 6f00 2c20 576f 726c 6421  Hello., World!
I see a null byte in test.txt (at least in OS X). If C uses null-terminating strings, and OS X is written in C, then how come the file isn't terminated at the null byte, resulting in the file containing Hello instead of Hello\00, World!?
Is there a fundamental difference between files and strings?
Assuming an ASCII character encoding (1-byte/8-bit characters in the decimal range 0 to 127):
- Strings are buffers/char-arrays of 1-byte characters (where NULL = 0_decimal and '0' = 48_decimal).
- Files are sequences of either 32-bit or 64-bit "WORDS" (depends on OS and hardware, i.e., x86 or x64 respectively).
Therefore, a file on a 32-bit OS that contains only ASCII strings will be a sequence of 32-bit (4-byte) words whose decimal values range between 0 and 127, essentially using only the first byte of each 4-byte word (b2 = base-2/binary; decimal is base-10 and hex is base-16, fyi):
0_b2: 00000000 00000000 00000000 00000000
32_b2: 00000000 00000000 00000000 00100000
64_b2: 00000000 00000000 00000000 01000000
96_b2: 00000000 00000000 00000000 01100000
127_b2: 00000000 00000000 00000000 01111111
128_b2: 00000000 00000000 00000000 10000000
Whether this byte is left-most or right-most depends on the platform's endianness.
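A common C idiom for checking which one your machine uses, sketched here for illustration:

#include <stdio.h>

int main(void) {
    unsigned int word = 1;  /* 0x00000001 */
    unsigned char *first = (unsigned char *)&word;
    /* On little-endian hardware (e.g. x86) the low-order byte is
       stored at the lowest address, so *first is 1; on big-endian
       hardware it is 0. */
    printf("%s-endian\n", *first == 1 ? "little" : "big");
    return 0;
}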
But to answer your question about the "missing" NULL in Hello\00, World!: it isn't missing at all. It is the 00 byte right after 6f in the hex dump (4865 6c6c 6f00 ...), and xxd prints it as a . in the ASCII column because 0 is a non-printable character, which is why you're not seeing it in the output window.
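You can verify the byte is really in the file by reading it back: fgetc signals end-of-file with the out-of-band EOF return value rather than a byte stored in the file, so it reads straight past the 0x00. A minimal sketch, assuming the test.txt generated above:

#include <stdio.h>

int main(void) {
    FILE *f = fopen("test.txt", "rb");
    if (!f) return 1;
    int c;
    /* EOF is an int returned by fgetc when the file is exhausted,
       not a byte in the file; a 0x00 byte is read like any other. */
    while ((c = fgetc(f)) != EOF)
        printf("%02x ", (unsigned char)c);
    printf("\n");  /* output includes a lone 00 after 48 65 6c 6c 6f */
    fclose(f);
    return 0;
}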
Note: I'm sure modern OSes (and classic Unix-based systems) optimize the storage of ASCII characters, so that 1 word (4 bytes) can pack in 4 characters. Things change with UTF, however, since those encodings use more bits to store characters, because they have larger alphabets/character sets to represent (like the ~50k Kanji/Japanese characters). UTF-8 is backward-compatible with ASCII: the first 128 code points are encoded identically in a single byte, and multi-byte sequences are used for everything beyond them (unlike UTF-16 and UTF-32, which use wider code units).
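To illustrate the ASCII/UTF-8 relationship: strlen counts bytes, not characters, so it diverges from the character count as soon as you leave the ASCII range. A small sketch, assuming the second literal holds the UTF-8 bytes for U+65E5 (日):

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *ascii = "Hello";            /* 5 characters, 5 bytes */
    const char *kanji = "\xE6\x97\xA5";     /* U+65E5 (日): 1 character, 3 bytes in UTF-8 */
    printf("%zu\n", strlen(ascii));  /* 5 */
    printf("%zu\n", strlen(kanji));  /* 3: bytes, not characters */
    return 0;
}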
Note: C/C++ does in fact "pack" 4 characters into a single 4-byte word using character arrays (i.e., strings). Since each char is 1 byte, the compiler will allocate and treat it as 1 byte, arithmetically, on the stack or heap. So if you declare an array in a function (i.e., an auto variable), like so
char str1[7] = {'H','e','l','l','o','!','\0'};
where the function stack begins at address 1000_b10 (base-10/decimal), then you have:
addr char binary decimal
---- ----------- -------- -------
1000: str1[0] 'H' 01001000 (072)
1001: str1[1] 'e' 01100101 (101)
1002: str1[2] 'l' 01101100 (108)
1003: str1[3] 'l' 01101100 (108)
1004: str1[4] 'o' 01101111 (111)
1005: str1[5] '!' 00100001 (033)
1006: str1[6] '\0' 00000000 (000)
Since RAM is byte-addressable, every address references a single byte.
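You can watch that byte-addressing directly by printing each element's address, as in this small sketch of the table above:

#include <stdio.h>

int main(void) {
    char str1[7] = {'H','e','l','l','o','!','\0'};
    for (int i = 0; i < 7; i++)
        /* consecutive addresses, exactly one byte apart */
        printf("%p: str1[%d] = %3d\n", (void *)&str1[i], i, str1[i]);
    return 0;
}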