I'm learning C, coming from Python, and I'm trying to understand why printf() behaves this way.

My understanding is that a string in C is an array of characters followed by a null character.

If you declare your string like this:

char string[] = "I am a string";
printf("char: %zu bytes\n", sizeof(char));
printf("%s\n", string);
printf("%zu\n", sizeof(string));

The compiler will insert the null character at the end implicitly. Output:

char: 1 bytes
I am a string
14

Notice that a char is 1 byte in size. There are 13 characters in the string (10 letters and 3 spaces), yet the size of the array is 14 bytes. The extra byte is the null character.

Or, we declare our string like this:

char string2[] = {'h', 'e', 'l', 'l', 'o', '\0'};
printf("%s\n", string2);
printf("%zu\n", sizeof(string2));

We manually inserted the null character into our array, and it behaves as expected when we print it or get its size:

hello
6

However, if we don't manually insert the null character, and try to print it:

char string3[] = {'w', 'o', 'r', 'l', 'd'};
printf("%s\n", string3);
printf("%zu\n", sizeof(string3));

we get (or at least, on my machine, I get)

worldhello
5

It prints string2 at the end. We can dig a little deeper to see why:

char *strAddr = string2;
char *str3Addr = string3;
char *lastElemStr3 = &string3[4];

printf("Address of hello string:\t\t\t%p\n", (void *)strAddr);
printf("Address of world string:\t\t\t%p\n", (void *)str3Addr);
printf("Address of string 3 last element + one byte:\t\t%p\n", (void *)(lastElemStr3 + 1));

which gives

Address of hello string: 0x7ffee10768a4
Address of world string: 0x7ffee107689f
Address of string 3 last element + one byte: 0x7ffee10768a4

So printf is "overstepping" our array by one byte and printing whatever is in that memory cell - in this case, it's where the previous array happened to be stored. Why does it do that, only in the case that there's no null character at the end? What benefit does this have?

  • You are using two different types of initializers. Why do you expect them to behave the same? – Support Ukraine Jul 09 '21 at 06:40
  • OT: "Notice how a char is 1 byte in size" well... `sizeof(char)` is always 1. – Support Ukraine Jul 09 '21 at 06:41
  • `printf` reads beyond array bounds as there's no terminating null character included, thus you provoke undefined behaviour. – Aconcagua Jul 09 '21 at 06:44
  • @4386427 can you elaborate on that? – Holden Nelson Jul 09 '21 at 06:46
  • @HoldenNelson From the C spec: *"When sizeof is applied to an operand that has type char [...] the result is 1."* – user3386109 Jul 09 '21 at 06:55
  • @HoldenNelson That means that `char` is the basic, smallest unit available on your system – that's not necessarily an 8-bit byte, though; on some machines `char` can comprise 16 bits. Still, `sizeof(char)` remains 1. How many bits a char comprises is stored in `CHAR_BIT` from `limits.h`. Note that all other sizes are calculated as multiples of the size of char. – Aconcagua Jul 09 '21 at 06:59
  • @HoldenNelson Funny fact: There are 16-bit processors out there having 16-bit `char`, `short` *and* `int`, so all of these have a size of 1 in that case. `long` there has 32 bits and thus size of 2. – Aconcagua Jul 09 '21 at 07:00

3 Answers


Why does it do that, only in the case that there's no null character at the end? What benefit does this have?

It seems that you are aware that strings in C are arrays of characters.

Further, you need to understand that arrays in C don't have their size stored anywhere at run time. You can get the size of an array at compile time using sizeof, but once a program is compiled, there is no way to recover the size of an array from the array itself.

And... when you pass an array to a function (like when calling printf), it's not the array that is passed but only a pointer to the first element of the array. This also underlines that the called function can't know the array size.

This leads to the problem: How can printf know the number of characters to print when it receives a char array?

This is done by using a special character - the null character - as a sentinel which means "The string ends here". Therefore printf will continue printing characters until it sees the null character.

The downside to this is that you can call printf with a char array that does not contain a null character. If you do that, printf won't know when to stop printing characters, and sooner or later it will access memory outside the array bounds. Such cases have undefined behavior, i.e. the C standard doesn't describe what will happen. In C it's your responsibility to make sure this never happens.

On the other hand there are several benefits of this. For instance:

  • Array size doesn't have to match string length (+1), i.e. you can store a string with length 3 in an array with size 10.

  • You don't need memory for holding a "size" field. This also means that you don't need to update a size field at run time. Further, there is no upper limit for string length.

In short: the C-style string convention has benefits in terms of resources (memory, and in many cases performance), but it also allows you to do really bad things.

For fun it's often said that:

  • C makes it easy to shoot yourself in the foot.

Using char arrays without a null character element as strings is just one out of many examples of how you can "shoot yourself in the foot".

Support Ukraine

printf looks for the NUL character to determine where the string ends. In your case, it finds the \0 at the end of string2, which just happens to be next in memory (your stack contains string3, string2, string, in that order).

Imagine none of the strings had a \0 character - printf would read beyond the current function's stack frame, possibly inhaling other variables, strings, passwords, etc.

It could just crash, because the behavior is undefined. But if you're unlucky, strange things could happen that would be very hard to debug. At the very least, this is a vulnerability in your program.

andreee
  • Actually that's undefined behaviour already as reading beyond array bounds. – Aconcagua Jul 09 '21 at 06:44
  • So you're saying that `printf` just spits out characters until it finds a nul one? It doesn't care about how many elements were in the initial string or anything like that? – Holden Nelson Jul 09 '21 at 06:49
  • No it doesn't. You're merely passing a pointer, which is basically the starting address for `printf`. In simplified terms, from `printf`'s point of view what you see in memory then is `worldhello\0I am a string\0...someotherdata...`. It just reads until it finds the first `\0` character. In your case, you're lucky to have a NUL-terminated string just after your actual string, but in general there could be _anything_ (e.g. some password that you stored before). That's why you really need to be cautious here. – andreee Jul 09 '21 at 06:53
  • *'Most probably'* – a crash *can* be a consequence of UB, but there are *at least* as many cases where it won't happen and the program continues with unpredictable behaviour (the latter is far harder to detect, so a crash actually means luck for the developer...). – Aconcagua Jul 09 '21 at 06:53
  • @andreee reading from the stack beyond stack bounds is not a stack overflow. A stack overflow is when you fill your stack completely for example with a never ending recursion, or when your local variables take up more space than available on the stack. – Jabberwocky Jul 09 '21 at 07:12
  • @Jabberwocky: Thanks again for your comments, you both are right. I removed the sentence completely. – andreee Jul 09 '21 at 09:21

Why does it do that, only in the case that there's no null character at the end? What benefit does this have?

Other answers have explained the mechanics of what's going on. But the other thing you have to remember is that you broke the rules. And when you break the rules, just about anything can happen.

In this case the rule is, as you know, that a string in C is an array of characters terminated by \0. An array of characters not terminated by \0 is not a proper string. printf expects to be handed a proper string. (Well, actually, it expects to be handed a pointer to a proper string's first element.) If you pass printf an improper string, anything can happen.

What benefit does this have? To you, the user, pretty much none at all. This is undefined behavior, which you should not be depending on. (You can't depend on it, because you can't predict what it will do. It might do something completely different next week.)

Steve Summit