1

I've seen several usages of fgets (for example, here) that go like this:

char buff[7]="";

(...)

fgets(buff, sizeof(buff), stdin);

The point is that, if I supply a long input like "aaaaaaaaaaa", fgets will truncate it to "aaaaaa", because the 7th byte of the array is used to store '\0'.

However, when doing this:

int i=0;
for (i=0;i<7;i++)
{
    buff[i]='a';
}
printf("%s\n",buff);

I will always get 7 'a's, and the program will not crash. But if I try to write 8 'a's, it will.

As I found out later, the reason for this is that, at least on my system, when I allocate char buff[7] (with or without =""), the 8th byte (counting from 1, not from 0) gets set to 0. My guess is that things are done this way precisely so that a for loop with 7 writes, followed by a formatted string read, can succeed whether or not the last character written was '\0', which would spare the programmer from setting the final '\0' when writing chars individually.

From this, it follows that in the case of

fgets(buff, sizeof(buff), stdin);

and then providing an input that is too long, the resulting buff string will automatically have two '\0' characters: one inside the array, and one right after it that was written by the system.

I have also observed that doing

fgets(buff,(sizeof(buff)+17),stdin);

will still work, and output a very long string, without crashing. From what I guessed, this is because fgets will keep writing until sizeof(buff)+17, and the last char to be written will precisely be a '\0', ensuring that any forthcoming string reading process would terminate properly (although the memory is messed up anyway).

But then, what about fgets(buff, (sizeof(buff)+1),stdin);? This would use up all the space that was rightfully allocated in buff, and then write a '\0' right after it, thus overwriting... the '\0' previously written by the system. In other words, yes, fgets would go out of bounds, but it can be proven that when adding only one to the length of the write, the program will never crash.

So in the end, here comes the question: why does fgets always terminate its write with a '\0', when another '\0', placed by the system right after the array, already exists? Why not do as in the one-by-one for-loop write, which can access the whole array and write anything the programmer wants, without endangering anything?

Thank you very much for your answer!

EDIT: indeed, there is no proof possible, as long as I do not know whether this 8th '\0' that mysteriously appears upon allocation of buff[7], is part of the C standard or not, specifically for string arrays. If not, then...it's just luck that it works :-)

MrBrody
  • Be careful thinking that because something doesn't crash it means it's actually correct; usually errors like this result in "undefined behavior". Sometimes you'll get a segfault, sometimes you won't. If you have buff[7], there's no guarantee that the 8th byte will be a \0, it could be anything. – PherricOxide Aug 28 '13 at 19:21
  • "I will always get 7 'a's, and the program will not crash" - That you expect it *should/could* crash at least suggests you understand *undefined behavior*. Regarding your question, because that is how [`fgets`](http://en.cppreference.com/w/c/io/fgets) is required to behave. If you have `char a;` and pass `&a` with some arbitrary size greater than 1, would you *expect* anything *definitive*? It's a C API, and like most, either useful or undefined in its behavior, depending on how *you* call it. – WhozCraig Aug 28 '13 at 19:22
  • I understand that any testing on my single machine will never prove anything. I was just thinking of the string viewed as an array, where you write anything you want without thinking about what the last cell would contain (like you would in an `int[]`), and then thought of as a string, i.e. as a word, with the omnipresent fear of the missing terminating character. Because of that, the standard may have included this 8th '\0' as a hard-set parameter. As I don't know the details of the standard...I was asking the question: is this eighth `'\0'` part of the C standard? – MrBrody Aug 28 '13 at 19:30
  • You said: _I will always get 7 'a's, and the program will not crash. But if I try to write 8 'a's, it will._ It is unfortunate that it doesn't crash when you write the 7 a's, but you're invoking undefined behaviour and a crash could easily occur. Off-by-one errors are insidious because they can lull you into a false sense of security. You'd probably find that there was an unused byte on the stack after the `buff[6]` (because the next variable needed to be aligned on an even boundary). But you can't rely on that... – Jonathan Leffler Aug 28 '13 at 19:50
  • @Jonathan Leffler: I understand what you say, however, I just observed that this 8th byte was not just unused, it was ALWAYS set to 0, while all bytes around could be anything, if not set properly before. I was wondering if this coincidence was part of the standard or not! – MrBrody Aug 28 '13 at 19:56
  • @MrBrody Since you are using a 7-byte array, it probably gets padded with zeros. Try with an 8-byte array and see if you are as [un]lucky. When you ask for `n` bytes you are guaranteed `n` bytes but you _might_ get more. – agbinfo Aug 28 '13 at 20:02
  • @agbinfo the interesting part is that when I just allocate (`char buff[7]`) without initializing, then buff[6] proved to be anything (32, 129...whatever was left there before), but buff[7] was surprisingly constant at being always 0. That was what surprised me! – MrBrody Aug 28 '13 at 20:12
  • The extra byte being zero was (bad) luck — the standard says nothing about that byte beyond you can take its address (but you may not dereference it without invoking undefined behaviour). – Jonathan Leffler Aug 28 '13 at 20:14
  • Understood! not part of the standard. – MrBrody Aug 28 '13 at 20:16

2 Answers

3

but it can be proven that when adding only one to the length of the write, the program will never crash.

No! You can't prove that! Not in the sense of a mathematical proof. You have only shown that on your system, with your compiler, with those particular compiler settings you used, with particular environment configuration, it might not crash. This is far from a mathematical proof!

In fact, although the C standard guarantees that you can take the address "one past the last element of an array", it also states that dereferencing that address (i.e. trying to read from or write to it) is undefined behaviour.

That means that an implementation can do anything in this case. It can even do what you expect with naive reasoning (i.e. work, but by sheer luck), but it may also crash, or it may format your HD (if you are very, very unlucky). This is especially true when writing system software (e.g. a device driver or a program running on the bare metal), i.e. when there is no OS to shield you from the nastiest consequences of writing bad code!

Edit: This should answer the question asked in a comment (C99 draft standard):

7.19.7.2 The fgets function

Synopsis

#include <stdio.h>
char *fgets(char * restrict s, int n,
    FILE * restrict stream);

Description

The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array.

Returns

The fgets function returns s if successful. If end-of-file is encountered and no characters have been read into the array, the contents of the array remain unchanged and a null pointer is returned. If a read error occurs during the operation, the array contents are indeterminate and a null pointer is returned.

Edit: Since it seems that the problem lies in a misunderstanding of what a string is, this is the relevant excerpt from the standard (emphasis mine):

7.1.1 Definitions of terms

A string is a contiguous sequence of characters terminated by and including the first null character. The term multibyte string is sometimes used instead to emphasize special processing given to multibyte characters contained in the string or to avoid confusion with a wide string. A pointer to a string is a pointer to its initial (lowest addressed) character. The length of a string is the number of bytes preceding the null character and the value of a string is the sequence of the values of the contained characters, in order.

  • Do you mean that writing the 7 characters with a for loop, and then reading the resulting string, will also be an undefined behaviour? I did not directly access any "illegal" address, doing this. I could perfectly write anything in `buff[6]`, and then read `buff`...or is this forbidden by the standard? PS: you're right about the proof. In order to prove anything, I need to know whether this 8th `'\0'`, specifically for string arrays, is part of the standard or not! – MrBrody Aug 28 '13 at 19:42
  • Ok, I made a confused comment. I was wondering whether the 8th `'\0'` I could see right after allocation (allocation!), was written because of compliance to the standard, or just some specificity of the compiler on that system! If there ALWAYS is an 8th `'\0'`, then this last `'\0'` written by fgets becomes unnecessary – MrBrody Aug 28 '13 at 19:53
  • It is not clear what you mean by `\0` is part of the standard. The standard defines as strings **only** null-terminated sequences of chars, i.e. sequences of chars whose last character is a null character (i.e. a character with numeric code 0). If you have an array of characters this *might* contain a string. It does if there is a null char somewhere *in it*. You cannot take into account the character-past-the-end, which *does not exist* conceptually. – LorenzoDonati4Ukraine-OnStrike Aug 28 '13 at 19:54
  • To be more specific: if you define `char ca[5]`, then `ca` contains a string if any of its elements is a `'\0'`. More precisely, the string begins at `ca[0]` and ends with the (first) element holding `'\0'`, which acts as a string terminator. This null char must be *in the array*. There is no `ca[6]`. If you write such an expression you invoke undefined behaviour. You may have a memory location which might be accessed by the expression `ca[6]`, but this is illegal. C doesn't check if something with UB is illegal. It is *by definition*. Whatever happens in this case is *unpredictable*. – LorenzoDonati4Ukraine-OnStrike Aug 28 '13 at 20:01
  • your two last comments answer my question. I did not know about this distinction between char arrays and strings. So this ca[6] set as 0 is just a convenience of the compiler, that would thus "save" programs from crashing, whether for good or bad. Thanks! – MrBrody Aug 28 '13 at 20:09
  • No, you can't do that inference. You cannot say *why* ca[6] happens to be 0. It doesn't need to have been put there by the compiler (maybe it's the runtime or the OS). There are dozens of reasons why you observe this apparently deterministic behavior. You could tell only by reading your compiler docs (if this is documented behavior) or worse you could only discover it by analyzing your compiler source code (assuming it is a behavior due to the compiler). – LorenzoDonati4Ukraine-OnStrike Aug 28 '13 at 20:19
  • you're right. I was just guessing. I agree with the fact it's very difficult to tell the actual reason of this "apparently deterministic" behaviour! It was just out of curiosity, I won't need to go that far and know the details – MrBrody Aug 28 '13 at 20:22
  • BTW, if a program crashes in the presence of undefined behavior this is a good thing (since UB means the program has bugs). A sane compiler would never make an erroneous program avoid crashing on purpose. Forcing a program to crash on UB is good, but it may not be feasible in all cases and often precludes some optimizations. Many attack vectors of malware exploit programs that *don't* crash when forced to invoke UB! – LorenzoDonati4Ukraine-OnStrike Aug 28 '13 at 20:23
  • That's why I was surprised I could do `fgets(buff, 17, stdin);`, and still have no crash. I use Code::Blocks on windows, and I do see some "Mingw_Nothrow" in the prototypes of all standard library functions. Is this related? – MrBrody Aug 28 '13 at 20:26
2

From C11 standard draft:

The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array.

The fgets function returns s if successful. If end-of-file is encountered and no characters have been read into the array, the contents of the array remain unchanged and a null pointer is returned. If a read error occurs during the operation, the array contents are indeterminate and a null pointer is returned.

The behaviour you describe is undefined.

jev
  • This I know. I was just wondering whether the 8th `'\0'` I see even before I write anything to the string (i.e., upon allocation), is merely accidental, or complying to another paragraph of the standard, stating for example that the additional byte right after a `char[]` array, will always be written as `'\0'` upon allocation of the array... – MrBrody Aug 28 '13 at 19:50
  • If you are lucky the array is padded so that there might be a byte or two at the end you might access -- but you should not rely on this (undefined behaviour). – jev Aug 28 '13 at 19:58
  • Lorenzo Donati told me why (difference between char arrays and strings). Understood, it is just luck from the compiler, or whatever else! – MrBrody Aug 28 '13 at 20:14