
In the various cases where a buffer is provided to one of the standard library's many string functions, is it guaranteed that the buffer will not be modified beyond the null terminator? For example:

char buffer[17] = "abcdefghijklmnop";
sscanf("123", "%16s", buffer);

Is buffer now required to equal "123\0efghijklmnop"?

Another example:

char buffer[10];
fgets(buffer, 10, fp);

If the read line is only 3 characters long, can one be certain that the 6th character is the same as before fgets was called?

Segmented
  • A better question would be "Why does it matter?" You shouldn't be relying on behind-the-scenes and unspecified behavior. If the behavior is needed, then an explicit function that is clearly intended to work that way should be created/used. – Dunk Feb 25 '15 at 19:36
  • How would `scanf` know that your buffer was more than 4 characters long? – user253751 Feb 25 '15 at 20:16
  • In the case above, the %16s tells sscanf that the caller guarantees the buffer can hold at least 16+1 characters. – caveman Feb 25 '15 at 20:23
  • @Dunk: There are very good reasons it matters. Relying on this behavior is perhaps the only way to deal safely with `'\0'` bytes in data read by `fgets`, which you might care to do. Without pre-filling the buffer (with `'\n'`s, IIRC) there's no way to distinguish an embedded `'\0'` from the terminator in a final line that doesn't end in a newline. – R.. GitHub STOP HELPING ICE Feb 26 '15 at 07:34
  • @R - As I said. I would write a function that works regardless of what the compiler writer or library implementer chose to do. I wouldn't rely on a specific implementation or obscure behavior that isn't even reasonable to expect. – Dunk Feb 28 '15 at 18:18
  • I'll also point out that I have made quite a good reputation for myself and probably owe a couple promotions to being assigned to "tiger teams" and fixing problems created by clever people relying on obscure behavior similar to things like this. – Dunk Feb 28 '15 at 18:50
  • @Dunk I would like to remind you the question is not asking about implementation specific behaviour, in fact I only chose to accept an answer backed by the _standard_. Knowing the ins and outs of the standard, and requesting clarification when the standard seems to leave out information is *not* unreasonable. – Segmented Feb 28 '15 at 22:59
  • @Segmented:The accepted answer is wrong even though it relies on the standard. Some compiler writers do modify the buffer in order to perform diagnostics such as detecting memory leaks. Even though the accepted answer took a quote from the standard, the quote leaves room for interpretation. Thus, even if fgets/sscanf don't change the buffer, you aren't guaranteed they won't. Regardless of what I just said and even if every compiler writer interpreted the standard in the exact same way, nobody should be relying on such an obscure nuance if the intent is to write robust and maintainable code. – Dunk Mar 05 '15 at 15:07
  • @Dunk provide references. I agree with hvd's points and think it is grounds to say that such an implementation would be non-conformant. – Segmented Mar 05 '15 at 19:35
  • As I said, it doesn't matter what the standard says because relying on implementation and non-obvious behavior (which this certainly is) is a really bad programming practice. As for my reference, just read 7.21.7.2 regarding fgets. While the spec can be interpreted to imply that fgets just fills in the buffer with the number of read bytes, it doesn't explicitly prohibit fgets from storing data in the unused parts of the buffer. Thus, OPEN TO INTERPRETATION. – Dunk Mar 06 '15 at 15:03

7 Answers

31

The C99 draft standard does not explicitly state what should happen in those cases, but by considering multiple variations, you can show that it must work a certain way so that it meets the specification in all cases.

The standard says:

%s - Matches a sequence of non-white-space characters.252)

If no l length modifier is present, the corresponding argument shall be a pointer to the initial element of a character array large enough to accept the sequence and a terminating null character, which will be added automatically.

Here's a pair of examples showing that it must work the way you propose in order to meet the standard.

Example A:

char buffer[4] = "abcd";
char buffer2[10];  // Note that this could be placed at what would be buffer+4
sscanf("123 4", "%s %s", buffer, buffer2);
// Result is buffer =  "123\0"
//           buffer2 = "4\0"

Example B:

char buffer[17] = "abcdefghijklmnop";
char* buffer2 = &buffer[4];
sscanf("123 4", "%s %s", buffer, buffer2);
// Result is buffer = "123\0" "4\0"   (bytes '1','2','3','\0','4','\0')

Note that sscanf's interface doesn't give the implementation enough information to tell these two cases apart. So, if Example B is to work properly, sscanf must not touch the bytes after the null character in Example A, because it must behave correctly in both cases according to this bit of the spec.

So, implicitly, it must work as you stated in order to satisfy the spec.

Similar arguments can be placed for other functions, but I think you can see the idea from this example.

NOTE: Providing size limits in the format, such as "%16s", could change the behavior. By the specification, it would be functionally acceptable for sscanf to zero out a buffer to its limits before writing the data into the buffer. In practice, most implementations opt for performance, which means they leave the remainder alone.

When the intent of the specification is this sort of zeroing out, it is usually stated explicitly. strncpy is an example: if the length of the source string is less than the maximum buffer length specified, it fills the rest of the space with null characters. The fact that this same "string" function can also produce a non-terminated string makes it one of the functions people most commonly replace with their own version.

As for fgets, a similar argument applies. The only gotcha is that the specification explicitly states that if nothing is read in, the buffer remains untouched. An implementation that wanted to zero out the buffer could still be functionally acceptable by first checking that there is at least one byte to read before doing so.

caveman
  • The more I think about it, wouldn't your argument break down if you provide size to the buffers, which is sort of what the question was all about... In your example if we provide sizes for the buffers to the format, they *must* be non-overlapping by the standard, no? And then we can no longer implicitly assume anything... – Segmented Feb 25 '15 at 07:23
  • @Segmented: `fgets` is explicitly documented as leaving the entire contents of the buffer unspecified (that is, from the provided start address for the provided number of bytes) after a read error. If there is no read error, then I believe that the part of the buffer after the inserted 0 byte will not be touched. – rici Feb 25 '15 at 07:30
  • sscanf() doesn't even know the buffer size – Ángel Feb 25 '15 at 17:40
  • @Ángel This is incorrect, you may provide size modifiers in the format string, as in "%32s". Perhaps you are confused on the term "buffer" which does not necessarily have to be the *input* buffer. – Segmented Feb 25 '15 at 18:03
  • @Segmented, I was talking about the size of the buffer, not the number of bytes it is allowed to write from the string. – Ángel Mar 02 '15 at 16:24
  • In C, no function "knows" the size of the buffers it is given. However, the caller can promise the function a minimum buffer size via parameters. For sscanf, it can be passed via the format string. The effect is the same as an explicit size parameter. – caveman Mar 02 '15 at 18:05
24

Each individual byte in the buffer is an object. Unless some part of the function description of sscanf or fgets mentions modifying those bytes, or even implies their values may change (e.g. by stating their values become unspecified), the general rule applies: (emphasis mine)

6.2.4 Storage durations of objects

2 [...] An object exists, has a constant address, and retains its last-stored value throughout its lifetime. [...]

It's this same principle that guarantees that

#include <stdio.h>
int a = 1;
int main() {
  printf ("%d\n", a);
  printf ("%d\n", a);
}

attempts to print 1 twice. Even though a is global and printf could conceivably access global variables, the description of printf doesn't mention modifying a, so a must retain its value.

Neither the description of fgets nor that of sscanf mentions modifying buffer bytes past the ones that were actually supposed to be written (except in the case of a read error), so those bytes don't get modified.

Community
  • Nothing precludes an implementation of fgets that clears the buffer before writing to it. It is a valid implementation from the specification point of view. 6.2.4 certainly doesn't say that fgets cannot do this. All it says is that if fgets changes it, it won't change again on its own. – caveman Feb 25 '15 at 19:37
  • @caveman The fact that nothing grants `fgets` permission to change any bytes other than those explicitly specified to be modified means that such an implementation of `fgets` would fail to conform. In the abstract machine, `fgets` does not change those bytes, and the value of those bytes does not become indeterminate, unspecified or undefined. Therefore, concrete implementations must maintain the behaviour of leaving those bytes at their last-stored values. If you disagree, then what do you think about the example in my answer? Is that allowed to print `"1\n2\n"`? If not, why not? –  Feb 25 '15 at 19:43
  • That's a good argument, and I agree. I think your comment makes a clearer argument than your answer itself. The main point that I was missing from your answer is that you are considering each byte an object. So the statement that they don't become indeterminate, unspecified, or undefined really clarifies what you are saying. – caveman Feb 25 '15 at 20:01
  • This is probably not all there is to it as far as the standard goes. What defines the last-stored value can be a long story, or else you wouldn't be able to have memory-mapped registers that can be changed by the hardware (which is a common technique on many embedded platforms). – Fizz Feb 25 '15 at 22:12
  • @RespawnedFluff The standard attaches a footnote to that saying "In the case of a volatile object, the last store need not be explicit in the program.". And that footnote is backed up by the normative text describing `volatile` (6.7.3p6). But in the OP's case, there are no volatile objects, so that isn't an issue. –  Feb 26 '15 at 06:38
8

The standard is somewhat ambiguous on this, but I think a reasonable reading of it is that the answer is yes: fgets is not allowed to write more bytes to the buffer than it read, plus the terminating null. On the other hand, a stricter reading/interpretation of the text could conclude that the answer is no, there's no guarantee. Here's what a publicly available draft says about fgets.

char *fgets(char * restrict s, int n, FILE * restrict stream);

The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array.

The fgets function returns s if successful. If end-of-file is encountered and no characters have been read into the array, the contents of the array remain unchanged and a null pointer is returned. If a read error occurs during the operation, the array contents are indeterminate and a null pointer is returned.

There's a guarantee about how much it is supposed to read from the input, i.e. it stops reading at a newline or EOF and reads no more than n-1 bytes. Nothing is said explicitly about how much it's allowed to write to the buffer, although the common understanding is that fgets's n parameter is there to prevent buffer overflows. It's a little strange that the standard uses the ambiguous term "read", which, if you want to nitpick the terminology, does not necessarily imply that fgets can't write more than n bytes to the buffer. But note that the same "read" terminology is used for both limits: the n limit and the EOF/newline limit. So if you interpret the n-related "read" as a buffer-write limit, then for consistency you can/should interpret the other "read" the same way, i.e. fgets does not write more than what it read when the line is shorter than the buffer.

On the other hand, if you distinguish between the phrasal verb "read into" (= "write") and plain "read", then you can't read the committee's text the same way. You are guaranteed that it won't "read into" (= "write to") the array more than n bytes, but if the input is terminated sooner by a newline or EOF, you're only guaranteed that the rest of the input won't be "read"; whether that implies it won't be "read into" (= "written to") the buffer is unclear under this stricter reading. The crucial keyword is "into", which is elided, so the problem is whether the completion I give in brackets in the following modified quote is the intended interpretation:

No additional characters are read [into the array] after a new-line character (which is retained) or after end-of-file.

Frankly, a single postcondition stated as a formula (which would be pretty short in this case) would have been a lot more helpful than the verbiage I quoted...

I can't be bothered to try to analyze their write-up of the *scanf family, because I suspect it's going to be even more complicated given all the other things that happen in those functions; their write-up of fscanf alone is about five pages long... But I suspect a similar logic applies.

Community
Fizz
  • It's not hard to imagine that with some kinds of OS, being allowed to overwrite information beyond the end of the first line [but within the indicated amount of space] could improve performance. If the buffer is 128 bytes and the time to read 128 bytes is less than twice the time to read one, and the time for a relative fseek is similar, then except for lines shorter than four characters, reading 128 bytes, scanning for a newline, and then seeking back if needed may be faster than reading bytes individually. – supercat Feb 25 '15 at 19:57
  • @supercat: Well, yes, you could imagine that, but then the common sense [in any specification] is that a function won't have (user-visible) side-effects that aren't specified in the standard. In the case of the C standard, the verbiage that guarantees this general, common-sense principle has been pointed out in [hvd's answer](http://stackoverflow.com/a/28716991/3588161). – Fizz Feb 25 '15 at 20:35
  • I agree that since the standard *didn't* authorize implementations to write past the end, an `fgets` shouldn't expect client code to tolerate such behavior. My point was that there would have been potential performance advantages to describing the buffer contents past the null byte as unspecified, had those writing the spec chosen to do so. – supercat Feb 25 '15 at 20:40
4

is it guaranteed that the buffer will not be modified beyond the null terminator?

No, there's no guarantee.

Is buffer now required to equal "123\0efghijklmnop"?

Yes. But that's only because you've used correct parameters for your string-related functions. Should you mess up the buffer length, the input modifiers to sscanf, and such, then your program will still compile, but it will most likely fail during runtime.

If the read line is only 3 characters long, can one be certain that the 6th character is the same as before fgets was called?

Yes. Once fgets() determines that the input is a 3-character string, it stores the input in the provided buffer and doesn't care about the rest of the provided space at all.

Igor S.K.
  • You contradict yourself; if there is no guarantee, then it follows that the two answers you provide after that should be "no". – Segmented Feb 25 '15 at 06:54
  • @Segmented : read my answer once again (with edits). Igor S.K. just explained reality (which sometimes differs from the standard) – VolAnd Feb 25 '15 at 06:58
  • @Segmented Well, "in the various cases" of "many string functions" "No" is quite a common answer. Imagine `strcpy(dst,src)`: `src` will be left unmodified at all, but writing to `dst` might cause, say, buffer overrun. Or it might be just partially modified, keeping everything after `dst[strlen(src)]` intact. And I also gave my answers for you specific examples. – Igor S.K. Feb 25 '15 at 11:12
1

Is buffer now required to equal "123\0efghijklmnop"?

Here buffer simply consists of the string "123", guaranteed to terminate at the NUL.

Yes, the memory allocated for the array buffer will not get de-allocated; however, you are making sure/restricting that your string buffer can hold at most 16 char elements that you can read into it at any point in time. It then depends on whether you write just a single char or the maximum the buffer can take.

For example:

char buffer[4096] = "abc";

actually does something like the following:

memcpy(buffer, "abc", sizeof("abc"));
memset(&buffer[sizeof("abc")], 0, sizeof(buffer)-sizeof("abc"));

The standard insists that if any part of a char array is initialized, the remaining elements are initialized to zero; that is all the array consists of, within its memory boundary, until something else is stored into it.

alk
Sunil Bojanapally
  • The memory beyond the terminator does not just vanish. It is there that I am curious what the standard states, not before it. – Segmented Feb 25 '15 at 06:29
  • Yes you can read into them at any time, this is why I'm curious what the standard dictates regarding this extra space beyond the terminator. – Segmented Feb 25 '15 at 06:59
0

There are no such guarantees from the standard, which is why the functions sscanf and fgets are recommended to be used with respect to the size of the buffer, as you show in your question (and why using fgets is considered preferable to gets).

However, some standard functions rely on the null terminator in their work, e.g. strlen (but I suppose you are asking about string modification).

EDIT:

In your example

fgets(buffer, 10, fp);

it is guaranteed that characters beyond the 10th are untouched (the existing content and length of buffer will not be examined by fgets)

EDIT2:

Moreover, when using fgets keep in mind that the '\n' will be stored in the buffer, e.g.

 "123\n\0fghijklmnop"

instead of expected

 "123\0efghijklmnop"
VolAnd
  • I'm not the down voter, but you do not seem to grasp the original question, and you do not quote the standard besides. – Segmented Feb 25 '15 at 06:45
  • The standard says nothing about keeping characters untouched in the memory that the second parameter of fgets allows to be used... so my answer is "no guarantees" – VolAnd Feb 25 '15 at 06:52
0

Depends on the function in use (and to a lesser degree its implementation). sscanf will start writing when it encounters its first non-whitespace character, and continue writing until its first whitespace character, where it will add a terminating 0 and return. But a function like strncpy (famously) zeroes out the rest of the buffer.

There is, however, nothing in the C standard which mandates how these functions behave.

Steve D