Is this fscanf behavior inconsistent?

Question

Typically fscanf, when scanning a non-integer using %d, will fail until the non-integer characters are explicitly removed from the input stream. Trying to scan a123 fails, until the a is removed from the input stream.

Trying to scan ------123 fails (fscanf returns 0) but the - is removed from the input stream.

Is this correct behavior for fscanf?

The file contains ----------123 and the result of this code:

#include <stdio.h>

int main(void) {
    int number = 0;
    int result = 0;
    FILE *pf = NULL;

    if (NULL != (pf = fopen("integer.txt", "r"))) {
        while (1) {
            if (1 == (result = fscanf(pf, "%d", &number))) {
                printf("%d\n", number);
            } else {
                if (EOF == result) {
                    break;
                }
                printf("result is %d\n", result);
            }
        }
        fclose(pf);
    }
    return 0;
}

is:

result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
-123

If the file contains a123 the result is an infinite loop.

That seems to me to be inconsistent behavior. No?

If you're trying to write your own `scanf` implementation, this is an "impossible" case. You've already read the first `-`, which might be the beginning of a negative integer. Then you read a second `-`, which isn't a digit, which means your scan has failed. You can push back the second `-`, but `ungetc` only guarantees one character of pushback. So it can be difficult or impossible to push back the first `-`. I suspect that's why you're seeing it being consumed. — Steve Summit, Jun 08 '22 at 18:24
@user3121023 I understand. The problem is that it's very hard for fscanf to do that. By the time it realizes it can't convert, it has read *two* characters, and would therefore need to return both of them to the input stream. But the pushback mechanisms used by stdio typically guarantee only one character of pushback (see [`man ungetc`](https://linux.die.net/man/3/ungetc)). — Steve Summit, Jun 08 '22 at 18:28
You can `ungetc()` any character; it does not have to be the one read previously. — Jonathan Leffler, Jun 08 '22 at 18:34
@SteveSummit: It's actually worse than that: `*scanf()` *must not use `ungetc()`*. That one byte of `ungetc()` is reserved to the user. There has to be a *second* byte of "internal `ungetc()`" for `*scanf()`, and the library implementation must not get them confused... ;-) — DevSolar, Jun 08 '22 at 18:53

DevSolar · Accepted Answer · 2022-06-09T07:03:31.920

The point here is not one of inconsistency, but one of the many limitations of the fscanf() family.

The standard is very specific on how fscanf() parses input. Characters are taken from input one by one, and checked against the format string. If they match, the next character is taken from input. If they don't match, the character is "put back", and the conversion fails.

But only that last character read is ever put back.

C11 7.21.6.2 The fscanf function, paragraph 9 (emphasis mine):

An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence.²⁸⁵⁾ The first character, if any, after the input item remains unread.

285) fscanf pushes back at most one input character onto the input stream. Therefore, some sequences that are acceptable to strtod, strtol, etc., are unacceptable to fscanf.

This one character of push-back has nothing to do with the one character of push-back that ungetc() guarantees -- it is independent and in addition to that. (A user could have fscanf() fail, then ungetc() a character, and expect the ungetc()'d character to come out of input, followed by the character pushed back by the failed fscanf(). A library function may not call ungetc(), which is reserved to the user.)

This makes implementing fscanf() somewhat easier, but it also means fscanf() can fail in the middle of certain character sequences without rewinding to where it began its conversion attempt.

In your case, "--123" read as "%d":

taking the first '-'. Sign. All is well, continue.
taking the second '-'. Matching error.
Put back the last '-' (the second one). Cannot put back the first '-' as per above.
Return 0 (conversion failed).
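To see what a particular C library actually does with the steps above, a small probe can help. This is only a sketch: `probe_failed_scan` is a hypothetical helper name, and it assumes `tmpfile()` is usable on the system. It attempts one `%d` conversion on the given text, then reports the return value, the stream offset after the failed scan, and the next unread character.

```c
#include <stdio.h>

/* Hypothetical helper: writes `text` to a temporary file, attempts a single
   "%d" conversion, and reports what is left in the stream afterwards.
   Returns the fscanf() result, or EOF if no temporary file could be created. */
static int probe_failed_scan(const char *text, int *next_char, long *offset)
{
    int number = 0;
    int result;
    FILE *fp = tmpfile();

    *next_char = EOF;
    *offset = -1L;
    if (fp == NULL) {
        return EOF;
    }

    fputs(text, fp);
    rewind(fp);

    result = fscanf(fp, "%d", &number);  /* expected to fail on "--123" */
    *offset = ftell(fp);                 /* characters consumed by the failed scan */
    *next_char = fgetc(fp);              /* first character still unread */

    fclose(fp);
    return result;
}
```

Calling `probe_failed_scan("--123", &next, &offset)` should return 0 on any implementation discussed here. The offset is the interesting part: 1 means the first `-` was consumed (the reading described in this answer), 0 means the library rolled back to the start.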

This is (one of) the reason(s) why you should not ever use *scanf() on potentially malformed input: The scan can fail without you knowing where exactly it failed, and without properly rolling back.

It's also a murky corner of the standard that was not actually implemented correctly in a number of mainstream library implementations last time I checked. (And not when I re-checked just now.) ;-)

Other reasons not to use fscanf() on potentially malformed input include, but are not limited to, numerical overflows handled not at all gracefully.

The intended use of fscanf() is to scan known well-formatted data, ideally data that has been written by that same program using fprintf(). It is not well-suited to parse user input.

Hence the usual recommendation is to read full lines of input with fgets(), then parse the line in-memory using strtol(), strtod() etc., which can and will handle things like the above in a well-defined way.
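A minimal sketch of that recommendation, assuming one integer per line. `parse_int_line` is a hypothetical helper, not a library function; it rejects empty input, trailing junk, and values that do not fit in an `int`.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: converts one text line to an int.
   Returns 1 on success, 0 if the line is not exactly one in-range integer. */
static int parse_int_line(const char *line, int *out)
{
    char *end = NULL;
    long value;

    errno = 0;
    value = strtol(line, &end, 10);

    if (end == line) {
        return 0;  /* no conversion performed, e.g. "a123" or "--123" */
    }
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX) {
        return 0;  /* overflow is reported instead of being undefined */
    }
    /* Allow trailing white space such as the newline kept by fgets(). */
    while (*end == ' ' || *end == '\t' || *end == '\n' || *end == '\r') {
        end++;
    }
    if (*end != '\0') {
        return 0;  /* trailing junk, e.g. "12x" */
    }

    *out = (int)value;
    return 1;
}
```

The caller would read each line with `fgets(buf, sizeof buf, pf)` and pass `buf` to the helper. A bad line is then consumed as a whole, so the loop cannot get stuck on it.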

"only that last character read is ever put back." is somewhat supported by spec's "fscanf pushes back at most one input character onto the input stream" footnote. Yet footnotes are only informative - not spec - it does not specify what happens with a 2nd, 3rd .... My GCC pushes back both `-`, then making for an infinite loop with OP's code. As I read C17 § 7.21.6.2, pushing back more than 1 character is allowed (as in `ungetc()`), or it may fail - perhaps making this UB or implementation specific behavior. Very good note about `fgets()`. — chux - Reinstate Monica, Jun 08 '22 at 21:08
@chux-ReinstateMonica: See updated answer; the full paragraph containing the footnote makes the expected behavior clear: The input item -- the longest (prefix of) a matching sequence -- is `'-'`. The first character after the input item -- the second `'-'` -- remains unread. (It gets a bit clearer with the case of `"0xz"` being read by `%i` or `%x`. The `"0x"` gets matched, the `'z'` does not match, "the right thing" would be to match only the `'0'` but the `'x'` cannot be put back due to the one-character limit, so the whole matching has to fail.) GLibC is taking some liberties here. — DevSolar, Jun 09 '22 at 05:57
If you have a look at [the Q/A I linked](https://stackoverflow.com/a/1447864/60281), this is not my own interpretation, this has been verified in conversation with Fred J. Tydeman, Vice-chair of PL22.11 (ANSI "C"). The point with existing implementations not adhering to this interpretation is that it's a pathological example of `fscanf()` abuse in the first place, with no good way to recover, so it does not *really* matter. — DevSolar, Jun 09 '22 at 07:02

score 2 · Answer 2 · answered Jun 08 '22 at 18:24


Is this correct behavior for fscanf?

Yes, it is. As pointed out by @stark in the comments, a leading - is accepted as part of the number when you use %d as the format specifier.

If you want to scan a positive integer (only digits) you can use a pattern in fscanf to discard all non digits.

fscanf(pf, "%*[^0-9]%d", &number)

answered Jun 08 '22 at 18:24

David Ranieri


`fscanf(pf, "%*[^0-9]%d", &number)` will fail to scan `"123"`, as there is nothing to match `"%*[^0-9]"`. Perhaps use `fscanf(pf, "%*[^0-9]"); fscanf(pf, "%d", &number);`? – chux - Reinstate Monica Jun 08 '22 at 21:11
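The two-call variant from the comment above could be wrapped like this. It is a sketch only, and `scan_after_nondigits` is a hypothetical name. The first call's result is ignored on purpose, because a `%[` directive that matches zero characters is a matching failure and would otherwise stop the scan before `%d` is tried.

```c
#include <stdio.h>

/* Hypothetical helper: discards any leading non-digit characters, then
   scans an unsigned run of digits with "%d".
   Returns the result of the second fscanf(): 1, 0, or EOF. */
static int scan_after_nondigits(FILE *fp, int *number)
{
    /* May match nothing at all (input starts with a digit); that is fine,
       so the return value is deliberately ignored. */
    (void)fscanf(fp, "%*[^0-9]");

    return fscanf(fp, "%d", number);
}
```

Note that this throws away every `-`, so `---123` and `-123` both scan as 123. That matches the "positive integer only" goal stated in the answer, but not the original `%d` semantics.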

chqrlie · Answer 3 · 2022-06-08T23:14:43.483


This behavior is specified:

Here are the relevant paragraphs from the C2x Standard:

7.21.6.2 The fscanf function

[...]

_⁷   A directive that is a conversion specification defines a set of matching input sequences, as described below for each specifier. A conversion specification is executed in the following steps:
_⁸   Input white-space characters are skipped, unless the specification includes a [, c, or n specifier.
_⁹   An input item is read from the stream, unless the specification includes an n specifier. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence.³¹⁰⁾ The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.
_¹⁰   Except in the case of a % specifier, the input item (or, in the case of a %n directive, the count of input characters) is converted to a type appropriate to the conversion specifier. If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure. Unless assignment suppression was indicated by a *, the result of the conversion is placed in the object pointed to by the first argument following the format argument that has not already received a conversion result. If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined.

310) fscanf pushes back at most one input character onto the input stream. Therefore, some sequences that are acceptable to strtod, strtol, etc., are unacceptable to fscanf.

In your example, the initial - is a prefix of a matching input sequence, and the next character, another -, does not match so it remains in the input stream. The input item, -, is not a matching sequence so you get a conversion failure and 0 is returned but the first - was consumed.

This behavior is observed on Linux with the GNU libc, but not on macOS with Apple Libc, where the initial dash is not consumed.
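For contrast with the footnote, `strtol` works on a string in memory and reports exactly how far it got, so it never has a push-back problem. A small sketch (`strtol_consumed` is a hypothetical helper name) that returns the number of characters `strtol` consumed:

```c
#include <stdlib.h>

/* Hypothetical helper: runs strtol() on `text` in the given base, stores the
   converted value, and returns how many characters were consumed. */
static size_t strtol_consumed(const char *text, int base, long *value)
{
    char *end = NULL;

    *value = strtol(text, &end, base);
    return (size_t)(end - text);  /* 0 means no conversion was performed */
}
```

For `"--123"` in base 10 no conversion is performed and nothing is consumed, so the caller still sees the whole input. For `"0xz"` in base 0 the standard's definition of the subject sequence gives just the `0`, leaving `"xz"` unparsed; this is the kind of input the footnote says `strtol` accepts and `fscanf` does not.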

edited Jun 08 '22 at 23:14

answered Jun 08 '22 at 18:44

chqrlie


"is specified: fscanf pushes back at most one input character" --> footnotes do not count as specs. – chux - Reinstate Monica Jun 08 '22 at 21:13
"0xx for a conversion of %i will cause a conversion failure" --> What does `int d; printf("%d\n", sscanf("0xx", "%i", &d));` print for you. I get 1. Same with `fscanf()`. – chux - Reinstate Monica Jun 08 '22 at 21:19

@chux-ReinstateMonica: I am going to remove this example, I get the same result as you do, but `int d; char buf[10]; int res = sscanf("0xx", "%i%s", &d, buf); printf("%d %d %s\n", res, d, buf);` gives me different output on macOS (`2,0,xx`) and linux (`2,0,x`). `fscanf()` does not consume a `-` from `--1` on macOS either. – chqrlie Jun 08 '22 at 23:12
@chux-ReinstateMonica: the specification (without the footnote) is unambiguous regarding the behavior on `--1` from a stream. The behavior for `0xx` is less obvious, but my Apple Libc seems non-conforming because it does not parse the input items correctly. – chqrlie Jun 08 '22 at 23:18

I have to say, anyone who writes code that depends on this behavior — that is, anyone who would be inconvenienced by Apple's nonconformance — is really, really asking for trouble... – Steve Summit Jun 08 '22 at 23:19

@SteveSummit: indeed, but problems do not arise from people purposely relying on corner case behavior, but more likely their code happens to work on one platform and not on the other and fishing for the corner case is a nightmare. – chqrlie Jun 08 '22 at 23:22

chqrlie, IMO, the C spec is _not_ precise enough concerning _push back_ beyond 1 and implementations vary out in the wild. @SteveSummit is on the right track - `fscanf()` is simply not the best tool: reading a line into a string and then parsing the string makes for the most robust solution. – chux - Reinstate Monica Jun 08 '22 at 23:58
@chqrlie Your "gives me different output on macOS and linux", and my experience, lead me to answer OP's "Is this fscanf behavior inconsistent?" with: No, the standard is not specific enough on how `*scanf()` parses input. Yet OP has already accepted and is unlikely to shift. – chux - Reinstate Monica Jun 09 '22 at 00:02
