
Note: The original version of my question was compiler-agnostic and assumed that GCC (which I used to experiment) behaves entirely correctly and that a non-empty prefix of a matching input sequence doesn't lead to a matching failure or input failure. It turns out (see: C17 draft, 7.21.6.2 ¶10) that the answer is more likely to be found in compiler/library bugs than in the intricacies of the definition and proper treatment of prefixes to a match. However, in order to preserve the original spirit of the question, I have edited it only conservatively (therefore, the original assumption still shines through in the latter half of this post's body).

With this in mind, an aspect of the issues spanned by this post is still unresolved, namely: whether in the %4c example (at the bottom) it is proper for CD to be written into q[].


According to the standard (C17 draft, 6.4.4.2 ¶1), 2E0 (2.0) and .5 (0.5) are valid floating constants, while 2E and . are not.

Yet, with GCC, scanf parses 2E as 2.0, but it doesn't parse . as anything:

#include <stdio.h>

int main(void) {
    float fl;
    char c;

    printf("Please enter a floating-point number: ");
    if (scanf("%f", &fl) == 1)
        printf("<%.2f>\n", fl);
    if (scanf("%c", &c) == 1)
        printf("[%c]\n", c);

    return 0;
}

Intended usage:

Please enter a floating-point number: 123.4qrst
<123.40>
[q]

Here, q is used as a dummy character, to demonstrate how much of the input buffer the previous call to scanf consumed. Entering only a floating-point number will cause c to contain a newline character:

Please enter a floating-point number: 123.4
<123.40>
[
]

Let's try to parse 2E and . as floats:

With GCC (12.2.0, MinGW on Windows), the above code produces (gcc -std=c17 -pedantic -Wall -Wextra):

Please enter a floating-point number: 2Eq
<2.00>
[q]
Please enter a floating-point number: .q
[q]

With MSVC (19.35.32217.1), I get (cl /std:c17 /Wall):

Please enter a floating-point number: 2Eq
[q]
Please enter a floating-point number: .q
[q]

(Let's ignore the fact that it's not clear what floating-point number a string . "should" represent: 0 or 1.)


Let's try to make sense of this. Relevant here seems to be the following clause from the standard (C17 draft, 7.21.6.2 ¶9):

An input item is read from the stream, unless the specification includes an n specifier. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence.291) The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.

291)fscanf pushes back at most one input character onto the input stream. Therefore, some sequences that are acceptable to strtod, strtol, etc., are unacceptable to fscanf.

As far as I can tell, the standard's most relevant clause about strtod and friends (here: strtof) in relation to my question is 7.22.1.3 ¶3 (not reproduced here), from which it follows that 2E0 (2.0) and .5 (0.5) are valid "subject sequences" (in the sense of 7.22.1.3 ¶2), while 2E and . are not.

If I understand clause 7.21.6.2 ¶9 correctly, only 0-length input items amount to matching failures or input failures. Because each of 2E and . is a valid prefix of a matching input sequence (albeit not a full matching input sequence), neither is a matching failure or input failure.

Hence we can ask: Why does scanf parse 2E as a float, but not . (on GCC)?

  • This might be related to subtleties surrounding the definition of a "prefix of a matching input sequence".
  • It is also possible that details around clause 7.22.1.3 ¶3 (regarding strtod/strtof/strtold) are relevant, as they relate to the pushback limit of 1 of scanf and friends.

I believe that the following code might illustrate the notion of a "prefix of a matching input sequence":

#include <stdio.h>

int main(void) {
    char p[5] = "pppp", q[5] = "qqqq";
    int i;

    i = sscanf("ABCD", "%2c%4c", p, q);
    printf("<%s>\n", p);  /* <ABpp> */
    printf("<%s>\n", q);  /* <CDqq> */
    printf("%d\n", i);  /* 2 (GCC), 1 (MSVC) */

    return 0;
}

(Note that oddly GCC and MSVC give different results for the number of items assigned, even though both write AB and CD onto p[] and q[], resp.)

Here, even though exactly 4 characters are required for a match for %4c / q (C17 draft, 7.21.6.2 ¶12, item c; irrelevant footnote about multibyte characters not reproduced here)

Matches a sequence of characters of exactly the number specified by the field width (1 if no field width is present in the directive).[fn]

CD is a valid "prefix of a matching input sequence", and therefore this code doesn't result in a matching failure or input failure. (Given that assignments can be shorter than the given field width, I find it confusing that the standard uses the word "exactly".)

Or: If GCC or potentially MSVC don't behave correctly, what should the output be, here and for the %f example above?


I found 2 similar questions (listed here in no particular order):

(I believe that this question of mine has broader coverage with the . and %Nc examples.)

Lover of Structure
  • That's standard behavior of `scanf`: It reads as many characters as it can that match the format specifier. So the input `2E` will be parsed as `2.0` for `%f` or `%lf`. Or `2` for any decimal integer format. Or `46` if read using `%x`. – Some programmer dude Aug 12 '23 at 10:32
  • @Someprogrammerdude `.` is a prefix of `.5` (and many other *floating constants*). – Lover of Structure Aug 12 '23 at 10:38
  • Yes, `.5` is a valid floating point number, but `.q` is not. Because of that, `scanf` will read the `.`, then notice that the next character isn't valid and fail since `.` is not a valid number, leaving `q` in the input buffer. – Some programmer dude Aug 12 '23 at 10:40
  • @Someprogrammerdude `2E` is not a valid *floating constant* either. `q` in my examples is just a dummy character, used to demonstrate how much of the input buffer the previous call to `scanf` consumed. – Lover of Structure Aug 12 '23 at 10:42
  • You show a terminal display of “Please enter a floating-point number: .q” / “[q]” but do not say whether that is expected behavior or observed behavior. Which? Or is it a typo? When I run the code in the question and type input of “.q”, the line output after that is “[.]”, not “[q]”. – Eric Postpischil Aug 12 '23 at 10:46
  • @EricPostpischil Thanks; it's observed behavior on my system; see my edit just now. – Lover of Structure Aug 12 '23 at 10:52
  • Again, `scanf` is *greedy*: it will read as much as possible to match the format specified. Therefore, for the input `2E` it will read the `2` and accept it as a valid floating point value. Then it will see the `E`, see that `E` is not a valid part of a floating point value, and stop parsing the input. The `scanf` function has consumed the valid floating point value `2`, and left the invalid `E` in the input buffer. That's how `scanf` works. That's how it has always worked. And that's how it's *specified* to work. – Some programmer dude Aug 12 '23 at 14:33
  • @Someprogrammerdude The `E` is not left in the input buffer; it is consumed, like the `.`, for both GCC (glibc) and MSVC. The issue is more subtle than appears at first sight. Under your interpretation, the question then is why `E` is not fed back into the input buffer. Maybe `scanf` can't put it back because it has to read the next character to decide (and it has only a memory of size 1). Either way, I don't think these things are so obvious, plus the differences between GCC/glibc and MSVC are an issue. But feel free to write an answer about this. – Lover of Structure Aug 12 '23 at 14:48
  • @Someprogrammerdude (That said, your conceptualization of some sort of greedy algorithm is reasonable on the intuitive level.) – Lover of Structure Aug 12 '23 at 15:10
  • This is a known bug in glibc: [bug 12701](https://sourceware.org/bugzilla/show_bug.cgi?id=12701). With the glibc behavior, it may be needed to read ahead an unbounded number of characters in order to determine the longest matching sequence, and this unbounded number of characters would need to be pushed back; so, this behavior is unacceptable. See the [issue with nan strings](https://sourceware.org/bugzilla/show_bug.cgi?id=30647). – vinc17 Aug 16 '23 at 13:21
  • @vinc17 Thanks a lot. Should I branch my `%4c` example off into a separate question (which you can then answer)? – Lover of Structure Aug 16 '23 at 13:33
  • @LoverofStructure Yes, if you want (this `%4c` case, together with nan strings, would explain the behavior required by the standard). – vinc17 Aug 16 '23 at 13:43

1 Answer


If I understand clause 7.21.6.2 ¶9 correctly, only 0-length input items amount to matching failures or input failures.

7.21.6.2 ¶9 is not the only paragraph that specifies matching failures. Paragraph 10 says:

… If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure…

“.” is a prefix of a matching input sequence, so it is scanned (consumed, removed from the stream), but it is not a matching input sequence, so there is a matching failure.

printf("%d\n", i); /* 2 (GCC), 1 (MSVC) */

The MSVC result conforms to the C standard. The GCC result (due to glibc, not GCC) does not. For %4c, a matching sequence is “exactly the number [of characters] specified by the field width” (C 2018 7.21.6.2 ¶12). Therefore “CD” is not a matching sequence. It is, however, a prefix of a matching sequence, so it should be consumed, and scanf should process it as a matching failure. The prior %2c matched and the %4c did not, so there is one completed assignment of input items, and the return value should be one.

Eric Postpischil
  • I just noticed that MSVC behaves differently. I edited the question further a bit (maybe it's a GCC issue, but I'm really not sure). – Lover of Structure Aug 12 '23 at 11:02
  • @LoverofStructure: It is a GNU C Library (glibc) problem, not GCC. Per [this bug report](https://sourceware.org/bugzilla/show_bug.cgi?id=1765), it was decided not to conform to the C standard. Consuming the “.”, so that the next character in the stream is “q”, and reporting a matching failure conforms to the standard. Not reporting a matching failure or leaving the “.” in the stream does not. – Eric Postpischil Aug 12 '23 at 11:04
  • Ah -- I did see that bug report briefly (linked to from one of the 2 questions listed at the bottom of my post), but now I know why I got hung up on the "prefix of a matching input sequence" idea: the `%Nc` case (my example at the bottom). What is the expected output there? That code putting `CD` into `q` made me think that prefixes don't amount to matching failures. – Lover of Structure Aug 12 '23 at 11:07
  • @LoverofStructure: I think the `%4c` case does not conform to the standard. – Eric Postpischil Aug 12 '23 at 11:07
  • If `%4c` leads to a *matching failure*, is it still legal for MSVC to write `CD` into `q[]`? – Lover of Structure Aug 12 '23 at 11:43
  • I think `scanf` for `%f` is supposed to consume the same sequence that `strtod` does, but in this case it does not. When `"2e"` is passed to `strtod`, only the `2` is consumed. Seems like a `scanf` bug. – Tom Karzes Aug 12 '23 at 11:58
  • @TomKarzes You mean `strtof`? – Lover of Structure Aug 12 '23 at 12:01
  • @LoverofStructure Well, `strtod` and `strtof` accept the same syntax. But yeah, the same applies to `strtof`. The C standard mentions `strtod`, not `strtof`, which is why that's what I tested. From the C standard description of `fscanf`: *a,e,f,g Matches an optionally signed floating-point number, infinity, or NaN, whose format is the same as expected for the subject sequence of the strtod function. The corresponding argument shall be a pointer to floating.* – Tom Karzes Aug 12 '23 at 12:14