15

Note: I completely reworked the question to more properly reflect what I am setting the bounty for. Please excuse any inconsistencies with already-given answers this might have created. I did not want to create a new question, as previous answers to this one might be helpful.


I am working on implementing a C standard library, and am confused about one specific corner of the standard.

The standard defines the number formats accepted by the scanf function family (%d, %i, %u, %o, %x) in terms of the definitions for strtol, strtoul, and strtod.

The standard also says that fscanf() will only put back a maximum of one character into the input stream, and that therefore some sequences accepted by strtol, strtoul and strtod are unacceptable to fscanf (ISO/IEC 9899:1999, footnote 251).

I tried to find some values that would exhibit such differences. It turns out that the hexadecimal prefix "0x", followed by a character that is not a hexadecimal digit, is one such case where the two function families differ.

Funnily enough, no two of the C libraries I tested seem to agree on the output. (See the test program and example output at the end of this question.)

What I would like to hear is what would be considered standard-compliant behaviour in parsing "0xz", ideally citing the relevant parts of the standard to make the point.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int main( void )
{
    int i, count, rc;
    unsigned u;
    char * endptr = NULL;
    char culprit[] = "0xz";

    /* File I/O to assert fscanf == sscanf */
    FILE * fh = fopen( "testfile", "w+" );
    assert( fh != NULL );
    fprintf( fh, "%s", culprit );
    rewind( fh );

    /* fscanf base 16 */
    u = -1; count = -1;
    rc = fscanf( fh, "%x%n", &u, &count );
    /* u is cast to int so that the -1 sentinel prints as -1 */
    printf( "fscanf:  Returned %d, result %2d, consumed %d\n", rc, (int)u, count );
    rewind( fh );

    /* strtoul base 16 */
    u = strtoul( culprit, &endptr, 16 );
    printf( "strtoul:             result %2d, consumed %d\n", (int)u, (int)( endptr - culprit ) );

    puts( "" );

    /* fscanf base 0 */
    i = -1; count = -1;
    rc = fscanf( fh, "%i%n", &i, &count );
    printf( "fscanf:  Returned %d, result %2d, consumed %d\n", rc, i, count );
    rewind( fh );

    /* strtol base 0 (the "strtoul:" label is kept to match the recorded output) */
    i = strtol( culprit, &endptr, 0 );
    printf( "strtoul:             result %2d, consumed %d\n", i, (int)( endptr - culprit ) );

    fclose( fh );
    return 0;
}

/* newlib 1.14

fscanf:  Returned 1, result  0, consumed 1
strtoul:             result  0, consumed 0

fscanf:  Returned 1, result  0, consumed 1
strtoul:             result  0, consumed 0
*/

/* glibc-2.8

fscanf:  Returned 1, result  0, consumed 2
strtoul:             result  0, consumed 1

fscanf:  Returned 1, result  0, consumed 2
strtoul:             result  0, consumed 1
*/

/* Microsoft MSVC

fscanf:  Returned 0, result -1, consumed -1
strtoul:             result  0, consumed 0

fscanf:  Returned 0, result  0, consumed -1
strtoul:             result  0, consumed 0
*/

/* IBM AIX

fscanf:  Returned 0, result -1, consumed -1
strtoul:             result  0, consumed 1

fscanf:  Returned 0, result  0, consumed -1
strtoul:             result  0, consumed 1
*/
DevSolar
  • Note that the `strto*` functions have defined behaviour when the subject string generates a value that is too large for the appropriate type. However, with `scanf()`, the behaviour on receipt of a value that is too large is undefined. Thus, inputting `12345678901234567890` to `strtol()` will yield an error indication (assuming `sizeof(long) <= 8`), but anything could happen with `scanf()` et al. – Jonathan Leffler Jul 19 '17 at 04:39
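The defined out-of-range behaviour of `strtol()` mentioned in this comment can be checked with a few lines. This is only a sketch; `overflows_long` is a hypothetical helper name, not a library function, and the example value assumes `long` has at most 64 bits:

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if the decimal number in s does not fit
   into a long, as reported by strtol() via errno and LONG_MAX / LONG_MIN. */
int overflows_long( const char * s )
{
    long value;

    errno = 0;   /* strtol() never resets errno itself */
    value = strtol( s, NULL, 10 );
    return errno == ERANGE && ( value == LONG_MAX || value == LONG_MIN );
}
```

Calling `overflows_long( "12345678901234567890" )` reports the overflow in a well-defined way. No comparable check exists for `scanf()`, where the behaviour for such input is undefined.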

8 Answers

9

Communication with Fred J. Tydeman, Vice-chair of PL22.11 (ANSI "C"), on comp.std.c shed some light on this:

fscanf

An input item is defined as the longest sequence of input characters [...] which is, or is a prefix of, a matching input sequence. (7.19.6.2 P9)

This makes "0x" the longest sequence that is a prefix of a matching input sequence. (Even with %i conversion, as the hex "0x" is a longer sequence than the decimal "0".)

The first character, if any, after the input item remains unread. (7.19.6.2 P9)

This makes fscanf read the "z", and put it back as not-matching (honoring the one-character pushback limit of footnote 251).

If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure. (7.19.6.2 P10)

This makes "0x" fail to match, i.e. fscanf should assign no value, return zero (if the %x or %i was the first conversion specifier), and leave "z" as the first unread character in the input stream.
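The three quoted rules can be sketched in code. What follows is only an illustration under simplifying assumptions (no leading white space, no sign, no field width, no overflow handling, end-of-file treated like a matching failure), and `scan_hex_item` is a hypothetical name, not part of any library:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one %x input item from fh along the lines of
   7.19.6.2 P9 / P10. Returns 1 and stores the value on success, 0 if the
   input item is not a matching sequence. */
int scan_hex_item( FILE * fh, unsigned long * value )
{
    char item[ 64 ];
    size_t len = 0;
    size_t digits = 0;
    int c = getc( fh );

    /* Once "0x" has been read it belongs to the input item, whether or
       not a hexadecimal digit follows. */
    if ( c == '0' )
    {
        item[ len++ ] = (char)c;
        ++digits;
        c = getc( fh );
        if ( c == 'x' || c == 'X' )
        {
            item[ len++ ] = (char)c;
            digits = 0;   /* the "0" was part of the prefix, not a digit */
            c = getc( fh );
        }
    }

    while ( c != EOF && isxdigit( c ) && len < sizeof( item ) - 1 )
    {
        item[ len++ ] = (char)c;
        ++digits;
        c = getc( fh );
    }

    /* Only the first character after the input item is pushed back. */
    if ( c != EOF ) ungetc( c, fh );

    if ( digits == 0 ) return 0;   /* e.g. "0x" alone: matching failure */

    item[ len ] = '\0';
    *value = strtoul( item, NULL, 16 );
    return 1;
}
```

With a stream containing "0xz", this sketch returns 0, stores nothing, and the next getc() yields 'z'; the "0x" has been consumed and is not restored.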

strtol

The definition of strtol (and strtoul) differs in one crucial point:

The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. (7.20.1.4 P4, emphasis mine)

Which means that strtol should look for the longest valid sequence, in this case the "0". It should point endptr to the "x", and return zero as result.
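Because strtol and strtoul see the whole string, finding the longest sequence "of the expected form" is straightforward. A minimal sketch for base 16 only, with `hex_end_offset` as a hypothetical helper name and value computation and range checks left out:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns the offset from s at which endptr would
   point after strtoul( s, &endptr, 16 ), following 7.20.1.4 P4.
   Returns 0 if the subject sequence is empty. */
size_t hex_end_offset( const char * s )
{
    const char * p = s;
    const char * digits;

    /* Skip leading white space, then an optional sign. */
    while ( isspace( (unsigned char)*p ) ) ++p;
    if ( *p == '+' || *p == '-' ) ++p;

    /* The "0x" prefix only counts if a hexadecimal digit follows it. */
    if ( p[0] == '0' && ( p[1] == 'x' || p[1] == 'X' ) && isxdigit( (unsigned char)p[2] ) )
        p += 2;

    digits = p;
    while ( isxdigit( (unsigned char)*p ) ) ++p;

    /* No digits at all: no conversion, endptr would equal s. */
    return ( p == digits ) ? 0 : (size_t)( p - s );
}
```

For "0xz" this returns 1 (endptr would point to the "x"); for "0x1fz" it returns 4.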

DevSolar
2

I don't believe the parsing is allowed to produce different results. The Plauger reference is just pointing out that the strtol() implementation might be a different, more efficient version as it has complete access to the entire string.

caf
  • I agree; the `scanf()` and `strto*()` family of functions must produce the same results; the problem is that whereas `sscanf()` actually can employ `strto*()`, `fscanf()` can't for the reasons you gave – Christoph Sep 15 '09 at 11:22
  • @DevSolar: the standard says that `scanf()` accepts the same format as `strto*()`, so if they don't agree, it's a bug – Christoph Sep 15 '09 at 14:30
  • After some thinking about it, I agree. – DevSolar Sep 15 '09 at 17:45
  • ...and after yet some more thinking, and testing, and talking, it becomes obvious that the results of the two function families *indeed* differ in certain situations... :-\ – DevSolar Sep 19 '09 at 11:16
2

According to the C99 spec, the scanf() family of functions parses integers the same way as the strto*() family of functions. For example, for the conversion specifier x this reads:

Matches an optionally signed hexadecimal integer, whose format is the same as expected for the subject sequence of the strtoul function with the value 16 for the base argument.

So if sscanf() and strtoul() give different results, the libc implementation doesn't conform.

What the expected results of your sample code should be is a bit unclear, though:

strtoul() accepts an optional prefix of 0x or 0X if base is 16, and the spec reads

The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form.

For the string "0xz", in my opinion the longest initial subsequence of expected form is "0", so the value should be 0 and the endptr argument should be set to point to the "x".

mingw-gcc 4.4.0 disagrees and fails to parse the string with both strtoul() and sscanf(). The reasoning could be that the longest initial subsequence of expected form is "0x" - which is not a valid integer literal, so no parsing is done.

I think this interpretation of the standard is wrong: A subsequence of expected form should always yield a valid integer value (if out of range, the MIN/MAX values are returned and errno is set to ERANGE).

cygwin-gcc 3.4.4 (which uses newlib as far as I know) will also not parse the literal if strtoul() is used, but parses the string according to my interpretation of the standard with sscanf().

Beware that my interpretation of the standard runs into your initial problem, i.e. that the standard only guarantees that ungetc() works once. To decide whether the 0x is part of the literal, you have to read ahead two characters: the x and the character following it. If that character is not a hexadecimal digit, both have to be pushed back. If there are more tokens to parse, you can buffer them and work around this problem, but if it's the last token, you have to ungetc() both characters.

I'm not really sure what fscanf() should do if ungetc() fails. Maybe just set the stream's error indicator?

Community
  • 1
  • 1
Christoph
  • 1
    @DevSolar: it would be interesting to know what the sun compiler does, because it claims to be fully compliant: http://developers.sun.com/sunstudio/documentation/ss12u1/mr/READMEs/c.html#about – Christoph Sep 15 '09 at 15:39
2

To summarize what should happen according to the standard when parsing numbers:

  • if fscanf() succeeds, the result must be identical to the one obtained via strto*()
  • in contrast to strto*(), fscanf() fails if

    the longest sequence of input characters [...] which is, or is a prefix of, a matching input sequence

    according to the definition of fscanf() is not

    the longest initial subsequence [...] that is of the expected form

    according to the definition of strto*()

This is somewhat ugly, but a necessary consequence of the requirement that fscanf() should be greedy, but can't push back more than one character.

Some library implementers opted for differing behaviour. In my opinion

  • letting strto*() fail to make results consistent is stupid (bad mingw)
  • pushing back more than one character so fscanf() accepts all values accepted by strto*() violates the standard, but is justified (hurray for newlib if they didn't botch strto*() :()
  • not pushing back the non-matching characters but still only parsing the ones of 'expected form' seems dubious as characters vanish into thin air (bad glibc)
Christoph
  • `fscanf()` pushing back more than one character does *not* violate the standard - the "one character" limitation applies to user code, it does not apply to the implementation of the standard library itself. – caf Sep 19 '09 at 13:22
  • 1
    @caf: footnote 251) explicitly mentions this: "`fscanf` pushes back at most one input character onto the input stream. Therefore, some sequences that are acceptable to `strtod`, `strtol`, etc., are unacceptable to `fscanf`" – Christoph Sep 19 '09 at 14:45
  • @caf Also note that the character pushed back by standard functions, and the character pushed back by `ungetc()`, are *different characters*. Your library implementation needs to allow for a previous read attempt having pushed back a character *and* the user having pushed back a character, on the next read or position query. Also, an implementation is at liberty to support more than one character of user pushback, but `fscanf()` may only ever push back that one character -- or your implementation would be non-conforming. – DevSolar May 17 '23 at 11:26
0

I am not sure I understand the question, but for one thing scanf() is supposed to handle EOF. scanf() and strtol() are different kinds of beasts. Maybe you should compare strtol() and sscanf() instead?

Jakob Eriksson
0

Answer obsolete after rewrite of question. Some interesting links in the comments though.


If in doubt, write a test. -- proverb

After testing all combinations of conversion specifiers and input variations I could think of, I can say that it is correct that the two function families do not give identical results. (At least in glibc, which is what I have available for testing.)

The difference appears when three circumstances meet:

  1. You use "%i" or "%x" (allowing hexadecimal input).
  2. Input contains the (optional) "0x" hexadecimal prefix.
  3. There is no valid hexadecimal digit following the hexadecimal prefix.
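Conditions 2 and 3 can be expressed as a small predicate on the input. This is only a sketch: `is_dangling_hex_prefix` is a hypothetical name, and leading white space and signs are not handled:

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if s starts with the hexadecimal prefix
   "0x" or "0X" and the character after it is not a hexadecimal digit. */
int is_dangling_hex_prefix( const char * s )
{
    return s[0] == '0'
        && ( s[1] == 'x' || s[1] == 'X' )
        && ! isxdigit( (unsigned char)s[2] );
}
```

Inputs for which this returns 1, such as "0xz" or a bare "0x", are the ones where "%i" or "%x" may disagree with strtol() / strtoul().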

Example code:

#include <stdio.h>
#include <stdlib.h>

int main( void )
{
    const char * string = "0xz";
    /* Sentinels, so a failed conversion does not leave these uninitialized */
    unsigned u = -1;
    int count = -1;
    char c = '?';
    char * endptr;

    sscanf( string, "%x%n%c", &u, &count, &c );
    printf( "Value: %d - Consumed: %d - Next char: %c - (sscanf())\n", (int)u, count, c );
    u = strtoul( string, &endptr, 16 );
    printf( "Value: %d - Consumed: %td - Next char: %c - (strtoul())\n", (int)u, ( endptr - string ), *endptr );
    return 0;
}

Output:

Value: 0 - Consumed: 1 - Next char: x - (sscanf())
Value: 0 - Consumed: 0 - Next char: 0 - (strtoul())

This confuses me. Obviously sscanf() does not bail out at the 'x', or it wouldn't be able to parse any "0x" prefixed hexadecimals. So it has read the 'z' and found it non-matching. But it decides to use only the leading "0" as value. That would mean pushing the 'z' and the 'x' back. (Yes I know that sscanf(), which I used here for easy testing, does not operate on a stream, but I strongly assume they made all ...scanf() functions behave identically for consistency.)

So... one-char ungetc() doesn't really seem to be the reason, here... ?:-/

Yes, results differ. I still cannot explain it properly, though... :-(

DevSolar
  • mingw-gcc 4.4.0, output: `Value: -1 - Consumed: -1 (sscanf())` / `Value: 0 - Consumed: 0 (strtoul())` – Christoph Sep 15 '09 at 13:35
  • I was using Cygwin GCC 3.4.4 here. So, not only are results *different*, they seem to be... well, not well-defined for scanf() either. :-( – DevSolar Sep 15 '09 at 13:38
  • http://sourceware.org/ml/newlib/2007/msg00585.html - a post to the newlib mailing list - indicates that the results I got are what the developers of newlib considered "correct", too. In the third patch hunk, you can see that they are indeed doing a double ungetc() there. /me still very confused. – DevSolar Sep 15 '09 at 13:42
  • Well JOLLY. GCC 4.1.2 on Gentoo: Value: 0 - Consumed: 2 - Next char: z - (sscanf()) Value: 0 - Consumed: 1 - Next char: x - (strtoul()) – DevSolar Sep 15 '09 at 14:05
  • What does `sscanf()` return in your case? I get `0`, which would be consistent with a matching failure and is what I would expect if `scanf()` and `strto*()` behave identically – Christoph Sep 15 '09 at 14:11
  • Return code of scanf() is 2, in both cases (matching the %x and the %c; %n does not count against the return code, as per spec). – DevSolar Sep 15 '09 at 14:30
  • I was able to reproduce your results using MSVC 2005. – DevSolar Sep 15 '09 at 14:39
  • AIX / xlC has yet another result to offer: Value: -1 - Consumed: -1 - Next char: � - (sscanf()) / Value: 0 - Consumed: 1 - Next char: x - (strtoul()) – DevSolar Sep 15 '09 at 15:30
  • 2
    if no one agrees, you're free to do what you think is correct; in my opinion, the result should be `Value: 0 - Consumed: 1 - Next char: x` for both functions; this means `fscanf()` has to look ahead two characters; if you can't unget the second one, you should set the stream's error indicator; I don't think you should silently consume the `x` as gentoo-gcc 4.1.2 does – Christoph Sep 15 '09 at 15:59
  • 1
    Another piece of information which may or may not be relevant: http://sources.redhat.com/bugzilla/show_bug.cgi?id=1765 – AProgrammer Sep 15 '09 at 16:07
  • Seems to boil down to shoddy workmanship on the library part. – DevSolar Sep 15 '09 at 17:45
  • 1
    I agree with Christoph - the "sequence of letters and digits representing an integer" must be non-empty, otherwise it doesn't represent an integer, so in the `"0xz"` case the optional `"0x"` is *not* present, the sequence of digits is just `"0"` and the "final string" is therefore `"xz"`. – caf Sep 15 '09 at 21:43
0

I am not sure how implementing scanf() may be related to ungetc(). scanf() can use up all bytes in the stream buffer. ungetc() simply pushes a byte to the end of buffer and the offset is also changed.

scanf("%d", &x);
ungetc('9', stdin);
scanf("%d", &y);
printf("%d, %d\n", x, y);

If the input is "100", the output is "100, 9". I do not see how scanf() and ungetc() may interfere with each other. Sorry if I added a naive comment.

user172818
  • Not naive. Few people endeavour to implement standard lib functions. ;-) But ungetc()'ing is a bit more involved than just stepping back in the buffer. One, you might just have reached the end of the buffer, and read in new buffer contents - the old contents are no longer there. Two, your stream might not be buffered at all (consider setvbuf() and _IONBF). (Although in my lib I'm keeping a buffer even for _IONBF streams, as it makes things easier overall.) – DevSolar Sep 15 '09 at 20:04
0

For the input to the scanf() functions as well as to the strtol() functions, Sec. 7.20.1.4 P7 states: If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of nptr is stored in the object pointed to by endptr, provided that endptr is not a null pointer. You also have to consider that these tokens are parsed according to the rules of Sec. 6.4.4 Constants, which Sec. 7.20.1.4 P5 refers to.
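The "no conversion" case of P7 can be detected by comparing endptr with the original pointer. A minimal sketch, with `no_conversion` as a hypothetical helper name:

```c
#include <stdlib.h>

/* Hypothetical helper: returns 1 if strtol() performed no conversion on s,
   i.e. it stored s itself in endptr (7.20.1.4 P7). */
int no_conversion( const char * s, int base )
{
    char * end;

    (void)strtol( s, &end, base );
    return end == s;
}
```

For inputs without any digit of the given base, such as "z" with base 16 or an empty string, this returns 1. For "0xz" the result depends on how the library reads the "expected form", which is exactly the point in dispute in this question.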

The rest of the behavior, such as the errno value, should be implementation-specific. For example, on my FreeBSD box I got both EINVAL and ERANGE values, and the same happens under Linux, whereas the standard refers only to the ERANGE errno value.

daniel
  • The part about invalid specifications does not apply - %x *is* a valid conversion specification. And while strtol() is at liberty to set endptr to nptr, fscanf() cannot do so as per footnote 251... – DevSolar Sep 19 '09 at 05:15
  • I know, but I just want to build more complete reference, and certainly I know that it is a valid spec. – daniel Sep 19 '09 at 13:42