
I wrote a complete application in C99 and tested it thoroughly on two GNU/Linux-based systems. I was surprised when compiling it with Visual Studio on Windows resulted in the application misbehaving. At first I couldn't ascertain what was wrong, but after stepping through it in the VC debugger I discovered a discrepancy concerning the fscanf() function declared in stdio.h.

The following code is sufficient to demonstrate the problem:

#include <stdio.h>

int main() {
    unsigned num1, num2, num3;

    FILE *file = fopen("file.bin", "rb");
    fscanf(file, "%u", &num1);
    fgetc(file); // consume and discard \0
    fscanf(file, "%u", &num2);
    fgetc(file); // ditto
    fscanf(file, "%u", &num3);
    fgetc(file); // ditto
    fclose(file);

    printf("%u, %u, %u\n", num1, num2, num3); // %u matches the unsigned arguments

    return 0;
}

Assume that file.bin contains exactly 512\0256\0128\0:

$ hexdump -C file.bin
00000000  35 31 32 00 32 35 36 00  31 32 38 00              |512.256.128.|

Now, when compiled with GCC 4.8.4 on an Ubuntu machine, the resulting program reads the numbers as expected and prints 512, 256, 128 to stdout.
Compiling it with MinGW 4.8.1 on Windows gives the same, expected result.

However, there seems to be a major difference when I compile the code using Visual Studio Community 2015; namely, the output is:

512, 56, 28

As you can see, the trailing null characters have already been consumed by fscanf(), so fgetc() captures and discards characters that are essential to data integrity.

Commenting out the fgetc() lines makes the code work in VC, but breaks it in GCC (and possibly other compilers).

What is going on here, and how do I turn this into portable C code? Have I hit undefined behavior? Note that I'm assuming the C99 standard.

rhino
  • 1
    You are using text reading functions, yet you open the file in binary mode, and it contains no text-type line endings. – Weather Vane Feb 23 '17 at 16:12
  • 1
    I'm not sure that reading a file containing `NUL` characters with text reading functions such as `fscanf`is a good idea anyway. – Jabberwocky Feb 23 '17 at 16:13
  • Could you try a simple experiment for me? Include `<ctype.h>` on the misbehaving compiler, and see what you get when you run `printf("%d\n", isspace('\0'))`? – Sergey Kalinichenko Feb 23 '17 at 16:16
  • @WeatherVane opening with "r" instead of "rb" doesn't help. – Jabberwocky Feb 23 '17 at 16:17
  • @dasblinkenlight `printf("%d\n", isspace('\0'))` prints `0`. But it's rather the library that misbehaves than the compiler. – Jabberwocky Feb 23 '17 at 16:18
  • Meaning the zero is not recognized as space character. I wonder what would be return values of the functions (check them, please..). – Eugene Sh. Feb 23 '17 at 16:19
  • 1
    @dasblinkenlight I did just now, and I can confirm that it is 0. – rhino Feb 23 '17 at 16:19
  • Adding an extra `%n` to the format strings (plus an extra argument) could also help – joop Feb 23 '17 at 16:24
  • I have a suspicion that `fgets` is consuming the trailing null byte in case of MSVC. – Eugene Sh. Feb 23 '17 at 16:28
  • @joop each `fscanf(file, "%u%n", &number, &count)` stores `4` in `count`. – rhino Feb 23 '17 at 16:34
  • 2
    You could make it portable by reading byte-by-byte with `getchar` and making the number conversions yourself. – Weather Vane Feb 23 '17 at 16:39
  • Could be a problem in the 2015 version. My old VS2008 behaves the same as gcc: fscanf %u does not consume the null and count is 3. But a null should not occur in a text file, and fscanf is intended to read text files... I've just checked: according to the standard, isspace('\0') shall be false (7.4.1.10) *The standard white-space characters are the following: space (' '), form feed ('\f'), new-line ('\n'), carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v'). In the "C" locale, isspace returns true only for the standard white-space characters.* – Serge Ballesta Feb 23 '17 at 16:47
  • I suspect using `"%3u"` will solve this issue with _this_ set of data, but not a good general solution. – chux - Reinstate Monica Feb 23 '17 at 16:53
  • If the data is *always* in sets of 4, a less painful way than my last suggestion would be to `fread` in groups of 4 into a `char[]` array which will then be a 0-terminated string for you to convert. – Weather Vane Feb 23 '17 at 16:54
  • 1
    I suggest a solution that takes care of 2 problems, 1) problem stated here, 2) testing that the character following the digits is in fact truly a _null character_ - this is not tested in present code. Create a byte-by-byte function that handles this as commented by @WeatherVane – chux - Reinstate Monica Feb 23 '17 at 16:56
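The byte-by-byte approach suggested in the comments above could look roughly like the following. This is only a sketch: the function name read_nul_terminated_uint and its return convention are made up for illustration, and it assumes each number is a run of decimal digits followed by exactly one null character. Because it never calls fscanf(), it should not depend on how a particular library treats the \0.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the original post): reads one unsigned
 * decimal number that must be followed by a single '\0' terminator.
 * Returns 1 on success, 0 on malformed input, overflow, or end of file.
 */
int read_nul_terminated_uint(FILE *fp, unsigned *out)
{
    unsigned value = 0;
    int digits = 0;
    int c;

    assert(fp != NULL && out != NULL);

    /* Accumulate digits one byte at a time. */
    while ((c = fgetc(fp)) != EOF && isdigit(c)) {
        unsigned d = (unsigned)(c - '0');
        if (value > (UINT_MAX - d) / 10)
            return 0;               /* value * 10 + d would overflow */
        value = value * 10 + d;
        digits++;
    }

    /* Require at least one digit and a null character right after it. */
    if (digits == 0 || c != '\0')
        return 0;

    *out = value;
    return 1;
}
```

Calling it three times on a stream opened with fopen("file.bin", "rb") should yield 512, 256 and 128 for the file shown in the question, and it also checks that the byte after each number really is a null character, which the original code does not.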

2 Answers


TL;DR: you've been bitten by MSVC non-conformance, a longstanding problem that MS has never shown much interest in solving. If you must support MSVC in addition to conforming C implementations, then one way to do so would be to engage conditional compilation directives to suppress the fgetc() calls when the program is compiled via MSVC.
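A minimal sketch of that conditional-compilation idea follows. The helper names skip_separator and read_three are invented here for illustration. _MSC_VER is the macro MSVC predefines; whether every MSVC version needs the workaround is an assumption (one comment on the question reports that VS2008 behaves like GCC), so a version check may be more appropriate than a plain #ifdef.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: discards the '\0' separator after a %u conversion,
 * except when built with MSVC, whose fscanf() reportedly consumed it already.
 */
static void skip_separator(FILE *fp)
{
#ifdef _MSC_VER
    (void)fp;           /* assumption: the separator is already gone */
#else
    (void)fgetc(fp);    /* conforming libraries leave it unread */
#endif
}

/* Hypothetical helper: reads three separated numbers; returns 1 on success. */
int read_three(FILE *fp, unsigned nums[3])
{
    int i;

    assert(fp != NULL && nums != NULL);
    for (i = 0; i < 3; i++) {
        if (fscanf(fp, "%u", &nums[i]) != 1)
            return 0;
        skip_separator(fp);
    }
    return 1;
}
```

The drawback of this design is that it keys on the compiler rather than on the observed behavior of the library, so it silently breaks if a later runtime fixes the problem.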


I'm inclined to agree with the comments that reading binary data via formatted I/O functions is a questionable plan. Even more questionable, however, is the combination of

compil[ing] it using Visual Studio on Windows

and

assuming the C99 standard.

As far as I am aware, no version of MSVC conforms to C99. Very recent versions may do a better job of conforming to C2011, in part because C2011 makes some features optional that were mandatory in C99.

Whichever version of MSVC you're using, however, I think it fails to conform with the standard (both C99 and C2011) in this area. Here is the relevant text from C99, section 7.19.6.2:

A conversion specification is executed in the following steps:

[...]

An input item is read from the stream [...]. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence. The first character, if any, after the input item remains unread.

The standard is quite clear that the first character that does not match the input sequence remains unread, so the only ways MSVC could be considered conforming are if the \0 characters could be construed as being part of (and terminating) a matching input sequence, or if fgetc() were permitted to skip \0 characters. I see no justification for the latter, especially given that the stream was opened in binary mode, so let's consider the former.

For a u conversion specifier, a matching input sequence is defined as one that

Matches an optionally signed decimal integer, whose format is the same as expected for the subject sequence of the strtoul function with the value 10 for the base argument.

The "subject sequence of the strtoul function" is defined in that function's specifications:

First, they decompose the input string into three parts: an initial, possibly empty, sequence of white-space characters (as specified by the isspace function), a subject sequence resembling an integer represented in some radix determined by the value of base, and a final string of one or more unrecognized characters, including the terminating null character of the input string.

Note in particular that the terminating null character is explicitly attributed to the final string of unrecognized characters. It is not part of the subject sequence, and therefore should not be matched by fscanf() when it converts input according to a u specifier.
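The same point can be observed directly with strtoul(), which reports where its conversion stopped through its second argument. The sketch below uses a made-up wrapper, consumed_length, purely to illustrate this; it is not part of the question's program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: converts the start of s with strtoul() in base 10
 * and returns how many characters the conversion consumed (any leading
 * white space plus the subject sequence).
 */
size_t consumed_length(const char *s, unsigned long *value)
{
    char *end;

    assert(s != NULL && value != NULL);
    *value = strtoul(s, &end, 10);
    return (size_t)(end - s);
}
```

For the string "512" the returned length is 3, so the end pointer refers to the terminating null character itself: strtoul() stops in front of it rather than treating it as part of the number.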

John Bollinger

The MSVC implementation of fscanf is apparently "trashing" the NUL character next to the 512:

fscanf(file, "%u", &num1);

According to the fscanf documentation, this should not take place (emphasis mine):

For every conversion specifier other than n, the longest sequence of input characters which does not exceed any specified field width and which either is exactly what the conversion specifier expects or is a prefix of a sequence it would expect, is what's consumed from the stream. The first character, if any, after this consumed sequence remains unread.

Note that this is different from the situation where one wants to skip trailing white-space characters, as in the following statement:

fscanf(file, "%u ", &num1); // notice "%u "

The spec says that this skipping occurs only for characters that isspace() classifies as white space, which, as checked, is not the case here (that is, isspace('\0') yields 0).

A hacky, regex-like workaround that works in both MSVC and GCC may be to replace fgetc with:

fscanf(file, "%*1[^0-9+-]"); // skip at most one non-%u character

or, more portably, by replacing the implementation-defined 0-9 range with literal digits:

fscanf(file, "%*1[^0123456789+-]"); // skip at most one non-%u character
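
The same "skip at most one non-%u character" idea can also be written without a scanset, using fgetc() and ungetc(). This is a sketch with an invented helper name, skip_one_non_number_char, and it assumes the separator is never a digit or a sign:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: discards the next character unless it could start
 * a %u conversion (a digit or a sign); such a character is pushed back.
 * Does nothing at end of file.
 */
void skip_one_non_number_char(FILE *fp)
{
    int c;

    assert(fp != NULL);
    c = fgetc(fp);
    if (c == EOF)
        return;
    if ((c >= '0' && c <= '9') || c == '+' || c == '-')
        ungetc(c, fp);  /* separator already consumed: restore the digit */
}
```

Calling it in place of each fgetc(file) in the question's program should behave the same whether or not the preceding fscanf() has already swallowed the null character, since a single character of pushback is guaranteed by the standard.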
Grzegorz Szpetkowski
  • A `'-'` in a negated _scanlist_ is tricky as the 2nd `-` in `[^0-9+-]` looks like it is introducing another range. Suggest `"%*1[^-+0123456789]"`. – chux - Reinstate Monica Feb 23 '17 at 18:29
  • @chux: I agree, It's tricky, however I believe it's okay. C11 7.21.6.2/12 says: _If a `-` character is in the scanlist and is not the first, nor the second where the first character is a `^`, **nor the last character**, the behavior is implementation-defined._ – Grzegorz Szpetkowski Feb 23 '17 at 18:35
  • 1
    LSNED --> UV. Still, `"%*1[^0-9+-]"` is implementation-defined due to the first `-`, yet reasonable. – chux - Reinstate Monica Feb 23 '17 at 19:39
  • See [`scanf()` asking twice for input while expect it to ask only once](http://stackoverflow.com/questions/15740024/scanf-asking-twice-for-input-while-i-expect-it-to-ask-only-one/15740124#15740124) and [Trailing blank in `scanf()` format strings](http://stackoverflow.com/questions/19499060/what-is-difference-between-scanfd-and-scanfd) for discussions about the inadvisability of using trailing blanks in `scanf()` format strings. They're outrageously awful if a user might ever type the input to the program. – Jonathan Leffler Oct 22 '17 at 17:05