5

We often use fgetc like this:

int c;
while ((c = fgetc(file)) != EOF)
{
    // do stuff
}

Theoretically, if a byte in the file has the value of EOF, this code is buggy - it will break the loop early and fail to process the whole file. Is this situation possible?

As far as I understand, fgetc internally casts a byte read from the file to unsigned char and then to int, and returns it. This will work if the range of int is greater than that of unsigned char.

What happens if it's not (probably then sizeof(int)=1)?

  • Will fgetc sometimes read legitimate data equal to EOF from a file?
  • Will it alter the data it read from the file to avoid the single value EOF?
  • Will fgetc be an unimplemented function?
  • Will EOF be of another type, like long?

I could make my code fool-proof by an extra check:

int c;
for (;;)
{
    c = fgetc(file);
    if (feof(file))
        break;
    // do stuff
}

Is it necessary if I want maximum portability?

anatolyg
  • No. The `if (feof()) {...}` is useless. The code inside the `{}` will never be reached. AFTER the QUESTION edit: `if (c == EOF) break;` is sufficient; there is no need to use `feof()`. After `c = fgetc()`, `c` can be 0..0xff (assuming 8-bit chars) for actual characters, **or** -1 for EOF (which is not a normal character). – wildplasser Sep 17 '15 at 23:14
  • You said it yourself: converts to `unsigned char` and then to `int`, so `0xFF` cannot be returned as `EOF`, `-1`. An `int` does not have a size of 1. – Weather Vane Sep 17 '15 at 23:14
  • 3
    see [Can sizeof(int) ever be 1 on a hosted implementation?](https://stackoverflow.com/questions/3860943/can-sizeofint-ever-be-1-on-a-hosted-implementation) – cremno Sep 17 '15 at 23:23
  • http://port70.net/~nsz/c/c11/n1570.html#7.21.7.1p2 – too honest for this site Sep 17 '15 at 23:25
  • 2
    @cremno: Just to note that while `sizeof(int)` can be `1`, this does not imply identical ranges for `signed char` and `int` or their unsigned counterparts. A 24 bit platform might very well define a range of `0..255` for `unsigned char`, but `0..(1UL<<24)-1` for `int`. (e.g. 56300 DSPs). – too honest for this site Sep 17 '15 at 23:30
  • 1
    The standard does not require `EOF` to be `-1`. It just has to be negative. – too honest for this site Sep 17 '15 at 23:33

3 Answers

5

The C specification says that int must be able to hold values from -32767 to 32767 at a minimum. Any platform with a smaller int is nonstandard.

The C specification also says that EOF is a negative int constant and that fgetc returns "an unsigned char converted to an int" in the event of a successful read. Since unsigned char can't have a negative value, the value of EOF can be distinguished from anything read from the stream.*

*See below for a loophole case in which this fails to hold.
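
As a minimal sketch of that guarantee, assuming a hosted implementation where UCHAR_MAX <= INT_MAX (the usual case) and where tmpfile() succeeds, the helper below writes one byte to a temporary file and returns what fgetc reports for it. The name read_back_byte is made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: write one byte to a temporary file, rewind,
 * and return what fgetc reports for it.
 * Returns EOF if the temporary file cannot be created or written. */
int read_back_byte(unsigned char byte)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return EOF;

    int c = EOF;
    if (fputc(byte, f) != EOF) {
        rewind(f);      /* required between writing and reading */
        c = fgetc(f);   /* the byte as unsigned char, converted to int */
    }
    fclose(f);
    return c;
}
```

On a typical platform with 8-bit bytes, read_back_byte(0xFF) should come back as 255, which cannot compare equal to the negative EOF.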


Relevant standard text (from C99):

  • §5.2.4.2.1 Sizes of integer types <limits.h>:

    [The] implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.

    [...]

    • minimum value for an object of type int

      INT_MIN -32767

    • maximum value for an object of type int

      INT_MAX +32767

  • §7.19.1 <stdio.h> - Introduction

    EOF ... expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream

  • §7.19.7.1 The fgetc function

    If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined)

If UCHAR_MAX <= INT_MAX, there is no problem: all unsigned char values will be converted to non-negative integers, so they will be distinct from EOF.

Now, there is a funny sort of loophole here: if a system has UCHAR_MAX > INT_MAX, then a system is legally allowed to convert values greater than INT_MAX to negative integers (per §6.3.1.3, the result of converting a value to a signed type that cannot represent that value is implementation defined), making it possible for a character read from a stream to be converted to EOF.

Systems with CHAR_BIT > 8 do exist (e.g. the TI C4x DSP, which apparently uses 32-bit bytes), although I'm not sure if they are broken with respect to EOF and stream functions.
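
One way to find out whether a given implementation is exposed to this loophole is to compare the two limits from <limits.h> directly. The following is only a sketch, and the function name is hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: true when every unsigned char value fits in an int,
 * so a successful fgetc result is never negative and cannot equal EOF. */
bool fgetc_eof_is_unambiguous(void)
{
    return UCHAR_MAX <= INT_MAX;
}
```

On mainstream platforms with 8-bit char and an int of at least 16 bits this returns true; only where it returns false does the question of a character colliding with EOF arise.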

nneonneo
  • 4
    Ah, but `CHAR_BIT` can be bigger than 8, and for `CHAR_BIT >= 16` it's permissible for `sizeof(int)` to be 1. – EOF Sep 17 '15 at 23:27
  • Yuck, this is in fact a possible defect in the standard (albeit an extremely minor one). – nneonneo Sep 17 '15 at 23:42
  • Sure, but what if you have an embedded system or something that doesn't permit addressing at anything other than a word boundary? Do you define int to be larger than a word? Then you need multiple instructions to operate on a single int. Do you define char to be smaller than a word? You then need to store the char in a full word and use bitwise masking to get the part you need. Either option is at least as ugly as having sizeof(int) == 1. So the standard leaves it to the implementation to decide which option is least horrible. – Ray Sep 17 '15 at 23:49
  • @Ray: I'm not saying that `CHAR_BIT != 8` is bad; I'm saying that the standard apparently allows the possibility that a character read from a stream can be converted to `EOF`. (Note that the TI C4x DSP referenced in my answer has 32-bit everything, including `char`, `short`, `float`, `long`, `int`, `double`, except `long double` which is 40 bits). – nneonneo Sep 17 '15 at 23:52
  • 1
    Systems with `CHAR_BIT > 8` are very likely to be embedded systems, which means the C implementation is likely to be *freestanding*, which means that support for `<stdio.h>` is not required. It's certainly *possible* to have a hosted implementation with `CHAR_BIT > 8`, but given current systems it's not likely. (On the other hand, who knows what might become popular in a few decades.) – Keith Thompson Sep 18 '15 at 00:20
  • @KeithThompson: I dare not think of the portability nightmare for general-purpose software on such a platform. – EOF Sep 18 '15 at 00:23
  • @EOF: I think you could check whether `fgetc()` returned `EOF` *and then* query `feof()` and `ferror()` to avoid false positives. If they both return false, then the apparent `EOF` returned by `fgetc()` is actually a valid character. Of course nobody bothers to do that, so existing code isn't portable to such systems. – Keith Thompson Sep 18 '15 at 00:27
  • @KeithThompson: I've worked with UNIX systems where `-1` is a valid system call return value...so you had to do `errno = EOK; ret = syscall(...); if(ret == -1 && errno != EOK) { ... }` – nneonneo Sep 18 '15 at 00:31
  • 1
    @KeithThompson: Oh, I wasn't even talking about `EOF`. Just the sheer number of programs that assume `sizeof(int) == sizeof(float) == 4` or even (shudder) `sizeof(int) == sizeof(void*) == 4`... – EOF Sep 18 '15 at 00:32
  • @EOF: Happily, the proliferation of 64-bit platforms is starting to erode that latter assumption (although in 20 years we might see people assume `sizeof(void *) == 8`...) – nneonneo Sep 18 '15 at 00:39
  • @nneonneo: I would think that could potentially vary for each system call. – Keith Thompson Sep 18 '15 at 00:40
  • 1
    @EOF Those who recall 16-bit `int` and data pointer width != function pointer width do not make those "usual" assumptions. These issues happened before and will occur again. – chux - Reinstate Monica Sep 18 '15 at 01:11
5

Yes, c = fgetc(file); if (feof(file)) does work for maximum portability. It works in general and also when unsigned char and int have the same number of distinct values. This occurs on rare platforms where char, signed char, unsigned char, short, unsigned short, int, and unsigned all have the same bit width and the same number of representable values.

Note that feof(file) alone is insufficient. Code should also check ferror(file).

int c;
for (;;)
{
    c = fgetc(file);
    if (c == EOF) {
        if (feof(file)) break;
        if (ferror(file)) break;
    }
    // do stuff
}
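
For a self-contained illustration, the same loop can be wrapped in a function. This is a sketch only: the name count_chars and the choice of returning -1 on a read error are assumptions made for the example.

```c
#include <stdio.h>

/* Hypothetical helper: count the characters remaining in a stream,
 * treating an EOF return value as a stop condition only when the stream's
 * end-of-file or error indicator confirms it.
 * Returns the count, or -1 if a read error was reported. */
long count_chars(FILE *file)
{
    long n = 0;
    for (;;) {
        int c = fgetc(file);
        if (c == EOF) {
            if (feof(file))
                break;          /* genuine end of file */
            if (ferror(file))
                return -1;      /* read error */
            /* otherwise: a real character whose int value equals EOF */
        }
        n++;                    /* "do stuff" with c */
    }
    return n;
}
```

Called on a stream positioned at the start of a three-character file, the function returns 3 on ordinary platforms; the fall-through branch only matters where UCHAR_MAX > INT_MAX.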
chux - Reinstate Monica
  • 3
    Although this would be overkill on the *vast* majority of systems, based on the discussions in the other answers' comments, I am now convinced that this is the only 100% portable way to do it, and the idea of an actual *hosted* implementation with sizeof(int) == 1 will now forever haunt my nightmares. – Ray Sep 18 '15 at 01:24
  • 1
    @Ray agree with the overkill, nightmares, etc. Truly robust C code is _hard_. – chux - Reinstate Monica Sep 18 '15 at 01:27
  • Note: `if (ferror(file)) break;` is a possible false positive. The file error flag may be set before this code and `c == EOF` is due to an `unsigned char` whose value converted to `int` is `EOF`. (Only when `UCHAR_MAX > INT_MAX`). Then `ferror(file)` returns true due to the error flag's past history and without regard to the recent `fgetc()`: a _narrow_ hole. – chux - Reinstate Monica Dec 14 '19 at 22:21
0

NOTE: chux's answer is the correct one in the most general case. I'm leaving this answer up because I believe both the answer and the discussion in the comments are valuable in understanding the (rare) situations in which chux's approach is necessary.

EOF is guaranteed to have a negative value (C99 7.19.1), and as you mentioned, fgetc reads its input as an unsigned char before converting to int. So those by themselves guarantee that EOF can't be read from a file.

As for your specific questions:

  • fgetc can't read a legitimate datum equal to EOF. In the file, there's no such thing as signed or unsigned; it's just bit sequences. It's C that interprets 1000 1111 differently depending on whether it's being treated as signed or unsigned. fgetc is required to treat it as unsigned, so negative numbers (other than EOF) cannot be returned.

    Addendum: It can't read EOF for the unsigned char part, but when it converts the unsigned char to an int, if the int is not capable of representing all values of the unsigned char, then the behavior is implementation-defined (6.3.1.3).

  • fgetc is required by the standard for hosted implementations, but freestanding implementations are permitted to omit most of the standard library functions (some are apparently required, but I couldn't find the list.)

  • EOF won't require a long, since fgetc needs to be able to return it and fgetc returns an int.

  • As far as altering the data goes, it can't change the value exactly, but since fgetc is specified to read "characters" from the file as opposed to chars, it could potentially read in 8 bits at a time even if the system otherwise defines CHAR_BIT to be 16 (which is the minimum value it could have if sizeof(int) == 1, since INT_MIN <= -32767 and INT_MAX >= 32767 are required by 5.2.4.2). In that case, the input character would be converted to an unsigned char that just always had its high bits 0. Then it could make the conversion to int without losing precision. (In practice, this just won't come up, since machines don't generally have 16-bit bytes.)
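
The arithmetic behind the 16-bit figure in the last point can be written down as a small check. This is a sketch with a hypothetical function name: an int has to cover at least -32767..32767, which takes at least 16 bits, so its storage of sizeof(int) * CHAR_BIT bits can only shrink to a single byte when that byte has 16 bits or more.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: number of bits in the storage of an int.
 * Since int must cover at least -32767..32767, this is at least 16,
 * so sizeof(int) == 1 is only possible when CHAR_BIT >= 16. */
size_t int_storage_bits(void)
{
    return sizeof(int) * CHAR_BIT;
}
```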

Ray
  • No. An `unsigned char` could be 32-bit as well as `int`, so all the `unsigned char` combinations map to some `int` value, leaving no unused `int` value for `EOF`. On such systems `EOF` does overlay an `unsigned char` converted to `int`. – chux - Reinstate Monica Sep 18 '15 at 00:57
  • An implicit conversion isn't a bit-for-bit copy reinterpreted according to the new type. There are a series of rules that specify how the value changes when converting between types. In this case, we're converting to a signed type that can't represent the current value, so 6.3.1.3 applies and the value is implementation-defined or it raises an implementation-defined signal. So you're right, overlaying EOF onto a valid int value is a *possibility*, but it's not guaranteed. The good news is that the implementation is required to document what exactly happens at that point. – Ray Sep 18 '15 at 01:10
  • Agree. The salient issue is that an `unsigned char` converted to `int` _may_ have a negative value and it may equal `EOF`. – chux - Reinstate Monica Sep 18 '15 at 01:14