6

From The C Programming Language:

int c;
while ((c = getchar()) != EOF)
    putchar(c);

"... The solution is that getchar returns a distinctive value when there is no more input, a value that cannot be confused with any real character. This value is called EOF, for "end of file." We must declare c to be a type big enough to hold any value that getchar returns. We can't use char since c must be big enough to hold EOF in addition to any possible char."

I checked in stdio.h and printed the value of EOF on my system, and it's set to -1. On my system, chars are signed, although I understand that this is system dependent. So, EOF can fit in a char for my system. I rewrote the small routine above by defining c to be a char and the program works as intended. There's also a character in the ASCII character table here that appears to have a blank character corresponding to 255 which appears to act like EOF.

So, why does it appear that ASCII has a character (255) designated for EOF? This seems to contradict what is said in The C Programming Language.

JustinBlaber

5 Answers

5

When getchar() reads the byte 255, it returns 255. When getchar() finds that there is no more input, it returns -1.

If you store the result in a char, you cannot distinguish the two. But when you store them in an int, you can. (This statement is independent of the signedness of char).

Only if you know that the result was valid can you convert it to char and get the usual C-style character type.
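To make the distinction concrete, here is a minimal sketch. The helper names are hypothetical and not from any library. It assumes the usual 8-bit char and EOF defined as -1; the conversion of 255 to signed char is implementation-defined, but yields -1 on typical two's complement systems.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: is r something getchar() could return,
   i.e. EOF or a byte value in 0..UCHAR_MAX? */
int is_getchar_result(int r)
{
    return r == EOF || (r >= 0 && r <= UCHAR_MAX);
}

/* Correct idiom: keep the result in an int, then compare with EOF. */
int is_eof_via_int(int r)
{
    assert(is_getchar_result(r));
    int c = r;
    return c == EOF;
}

/* Squeezing the result into a signed char: on typical systems the
   byte 255 converts to -1 and becomes indistinguishable from EOF. */
int is_eof_via_signed_char(int r)
{
    assert(is_getchar_result(r));
    signed char c = (signed char)r;
    return c == EOF;
}

/* Squeezing the result into an unsigned char: EOF converts to a
   non-negative value, so the comparison with EOF never succeeds. */
int is_eof_via_unsigned_char(int r)
{
    assert(is_getchar_result(r));
    unsigned char c = (unsigned char)r;
    return c == EOF; /* c promotes to 0..UCHAR_MAX, never negative */
}
```

With these assumptions, is_eof_via_signed_char(255) reports a false end of file, is_eof_via_unsigned_char(EOF) misses a real one, and only the int version answers correctly for both inputs.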

Kerrek SB
4

So, why does it appear that ASCII has a character (255) designated for EOF?

It hasn't. More precisely, that's not the EOF "character".

The trick is that getchar() always returns a non-negative value if it has something to read. It returns -1 (which is what EOF appears to be defined as on your implementation) only if it encounters end-of-file.

The fact that char is:

  1. 8 bits wide,
  2. signed and
  3. uses a 2's complement representation,

is just a quirk of your implementation (although an overwhelmingly common one nowadays). Thus, if you use a char to store the return value of getchar(), reading the input may terminate prematurely: the character with code 255 will be mistaken for -1, a.k.a. EOF, which is an error. This is just what happened to you. Your second approach did not really work; on the contrary, it was completely broken.
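The premature termination can be reproduced with a small sketch. The function names and the three-byte sample file are made up for illustration. It assumes an 8-bit, two's complement signed char and EOF equal to -1, as on the asker's system.

```c
#include <assert.h>
#include <stdio.h>

/* Counts bytes until end of input, holding each result in an int. */
long count_bytes_int(FILE *in)
{
    assert(in != NULL);
    long n = 0;
    int c;
    while ((c = fgetc(in)) != EOF)
        n++;
    return n;
}

/* Same loop, but the result is narrowed to a signed char first.
   On typical systems the byte 0xFF becomes -1 and ends the loop early. */
long count_bytes_signed_char(FILE *in)
{
    assert(in != NULL);
    long n = 0;
    signed char c;
    while ((c = (signed char)fgetc(in)) != EOF)
        n++;
    return n;
}

/* Hypothetical helper: runs a counter over a temporary file that
   holds the three bytes 'a', 0xFF, 'b'. Returns -1 if the temporary
   file cannot be created. */
long count_in_sample(long (*counter)(FILE *))
{
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    fputc('a', f);
    fputc(0xFF, f);
    fputc('b', f);
    rewind(f);
    long n = counter(f);
    fclose(f);
    return n;
}
```

Under those assumptions, count_in_sample(count_bytes_int) sees all three bytes, while count_in_sample(count_bytes_signed_char) stops after the first one because the 0xFF byte looks like EOF.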

  • Side note: there exists an ASCII character named "EOF" with the value 26 (0x1a), but that's not relevant to the question. –  Oct 31 '13 at 20:29
  • EOF is system dependent, so I'm sure we'll find other variations. – givanse Oct 31 '13 at 20:39
  • @givanse Yes, but `EOF` is *required* to be negative by the standard. –  Oct 31 '13 at 20:41
  • @H2CO3 Just curious, but after you check that a character is valid, will any problems arise from storing it in a `char` since overflow can occur (if `char` is signed)? – JustinBlaber Oct 31 '13 at 20:42
  • @jucestain Signed integer overflow is undefined behavior if caused by an arithmetic operator (see [this question/answer](http://stackoverflow.com/questions/18922601/is-char-foo-255-undefined-behavior-if-char-is-signed)), but not when caused by an initialization, so I'd say `char ch = i;` should be fine. –  Oct 31 '13 at 20:45
  • @H2CO3 It still says the results are implementation defined... Does this mean that the character set should correspond correctly to how the system handles assigning an out-of-range value? Or should I just specify `char` as being `unsigned`? – JustinBlaber Oct 31 '13 at 20:58
  • @jucestain The results of what? The conversion? It is IB, and you can't really do anything about it. If you want to be absolutely safe, use `unsigned char`. –  Oct 31 '13 at 21:01
  • The results of `char foo = 255` when `char` is signed, as per the question/answer you posted. I.e. it's possible for `getchar()` to return the value `255`, which I store in a `char`. I think just to be safe I'll use `unsigned char` when using `getchar()`. – JustinBlaber Oct 31 '13 at 21:06
  • @jucestain Or just use the common (and correct) idiom and use `int`, and you will then be *completely* safe. –  Oct 31 '13 at 21:08
3

According to the manual page for getchar(), it always returns an int value:

#include <stdio.h>
...
int getchar(void);
...
RETURN VALUE
fgetc(), getc() and getchar() return the character read as 
an unsigned char cast to an int or EOF on end of file or error.

Thus, using char instead of int causes truncation (with a 32-bit int, int -1 (0xffffffff) becomes char -1 (0xff)) and may cause errors.
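The behaviour described in the manual excerpt can be observed directly. The following is only a sketch: the struct and function names are hypothetical, and it relies on tmpfile() being able to create a temporary file.

```c
#include <stdio.h>

/* Hypothetical record of what fgetc() returned for a one-byte file. */
struct read_back {
    int ok;      /* 1 if the temporary file could be created */
    int first;   /* result of the first fgetc() call */
    int second;  /* result of the second fgetc() call */
    int hit_eof; /* 1 if feof() reported end of file afterwards */
};

/* Writes a single byte to a temporary file and reads it back twice. */
struct read_back read_back_byte(unsigned char byte)
{
    struct read_back r = { 0, EOF, EOF, 0 };
    FILE *f = tmpfile();
    if (f == NULL)
        return r;
    r.ok = 1;
    fputc(byte, f);
    rewind(f);
    r.first = fgetc(f);        /* the byte, as unsigned char converted to int */
    r.second = fgetc(f);       /* nothing left to read: EOF */
    r.hit_eof = feof(f) != 0;
    fclose(f);
    return r;
}
```

For the byte 0xFF, the first call returns 255 rather than -1, and only the second call returns EOF, which is why the two cases stay distinguishable in an int.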

Michael
  • Thanks, my initial problem was I didn't realize `getchar` reads a byte as an `unsigned char` and then casts it to an `int`. Returning `-1` for `EOF` makes sense after realizing this. – JustinBlaber Nov 06 '13 at 19:13
2

To understand how this works, imagine what the person who wrote getchar was thinking. You need to read a file. Start by creating a routine, for example:

unsigned char get_me_a_byte(file)... // 0..255

now you want to read all bytes from a file:

unsigned char c;

while( c = get_me_a_byte(file) ) // while( (c = get_me_a_byte(file)) != 0 )
{
  ... do something
}

The problem is that it will stop when a zero byte is encountered, but you want to stop once everything has been read. Now you are getting smarter: you know a file can be thought of as a sequence of bytes. What if your get_me_a_byte could return a 16-bit or 32-bit type? Then you could use some value that a byte cannot hold as the end-of-file marker.

bingo

Since the decision is yours, you may have:

int get_me_a_byte_U(file) ... // returning bytes as 0..255
int get_me_a_byte_S(file) ... // returning bytes as -128..127

Now you can do:

int c;
while( (c = get_me_a_byte_U(file)) != UUU ) ....

where UUU could be anything from 256 to INT_MAX on your platform (or any negative value)

Similarly:

int c;
while( (c = get_me_a_byte_S(file)) != SSS ) ....

where SSS could be anything in INT_MIN..-129 or 128..INT_MAX

Now, if you choose the first method, there is a question: what should the value of UUU (your EOF) be?

(-1) is a good choice for EOF because, regardless of the bit width of the variable you assign it to, it remains (-1). By 'remains -1' I mean it is always the all-ones bit pattern (on a two's complement machine).

char c = -1; // c = 11111111b / 0xFF / 255 if read as unsigned (assuming your char is signed 8-bit)
short s = -1; // s = 1111111111111111b / 0xFFFF / 65535 if read as unsigned (16-bit short)
int i = -1; // i = 11111111111111111111111111111111b / 0xFFFFFFFF / 4294967295 if read as unsigned (32-bit int)

Now it should be obvious.
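The "all ones at any width" property can be checked without assuming particular type sizes. This sketch uses hypothetical function names. It relies on the C rule that converting -1 to an unsigned type yields that type's maximum value, which has every value bit set.

```c
#include <limits.h>

/* -1 converted to unsigned char is UCHAR_MAX (all bits set). */
int minus_one_is_all_ones_uchar(void)
{
    return (unsigned char)-1 == UCHAR_MAX;
}

/* -1 converted to unsigned short is USHRT_MAX. */
int minus_one_is_all_ones_ushort(void)
{
    return (unsigned short)-1 == USHRT_MAX;
}

/* -1 converted to unsigned int is UINT_MAX. */
int minus_one_is_all_ones_uint(void)
{
    return (unsigned int)-1 == UINT_MAX;
}

/* Widening a signed char that holds -1 keeps the value -1. */
int widen_signed_char(signed char c)
{
    return c;
}
```

Each of the three checks returns 1, and widen_signed_char(-1) is still -1 as an int, so the marker keeps its meaning whichever signed width it passes through.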

Artur
1

There is no contradiction.

  • EOF is NOT a character, just a condition encountered when reading a file.
  • The byte value 255 (often loosely called "ASCII 255") sometimes corresponds to a non-breaking space, a.k.a. the HTML entity &nbsp;

As noted in the comments, ASCII encodes only 128 characters, so beyond that you'll find different encodings.

From the table that you linked to I would just say:

255 is a non-printable character

givanse
  • 3
    ASCII only has 0 .. 127. There are dozens of character sets that use the other byte values in incompatible ways, but none of them is ASCII (and the term "extended ASCII" for any one of them is also misleading). –  Oct 31 '13 at 20:28