
I've got a copy of the book "The C Programming Language", and one of the first chapters shows a short example of code that copies characters from stdin to stdout. Here is the code:

main()
{
    int c;
    c = getchar();
    while (c != EOF) {
        putchar(c);
        c = getchar();
    }
}

The author then says he used int instead of char because char isn't big enough to hold EOF. But when I tried it with char, it worked the same way, even with Ctrl+Z. Stack Overflow says this is a duplicate, so I'll ask briefly:

Why is using ‘char’ wrong?

  • Do you know how to type the character ÿ on your keyboard? Or maybe just copy-and-paste it out of this comment. Anyway, try running your `char`-using program with ÿ as input if you can. Your program will probably stop reading, as if you had typed control-Z. – Steve Summit Oct 10 '22 at 19:55
  • In general, when working with C or C++, "it works for me" is not a good enough reason to justify doing something a certain way. Undefined Behavior can produce many different outcomes, including code that appears to work 99% of the time. – 0x5453 Oct 10 '22 at 19:57
  • @SteveSummit: whether that works depends on whether the codeset used by the terminal is UTF-8 (it won't) or some single-byte code set such as ISO 8859-15 (it probably will). – Jonathan Leffler Oct 10 '22 at 19:57
  • @JonathanLeffler Yes, I'm gambling on the relative lack of support for UTF-8 in Windows. Some day I'll have to figure out if there's a UTF-8 use case that tickles the bug. – Steve Summit Oct 10 '22 at 19:59
  • @nonamedelete In general, `getchar` can return 257 different things: all 256 possibilities for an 8-bit character value, *plus* the value `EOF`, which is by definition not equal to any valid `char` value. But type `char` is (usually!) not big enough to hold 257 different values. – Steve Summit Oct 10 '22 at 20:03
  • @SteveSummit: Probably not — the 0xFF byte is simply invalid in UTF-8. Indeed, all the bytes from 0xF5-0xFF cannot occur in valid UTF-8 (and neither can 0xC0 or 0xC1). 0xFFFF might almost work if working with a UTF-16 encoding, but it is a non-character (see [Corrigendum #9](https://www.unicode.org/versions/corrigendum9.html)). – Jonathan Leffler Oct 10 '22 at 20:04
  • @JonathanLeffler Yeah, I was afraid of that. (So how are we supposed to convince the new kids that using `char` here is RONG? :-\ ) – Steve Summit Oct 10 '22 at 20:11
  • @SteveSummit: I fear that it will require the (mis)reading of binary data rather than keyboard input to generate the problem. It does get a bit fraught, but using `char` instead of `int` is still wrong, even though it's harder to demonstrate the issue. – Jonathan Leffler Oct 10 '22 at 20:37
  • @JonathanLeffler I'd *like* to just point 'em at [the FAQ list](https://c-faq.com/stdio/getcharc.html). (But if they're not gonna listen to K&R, who am I to imagine they'll listen to me? :-\ ) – Steve Summit Oct 10 '22 at 21:05
  • @JonathanLeffler: U+FFFF is a non-character, certainly, but `getchar` reads a byte at a time, and lots of UTF-16 codes contain a byte with the value 0xFF, starting with the BOM (which often occupies the first two bytes of a UTF-16 file). – rici Oct 11 '22 at 00:19

1 Answer


If you write, for example,

char c;
c = getchar();
while (c != EOF) {
    //...
}

then in the condition c != EOF the value of the object c is promoted to the type int, and two values of type int are compared.

The problem with declaring the variable c as having the type char is that the type char can behave either as the type signed char or as the type unsigned char. Which one applies is implementation-defined, and some compilers let you choose with an option. If the type char behaves as the type unsigned char, then the expression c != EOF always evaluates to true and the loop never ends.
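A minimal sketch of the two cases, including the signed case raised in the comments under the question (the ÿ byte, 0xFF). The helper names are hypothetical and not from the book or the Standard. Each helper takes a value of the kind getchar returns, stores it in the given type, and reports whether a later comparison with EOF would succeed. The result of converting 0xFF to signed char is implementation-defined, and EOF is only guaranteed to be negative, so treat the signed case as typical behaviour rather than a guarantee.

```c
#include <stdio.h>   /* EOF */

/* Hypothetical helper: store a getchar()-style result in an unsigned char,
   then compare with EOF the way `c != EOF` would. */
int eof_seen_via_unsigned_char(int r)
{
    unsigned char c = (unsigned char)r; /* EOF is reduced modulo UCHAR_MAX + 1 */
    int promoted = c;                   /* integer promotion: never negative */
    return promoted == EOF;             /* EOF is negative, so never equal */
}

/* Hypothetical helper: the same with signed char. Converting an
   out-of-range value such as 0xFF is implementation-defined; on common
   two's-complement systems it becomes -1, which is often the value of EOF. */
int eof_seen_via_signed_char(int r)
{
    signed char c = (signed char)r;
    int promoted = c;                   /* sign is preserved by the promotion */
    return promoted == EOF;
}

/* Hypothetical helper: with int, every byte value and EOF stay distinct. */
int eof_seen_via_int(int r)
{
    return r == EOF;
}
```

With these helpers, eof_seen_via_unsigned_char(EOF) is 0, so a real end of file is missed, while eof_seen_via_signed_char(0xFF) is typically 1, so a valid data byte is mistaken for end of file. Only the int version tells the two apart for every input.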

Note that the C Standard defines EOF as follows:

EOF which expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream;

So after this assignment

c = getchar();

if c is declared as having the type char and the type char behaves as the type unsigned char, then the value of c is non-negative after the integer promotion and hence can never be equal to the negative value EOF.

To simulate the situation, just declare the variable c as having the type unsigned char:

unsigned char c;
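The effect on the loop can be sketched without a terminal, assuming an in-memory buffer in place of stdin. Everything here is hypothetical scaffolding: next_byte stands in for getchar, and the cap parameter exists only so that the otherwise endless unsigned char loop stops.

```c
#include <stdio.h>   /* EOF */
#include <stddef.h>  /* size_t */

/* Hypothetical stand-in for getchar(): yields each byte of buf as an
   unsigned char value converted to int, then EOF on every later call. */
static int next_byte(const unsigned char *buf, size_t len, size_t *pos)
{
    return (*pos < len) ? buf[(*pos)++] : EOF;
}

/* The copy loop with `unsigned char c`. Returns the number of iterations,
   stopping at `cap` because `c != EOF` never becomes false. */
size_t count_loop_unsigned_char(const unsigned char *buf, size_t len, size_t cap)
{
    size_t pos = 0, iterations = 0;
    unsigned char c = (unsigned char)next_byte(buf, len, &pos);
    while (c != EOF && iterations < cap) {  /* c promotes to a non-negative int */
        ++iterations;
        c = (unsigned char)next_byte(buf, len, &pos);
    }
    return iterations;
}

/* The same loop with `int c`, as in the book. */
size_t count_loop_int(const unsigned char *buf, size_t len, size_t cap)
{
    size_t pos = 0, iterations = 0;
    int c = next_byte(buf, len, &pos);
    while (c != EOF && iterations < cap) {
        ++iterations;
        c = next_byte(buf, len, &pos);
    }
    return iterations;
}
```

For a three-byte buffer and a cap of 1000, the int version stops after 3 iterations, while the unsigned char version runs until it reaches the cap. A compiler may warn that the comparison in the unsigned char loop is always true, which is exactly the point.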
Vlad from Moscow
  • *If the type char behaves as the type unsigned char then the expression c != EOF will always evaluate to true* Yes, although we already know that didn't happen for the OP, who claims that control-Z still works. – Steve Summit Oct 10 '22 at 20:00
  • *The problem with declaring the variable c as having the type char is that the type char can behave either as the type signed char or unsigned char* Ah. So if OP had explicitly used `signed char c` or `unsigned char c`, the program would have been fine? :-) – Steve Summit Oct 10 '22 at 20:01
  • @SteveSummit getchar returns an integer value that represents an unsigned character except when EOF occurs. So using unsigned char as I pointed out results in an infinite loop. – Vlad from Moscow Oct 10 '22 at 20:05
  • @SteveSummit: On a Unix-like system, control-Z might suspend the current program and return the shell prompt. The program hasn't exited; it's merely stopped until it is restarted. I've seen people confused by that. – Jonathan Leffler Oct 10 '22 at 20:07
  • @VladfromMoscow I thought the OP's report that the program had *not* gone into an infinite loop ruled out that hypothesis. But Jonathan Leffler's supposition about control-Z on a Unix-like system throws that into doubt. – Steve Summit Oct 10 '22 at 20:10
  • @SteveSummit It seems that in his system the type char by default behaves as the type signed char. – Vlad from Moscow Oct 10 '22 at 20:11
  • @VladfromMoscow I agree. That's why I suggested typing ÿ. (Although, as Jonathan Leffler has also pointed out, that might not work, either, on a UTF-8 system.) – Steve Summit Oct 10 '22 at 20:13