43

My view is that a C implementation cannot satisfy the specification of certain stdio functions (particularly fputc/fgetc) if sizeof(int)==1, since the int needs to be able to hold any possible value of unsigned char or EOF (-1). Is this reasoning correct?

(Obviously sizeof(int) cannot be 1 if CHAR_BIT is 8, due to the minimum required range for int, so we're implicitly only talking about implementations with CHAR_BIT>=16, for instance DSPs, where typical implementations would be a freestanding implementation rather than a hosted implementation, and thus not required to provide stdio.)
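Whether a given implementation falls into the problematic category can be checked from `<limits.h>` alone. The following is a minimal sketch; the function names `uchar_exceeds_int` and `int_is_one_byte` are made up for illustration and are not part of any library.

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if some unsigned char value
   cannot be represented in int on this implementation. */
int uchar_exceeds_int(void)
{
    return UCHAR_MAX > INT_MAX;
}

/* Hypothetical helper: returns 1 if int occupies exactly one byte,
   which is only possible when CHAR_BIT >= 16. */
int int_is_one_byte(void)
{
    return sizeof(int) == 1;
}
```

On the usual CHAR_BIT==8 platforms both functions return 0; the question only concerns implementations where they would return 1.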

Edit: After reading the answers and some linked references, some thoughts on ways it might be valid for a hosted implementation to have sizeof(int)==1:

First, some citations:

7.19.7.1(2-3):

If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined).

If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, the end-of-file indicator for the stream is set and the fgetc function returns EOF. Otherwise, the fgetc function returns the next character from the input stream pointed to by stream. If a read error occurs, the error indicator for the stream is set and the fgetc function returns EOF.

7.19.8.1(2):

The fread function reads, into the array pointed to by ptr, up to nmemb elements whose size is specified by size, from the stream pointed to by stream. For each object, size calls are made to the fgetc function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object. The file position indicator for the stream (if defined) is advanced by the number of characters successfully read.
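To make the fread wording concrete, here is a rough model of fread expressed as repeated fgetc calls. It is only a sketch of the semantics quoted above, not the code of any real library; `model_fread` is a hypothetical name, and the choice to consult feof/ferror whenever fgetc returns EOF is an assumption about how an implementation with sizeof(int)==1 would have to behave.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model of fread built on fgetc; returns the number of
   complete elements stored. */
size_t model_fread(void *ptr, size_t size, size_t nmemb, FILE *stream)
{
    unsigned char *bytes = ptr;
    size_t done = 0;

    if (size == 0 || nmemb == 0)
        return 0;

    for (; done < nmemb; done++) {
        for (size_t i = 0; i < size; i++) {
            int c = fgetc(stream);
            /* Check the indicators so that a data value that happens
               to compare equal to EOF is not taken for end-of-file. */
            if (c == EOF && (feof(stream) || ferror(stream)))
                return done;    /* a partial element is not counted */
            bytes[done * size + i] = (unsigned char)c;
        }
    }
    return done;
}
```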

Thoughts:

  • Reading back unsigned char values outside the range of int could simply have implementation-defined behavior in the implementation. This is unsettling, as it means that using fwrite and fread to store binary structures (which, while it results in nonportable files, is supposed to be an operation you can perform portably on any single implementation) could appear to work but silently fail. I originally believed the behavior was undefined; I accept that an implementation might not have a usable filesystem, but it's a lot harder to accept that an implementation could have a filesystem that automatically invokes nasal demons as soon as you try to use it, with no way to determine that it's unusable. Now that I realize the behavior is implementation-defined and not undefined, it's not quite so unsettling, and I think this might be a valid (although undesirable) implementation.

  • An implementation with sizeof(int)==1 could simply define the filesystem to be empty and read-only. Then there would be no way an application could read any data written by itself, only from an input device on stdin which could be implemented so as to only give positive char values which fit in int.

Edit (again): From the C99 Rationale, 7.4:

EOF is traditionally -1, but may be any negative integer, and hence distinguishable from any valid character code.

This seems to indicate that sizeof(int) may not be 1, or at least that such was the intention of the committee.

bmargulies
R.. GitHub STOP HELPING ICE
  • even if sizeof(char) == sizeof(int), are ints and chars required to represent the same range ? i.e. could a system provide a 16 bit char which you're only guaranteed to be able to use ,say,8 bit values, while an int makes use of all 16 (or - CHAR_MAX being less than INT_MAX, etc.) ? – nos Oct 05 '10 at 18:30
  • @nos: No. `sizeof()` is in terms of `unsigned char` units, which are the fundamental representation of any type. See "Representation of Types" (6.2.6) in the C standard. The other direction is possible, though; some bits of `int` could be padding bits, trap bits, etc. – R.. GitHub STOP HELPING ICE Oct 05 '10 at 23:21
  • @nos: I take that back. If `sizeof(int)` is 1, `int` cannot have any padding bits/trap bits due to the integer conversion rank and promotion rules in 6.3.1.1. Specifically, paragraph 3 says "The integer promotions preserve value including sign." This also means that if `sizeof(int)` is 1 and `signed char` is twos complement, `int` must also be twos complement (or `SCHAR_MIN` could not be preserved by promotion). – R.. GitHub STOP HELPING ICE Oct 06 '10 at 16:27
  • @R: If sizeof(int) is one, 'int' could have extra padding/trap bits iff those same padding/trap bits exist for 'char'. Likewise, if sizeof(int) is not one, a 'char' may have extra padding bits if such bits also serve as padding in larger types. For example, a machine with 13-bit memory and registers could 'pretend' to be an 8-bit machine, if the unused bits did not affect the behavior of any legitimate program. – supercat Dec 08 '10 at 05:23
  • @supercat: Padding/trap bits for `char` *do not exist* as far as the formal language is concerned. That doesn't mean they're not there in the hardware. It means they're unobservable and therefore irrelevant. – R.. GitHub STOP HELPING ICE Dec 08 '10 at 13:10
  • @R: They would be unobservable in any legitimate program. That does not mean that they might not have effects on an illegitimate program. For example, an implementation which uses parity-checked memory could deliberately mis-set the parity bits on any memory holding uninitialized data, or one which was more focused on correctness than efficiency could tag every byte in an array with the address of its 'base', allowing for precise trapping of out-of-bounds access. A legitimate program would never see such things, but that wouldn't mean they'd be of no interest to a programmer. – supercat Dec 08 '10 at 15:37
  • @R: Also, I'd expect that a hardware implementation which e.g. only had 64-bit floating-point maths could decide to act like a C implementation with a 51-bit char/int/long type, if all "unsigned" integer operations were done on such quantities, and all divisions were truncated. Signed integer operations could simply be done on floats directly, provided they were truncated, since accessing a signed int outside its defined range is UB. Is there any requirement that the maximum "defined" range for a signed type be smaller than for unsigned? – supercat Dec 08 '10 at 15:41
  • @supercat: there would be no way even for an illegitimate program to see or write such "padding bits". You seem to be assuming the existence of an `asm` keyword or other way of writing machine code, which is outside the scope of the C language. There would be **absolutely no way**, using just C code, to access such padding bits in `char`, so from a formal standpoint, they don't exist. – R.. GitHub STOP HELPING ICE Dec 08 '10 at 16:14
  • Regarding your float-based implementation, even if the range of signed types is required to be smaller than the range for unsigned types, you just declare it as smaller in `limits.h`. Behavior on overflow is **undefined**, so it doesn't matter if larger values somehow get generated. – R.. GitHub STOP HELPING ICE Dec 08 '10 at 16:17
  • @R: An implementation could provide certain means of writing such bits via C code, in data which a legitimate program would be forbidden from reading. You are correct in noting that because undefined behavior is precisely that, a bit which can only be read using undefined behavior does not, from a standards standpoint, exist. Nonetheless, since it may be desirable to have an implementation ensure that undefined behavior won't cause nasal demons, even though the spec doesn't require it, having extra bits could sometimes be useful. – supercat Dec 08 '10 at 18:26
  • @R: BTW, I've sometimes thought it would be useful for a C compiler to offer 'unchecked' unsigned types, whose out-of-range behavior would be explicitly UB. That would allow variables to have a shorter type in RAM than in registers, a useful optimization for RAM-conscious code. – supercat Dec 08 '10 at 18:40

8 Answers

24

It is possible for an implementation to meet the interface requirements for fgetc and fputc even if sizeof(int) == 1.

The interface for fgetc says that it returns the character read as an unsigned char converted to an int. Nowhere does it say that this value cannot be EOF even though the expectation is clearly that valid reads "usually" return positive values. Of course, fgetc returns EOF on a read failure or end of stream but in these cases the file's error indicator or end-of-file indicator (respectively) is also set.

Similarly, nowhere does it say that you can't pass EOF to fputc so long as that happens to coincide with the value of an unsigned char converted to an int.

Obviously the programmer has to be very careful on such platforms. This might not do a full copy:

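#include <stdio.h>

/* Stops at the first character whose int value equals EOF, even if it is data. */
void Copy(FILE *out, FILE *in)
{
    int c;
    while((c = fgetc(in)) != EOF)
        fputc(c, out);
}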

Instead, you would have to do something like (not tested!):

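#include <stdio.h>

/* Treats EOF as the end of input only if the end-of-file or error indicator is set. */
void Copy(FILE *out, FILE *in)
{
    int c;
    while((c = fgetc(in)) != EOF || (!feof(in) && !ferror(in)))
        fputc(c, out);
}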

Of course, platforms where you will have real problems are those where sizeof(int) == 1 and the conversion from unsigned char to int is not an injection. I believe that this would necessarily be the case on platforms using sign and magnitude or ones' complement for representation of signed integers.
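One way to probe a particular implementation for this is to walk every unsigned char value and see what the conversion does. The sketch below is an illustration only; `count_eof_collisions` and `conversion_round_trips` are hypothetical names. If every value survives the trip unsigned char to int and back, the conversion is an injection. Note that on an implementation with a very wide char the loops could take a long time, and that the conversion of an out-of-range value is implementation-defined (6.3.1.3) and may even raise a signal.

```c
#include <limits.h>
#include <stdio.h>

/* Counts the unsigned char values whose conversion to int equals EOF. */
unsigned long count_eof_collisions(void)
{
    unsigned long collisions = 0;
    unsigned char uc = 0;
    do {
        int as_int = (int)uc;   /* implementation-defined if uc > INT_MAX */
        if (as_int == EOF)
            collisions++;
    } while (uc++ != UCHAR_MAX); /* stops after UCHAR_MAX has been handled */
    return collisions;
}

/* Returns 1 if every value survives unsigned char -> int -> unsigned char. */
int conversion_round_trips(void)
{
    unsigned char uc = 0;
    do {
        int as_int = (int)uc;
        if ((unsigned char)as_int != uc)
            return 0;
    } while (uc++ != UCHAR_MAX);
    return 1;
}
```

Where UCHAR_MAX <= INT_MAX there are no collisions and the round trip always holds; a nonzero collision count is exactly the case where the simple `!= EOF` loop can stop early.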

CB Bailey
  • Passing `EOF` to `fputc` is completely valid because the argument is converted to `unsigned char` before being written. Thus `fputc(EOF)` is equivalent to `fputc(UCHAR_MAX)`. The conversion in the other direction, however, is undefined behavior if `UCHAR_MAX>INT_MAX`. – R.. GitHub STOP HELPING ICE Oct 05 '10 at 16:46
  • @R..: Yes, you're completely correct about passing `EOF` to `fputc`. The conversion from an `unsigned char` that can't be represented as an `int` value to `int` does not cause _undefined behaviour_, though, it is _implementation defined_. This is important because it allows an implementation to support the `fputc`/`fgetc` round trip. – CB Bailey Oct 05 '10 at 19:01
  • You are correct. I assumed the behavior was the same as signed arithmetic overflow, but conversion to a signed type is implementation-defined (according to 6.3.1.3 paragraph 3) as you say. – R.. GitHub STOP HELPING ICE Oct 05 '10 at 23:26
10

I remember this exact same question on comp.lang.c some 10 or 15 years ago. Searching for it, I've found a more current discussion here:

http://groups.google.de/group/comp.lang.c/browse_thread/thread/9047fe9cc86e1c6a/cb362cbc90e017ac

I think there are two resulting facts:

(a) There can be implementations where strict conformance is not possible. E.g. sizeof(int)==1 with ones' complement or sign-magnitude negative values or padding bits in the int type, i.e. not all unsigned char values can be converted to a valid int value.

(b) The typical idiom ((c=fgetc(in))!=EOF) is not portable (except for CHAR_BIT==8), as EOF is not required to be a separate value.
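A portable replacement for the idiom in (b) has to consult the stream indicators. The following is one possible sketch, not something taken from the linked discussion; `read_one` and `enum read_status` are hypothetical names. It assumes the error indicator was clear before the call, since a stale error indicator would otherwise be reported as a fresh error.

```c
#include <stdio.h>

enum read_status { READ_OK, READ_EOF, READ_ERROR };

/* Hypothetical helper: reads one character into *out and reports
   end-of-file or error from the stream indicators, not from the
   return value of fgetc alone. */
enum read_status read_one(FILE *in, unsigned char *out)
{
    int c = fgetc(in);
    if (c == EOF) {
        if (ferror(in))
            return READ_ERROR;
        if (feof(in))
            return READ_EOF;
        /* Neither indicator is set: c is data that equals EOF. */
    }
    *out = (unsigned char)c;
    return READ_OK;
}
```

A copy loop then becomes `while (read_one(in, &ch) == READ_OK) fputc(ch, out);`, which behaves the same as the usual idiom when CHAR_BIT==8.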

Secure
  • @R..: Ah, now I see. `fputc` couldn't work in my solution without "encoding" the character. So that excludes binary streams, but that's not a huge deal. – Potatoswatter Oct 05 '10 at 18:10
5

I don't believe the C standard directly requires that EOF be distinct from any value that could be read from a stream. At the same time, it does seem to take for granted that it will be. Some parts of the standard have conflicting requirements that I doubt can be met if EOF is a value that could be read from a stream.

For example, consider ungetc. On one hand, the specification says (§7.19.7.11):

The ungetc function pushes the character specified by c (converted to an unsigned char) back onto the input stream pointed to by stream. Pushed-back characters will be returned by subsequent reads on that stream in the reverse order of their pushing. [ ... ] One character of pushback is guaranteed.

On the other hand, it also says:

If the value of c equals that of the macro EOF, the operation fails and the input stream is unchanged.

So, if EOF is a value that could be read from the stream, and (for example) we do read from the stream, and immediately use ungetc to put EOF back into the stream, we get a conundrum: the call is "guaranteed" to succeed, but also explicitly required to fail.

Unless somebody can see a way to reconcile these requirements, I'm left with considerable doubt as to whether such an implementation can conform.
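The two ungetc requirements can at least be exercised on an ordinary implementation, where they do not conflict. This is a minimal sketch; `ungetc_rejects_eof` is a hypothetical name, and it assumes the stream is readable and positioned before at least one character.

```c
#include <stdio.h>

/* Returns 1 if ungetc(EOF, stream) fails and leaves the stream unchanged,
   0 if it does not, and -1 if the check could not be run. */
int ungetc_rejects_eof(FILE *stream)
{
    int first = fgetc(stream);
    if (first == EOF)
        return -1;              /* nothing usable was read */
    if (ungetc(first, stream) == EOF)
        return -1;              /* the one guaranteed pushback failed */
    if (ungetc(EOF, stream) != EOF)
        return 0;               /* pushing back EOF is required to fail */
    return fgetc(stream) == first;  /* earlier pushback must still be there */
}
```

On an implementation where a character read from the stream can compare equal to EOF, the first branch returns -1 for that character, which is the conundrum described above: the value cannot be pushed back.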

In case anybody cares, N1548 (the current draft of the new C standard) retains the same requirements.

Jerry Coffin
3

Would it not be sufficient if a nominal char which shared a bit pattern with EOF was defined as nonsensical? If, for instance, CHAR_BIT was 16 but all the allowed values occupied only the 15 least significant bits (assume a 2s-complement or sign-magnitude int representation). Or must everything representable in a char have meaning as such? I confess I don't know.

Sure, that would be a weird beast, but we're letting our imaginations go here, right?

R.. has convinced me that this won't hold together. Because a hosted implementation must implement stdio.h and if fwrite is to be able to stick integers on the disk, then fgetc could return any bit pattern that would fit in a char, and that must not interfere with returning EOF. QED.

dmckee --- ex-moderator kitten
2

I'm not so familiar with C99, but I don't see anything that says fgetc must produce the full range of values of char. The obvious way to implement stdio on such a system would be to put 8 bits in each char, regardless of its capacity. The requirement of EOF is

EOF

which expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream

The situation is analogous to wchar_t and wint_t. In 7.24.1/2-3 defining wint_t and WEOF, footnote 278 says

wchar_t and wint_t can be the same integer type.

which would seem to guarantee that "soft" range checking is sufficient to guarantee that *EOF is not in the character set.

Edit:

This wouldn't allow binary streams, since in such a case fputc and fgetc are required to perform no transformation. (7.19.2/3) Binary streams are not optional; only their distinctness from text streams is optional. So it would appear that this renders such an implementation noncompliant. It would still be perfectly usable, though, as long as you don't attempt to write binary data outside the 8-bit range.

Potatoswatter
  • You mean "8 **bits** in each char", right? In C, a byte has CHAR_BIT bits. And `wchar_t` has the same requirements as `char`. – schot Oct 05 '10 at 06:30
  • @Potatoswatter: Great compression scheme ;) I'm not sure if this 'fixes' it. I haven't found anything that forbids it yet. – schot Oct 05 '10 at 06:42
  • @schot: Well… it seems necessary to interoperability with files that aren't pre-padded. The alternative isn't actually any less dense; you need to address those ASCII characters somehow. – Potatoswatter Oct 05 '10 at 06:47
  • I mean more dense. Wow, I should quit for the evening. – Potatoswatter Oct 05 '10 at 06:56
  • If you only put 8 bits in each `char`, then `CHAR_BIT==8` and we're outside the domain of the question. Now it's very possible that someone using an implementation with `CHAR_BIT==64` would still only want to store 8 bits in each `char` when dealing with text data (in ASCII or UTF-8, for example), but this does not change the fact that `char` is an integer type capable of representing its entire range, and that `fgetc` and `fputc` work on binary data. – R.. GitHub STOP HELPING ICE Oct 05 '10 at 16:35
  • The value of `WEOF` is implementation-defined. The value of `EOF` is -1. – R.. GitHub STOP HELPING ICE Oct 05 '10 at 16:58
  • Actually both are implementation-defined, but `EOF` is required to be negative, while `WEOF`'s sign is not specified. – R.. GitHub STOP HELPING ICE Oct 06 '10 at 16:44
2

I think you are right. Such an implementation cannot distinguish a legitimate unsigned char value from EOF when using fgetc/fputc on binary streams.

If there are such implementations (this thread seems to suggest there are), they are not strictly conforming. It is possible to have a freestanding implementation with sizeof (int) == 1.

A freestanding implementation (C99 4) only needs to support the features from the standard library as specified in these headers: <float.h>, <iso646.h>, <limits.h>, <stdarg.h>, <stdbool.h>, <stddef.h>, and <stdint.h>. (Note no <stdio.h>). Freestanding might make more sense for a DSP or other embedded device anyway.
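A program can tell which kind of implementation it is being compiled for through the `__STDC_HOSTED__` macro that C99 predefines (6.10.8). A minimal sketch, with `is_hosted` as a hypothetical helper name:

```c
/* Returns 1 on a hosted implementation and 0 on a freestanding one.
   Pre-C99 compilers may not define the macro; this sketch treats
   that case as freestanding, which is an assumption. */
int is_hosted(void)
{
#if defined(__STDC_HOSTED__) && __STDC_HOSTED__
    return 1;
#else
    return 0;
#endif
}
```

Code that depends on `<stdio.h>` could be guarded the same way with `#if __STDC_HOSTED__`.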

schot
1

You are assuming that the EOF cannot be an actual character in the character set. If you allow this, then sizeof(int) == 1 is OK.

Šimon Tóth
  • Is this allowed? I seriously doubt it. Citation either way? – R.. GitHub STOP HELPING ICE Oct 05 '10 at 16:32
  • `EOF` cannot be a value of `unsigned char`, as its value is -1. However, what may be possible (it's unclear to me) is whether the standard allows an implementation to have some values of `unsigned char` which cannot be represented in `int` (their conversion to `int`, which it specifies happens, would then have undefined behavior). – R.. GitHub STOP HELPING ICE Oct 05 '10 at 16:44
  • @R For citation check the C standard. `EOF` definitely isn't defined as `-1`. Signed to unsigned conversion isn't defined per standard but we are talking about specific platform here. – Šimon Tóth Oct 06 '10 at 09:26
  • Signed to unsigned conversion is defined by the C standard. It's reduction modulo 2^N. Unsigned to signed conversion is defined by the standard when the value is representable; when it's not, the result is implementation-defined. Sorry about `EOF` and -1. It's a negative constant of type `int`, not necessarily -1, but that does not change the fact that it cannot be a value of `unsigned char`. – R.. GitHub STOP HELPING ICE Oct 06 '10 at 16:19
  • Just to revisit this answer, it appears to be correct because the assumption "EOF cannot be an actual character in the character set" *must* hold, per the above comments. The question is whether a value of `unsigned char`, outside the range of `int`, may alias `EOF` after the implementation-defined, overflow-oriented conversion. There's no reason it can't, and the accepted answer has more depth, but this deserves a vote. – Potatoswatter Jun 15 '15 at 02:23
1

The TI C55x compiler I am using has a 16-bit char and 16-bit int and does include a standard library. The library merely assumes an eight-bit character set, so that when interpreted as a character, a char of value > 255 is not defined; and when writing to an 8-bit stream device, the most significant 8 bits are discarded. For example, when written to the UART, only the lower 8 bits are transferred to the shift register and output.

Clifford
  • I'm pretty sure that is not a conformant hosted implementation. – R.. GitHub STOP HELPING ICE Oct 05 '10 at 23:23
  • No, but it could be made one. :-) – Prof. Falken Oct 02 '11 at 07:36
  • From TI's documents: [*On targets where `sizeof(char) == sizeof(int)` (C2700, C2800, C5400, C5500), you still can't reliably use the return value of `getc()` to check for end of file, because 0xffff will be mistaken for the end of file. Use `feof()` instead*](http://processors.wiki.ti.com/index.php/C89_Support_in_TI_Compilers#Misunderstandings_about_TI_C) – phuclv Jun 26 '18 at 02:16
  • @phuclv possibly so in the current documentation and library. Not necessarily so in 2011 when this answer was written, and the compiler I was using was old even then. – Clifford Jun 26 '18 at 07:02