
I found these excerpts in the C++ standard (quotations taken from N4687, but the wording has likely been there since forever):

[char.traits.typedefs]

For a certain character container type char_type, a related container type INT_T shall be a type or class which can represent all of the valid characters converted from the corresponding char_type values, as well as an end-of-file value, eof().

[char.traits.require]

Expression: X::eof()

Type: X::int_type

Returns: a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for all values c.

Expression: X::eq_int_type(e,f)

Type: bool

Returns: for all c and d, X::eq(c,d) is equal to X::eq_int_type(X::to_int_type(c), X::to_int_type(d)) (...)

c and d denote values of type CharT; (...); e and f denote values of type X::int_type

[char.traits.specializations.char]

using char_type = char;
using int_type = int;

[basic.fundamental]

Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types. (...) A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (...) For narrow character types, all bits of the object representation participate in the value representation. (...) For unsigned narrow character types, each possible bit pattern of the value representation represents a distinct number.

There are five standard signed integer types : “signed char”, “short int”, “int”, “long int”, and “long long int”. In this list, each type provides at least as much storage as those preceding it in the list.

I haven't found anything in the surrounding text preventing sizeof(int) == 1. This is obviously not the case on most modern platforms, where sizeof(int) is 4 or 8, but the possibility is explicitly used as an example, e.g. on cppreference:

Note: this allows the extreme case in which bytes are sized 64 bits, all types (including char) are 64 bits wide, and sizeof returns 1 for every type.

The question

If int were as large as char, the standard would not leave much room for any object representation of the former that compares unequal (via to_int_type) to all values of the latter, leaving only a few corner cases (like a negative zero existing in signed char but mapping to INT_MIN in int) that are unlikely to be implemented efficiently in hardware. Moreover, with P0907 it seems even signed char will not allow two different bit strings to represent equal values, thus forcing it to have 2^(bitsize) distinct values, and int as well, closing every possible loophole.
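For concreteness, the requirement at stake can be checked mechanically. Here is a minimal sketch of such a check (my own illustration, not from the standard); on an ordinary platform with sizeof(int) > 1 it passes trivially, and the question is how it could pass when sizeof(int) == 1:

```cpp
#include <string>    // std::char_traits
#include <climits>   // CHAR_MIN, CHAR_MAX
#include <iostream>

int main() {
    using X = std::char_traits<char>;
    // [char.traits.require]: X::eof() must compare unequal, via
    // eq_int_type, to X::to_int_type(c) for every char value c.
    for (int i = CHAR_MIN; i <= CHAR_MAX; ++i) {
        const char c = static_cast<char>(i);
        if (X::eq_int_type(X::eof(), X::to_int_type(c)))
            std::cout << "requirement violated for c == " << i << '\n';
    }
}
```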

How, on such a platform, would one conform to the requirements of std::char_traits<char>? Do we have a real-world example of such a platform and the corresponding implementation?

The Vee
  • `INT_T` is not necessarily an `int`; it is an arbitrary type that can hold any char plus EOF. – user7860670 Oct 03 '19 at 08:52
  • It wouldn't be `sizeof(int)` that would make a difference, at least not directly. It would be whether the implementation provides an integral type that can represent all the values that a `char` can, plus an end-of-file value. There is no requirement that two types of the same size can represent the same sets of values. – Peter Oct 03 '19 at 08:53
  • @VTT: it is `int`: http://eel.is/c++draft/char.traits.specializations.char – geza Oct 03 '19 at 08:55
  • @Peter: In general I agree, of course, but based on the quotations an n-bit `unsigned char` (and `signed` as well, in C++20) must allow a distinct value for each possible bit pattern, and thus each possible object representation of that size. If `int` has the same size, there's just none left, mathematically there's no way it could have more than 2^(bitsize) values. (And it needs at least one value per each possible `char` to satisfy that `eq_int_type` does not give false positives.) – The Vee Oct 03 '19 at 08:59
  • The minimal range for `int` is `-32767` to `32767`, so with `sizeof(char) == 1 == sizeof(int)`, we just have to limit `char` to not use the full range of `int`. – Jarod42 Oct 03 '19 at 08:59
  • @Jarod42 I think that's not allowed. "For narrow character types, all bits of the object representation participate in the value representation." – The Vee Oct 03 '19 at 09:03
  • So 15 bits for `char`, and 16 for `int`, seems legal. (IIRC, there are platforms for which the number of bits in `char` is 7.) – Jarod42 Oct 03 '19 at 09:05
  • @Jarod42: then how could you check the underlying representation of an `int` with a `char`? – geza Oct 03 '19 at 09:11
  • @Jarod42 - that would surprise me, as the standard requires that a `signed char` can represent at least the range `-127` to `127` and an `unsigned char` can represent a range from `0` to at least `255`. Bit hard to do either with a 7-bit representation. – Peter Oct 03 '19 at 09:11
  • @Jarod42 Wouldn't you have a very serious problem with [basic.types] paragraph 2 if a type of `sizeof==k` wasn't covered by `k` `char`s? – The Vee Oct 03 '19 at 09:12
  • @Peter An `int` can't be 8 bit, it's required to cover the range mentioned by Jarod42. In my scenario a `char` would have to be at least 16 bit, then. – The Vee Oct 03 '19 at 09:14
  • @TheVee - I did not suggest that an `int` can be represented using 8-bit. You're correcting me on an assertion I did not make. I was responding to the suggestion that a `char` can be 7 bit. – Peter Oct 03 '19 at 09:24
  • @Peter I'm sorry, my mistake. – The Vee Oct 03 '19 at 09:31
  • @Peter: The second answer to [what-is-char-bit](https://stackoverflow.com/questions/3200954/what-is-char-bit) seems to say that some "old" machines had 7-bit bytes. (But indeed, C++ now requires at least 8.) – Jarod42 Oct 03 '19 at 09:42
  • @Jarod42 - All C standards, from the first (i.e. C89/C90), have required `CHAR_BIT` to be `8` or more. C++, from the start, was designed for backward compatibility with standard C (C89/C90), so has always had `char` of 8 bits or more. I have heard mention of some pre-standard C (i.e. K&R) compilers that had a 7-bit `char` (which is consistent with the base ASCII character set, which only has values between `0` and `127`) but never used such a beast. – Peter Oct 03 '19 at 12:34
  • Does it say that all char values should be considered a valid character? Note 224 specifically says that EOF may actually be a valid char value. – L. F. Oct 03 '19 at 12:50
  • @L.F. I noticed that note but I am baffled by it. I only took it as a warning. Nevertheless, `std::char_traits::eof()` is `EOF` which is a macro for `-1` in GCC and that is a value that can be held in `char` so I guess the situation is quite real. However, the former is an `int` value, rather than `char`, so should only be compared after conversion via the provided functions. If compared via `to_int_type` they are -1 and 255, respectively, so the comparison is `false` and no problem here. (If compared via `to_char_type` the result is unspecified.) – The Vee Oct 03 '19 at 14:12
  • Looks like a defect in the standard, in that the definition in char.traits.require is wrong. char.traits.typedef says "all of the valid characters converted from the corresponding `char_type` values, as well as an end-of-file value, `eof()`". The simple, obvious and practical way out for a conforming implementation is to make sure that not all `char_type` values are used to represent characters -- "valid character" and "value of type `char`" are not the same thing. It's char.traits.require that tries to take a shortcut by dropping the connection to actual characters. – Jeroen Mostert Oct 04 '19 at 12:40
  • @JeroenMostert That must be it. Would you care for expanding the comment into an answer so the question doesn't stay unanswered? Also, if so, how does one go about reporting suspected defects? – The Vee Oct 06 '19 at 07:22
  • I don't actually know how the C++ standards committee goes about its business -- it's even been a long time since I've read the C standard, so I'm hesitant to make definitive statements. What I know is that when I looked into implementations where `sizeof([all]) == 1` (out of sheer curiosity) the C standard didn't have the equivalent of the char.traits.require definition. There are real systems that have `sizeof([all]) == 1`, but most are embedded (DSPs and the like) so exactly conforming to the standard is not a big issue, nor is reading files. (And of course most just use C.) – Jeroen Mostert Oct 06 '19 at 09:36

1 Answer


Suppose, for example, that we had a platform where char is signed and 32 bits long, and int is as well. It is possible to satisfy all the requirements using the following definitions, where X is std::char_traits<char>:

  • X::eq and X::eq_int_type are simple equality comparisons;
  • X::to_char_type returns the value of its argument;
  • X::eof returns -1, and
  • X::to_int_type(c) is c, unless c is -1, in which case it's -2.

The mapping of -1 onto -2 guarantees that X::eq_int_type(X::eof(), X::to_int_type(c)) is false for all c, which is the requirement on X::eof according to C++20 Table 69.

This would probably correspond to an implementation where -1 and -2 (and maybe even all negative numbers), are "invalid" character values, i.e., they're completely legal to store in a char, but reading from a file will never yield a byte with such a value. Of course, nothing would stop you from writing a custom stream buffer that yields such "invalid" values as long as you're willing to accept the fact that it will not be possible to distinguish between "the next character is -1" and "we are at the end of the stream".
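A minimal sketch of these definitions (hypothetical code: the standalone struct `X` stands in for std::char_traits<char> so the sketch compiles on ordinary platforms, and the fold of -1 onto -2 assumes plain char is signed, as in the scenario above):

```cpp
#include <iostream>

// Hypothetical traits for the scenario above: plain char is signed and
// both char and int are 32 bits wide.
struct X {
    using char_type = char;
    using int_type  = int;

    static bool eq(char_type a, char_type b)        { return a == b; }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }

    static int_type eof() { return -1; }

    // Fold the char value -1 onto -2, so that eof() compares unequal
    // to to_int_type(c) for every char value c.
    static int_type to_int_type(char_type c) {
        int_type i = c;
        return i == -1 ? -2 : i;
    }

    // Identity. For e == eof() there is no preimage, so the result is
    // unspecified; for e == -2 the preimage is ambiguous (-1 or -2),
    // which is the issue discussed below.
    static char_type to_char_type(int_type e) {
        return static_cast<char_type>(e);
    }
};

int main() {
    // eof() compares unequal to every converted char value, even -1:
    std::cout << X::eq_int_type(X::eof(), X::to_int_type(static_cast<char>(-1)))
              << '\n';  // 0 (false)
    // ...but the fold makes to_int_type non-injective:
    std::cout << X::to_int_type(static_cast<char>(-1)) << ' '
              << X::to_int_type(static_cast<char>(-2)) << '\n';  // -2 -2 (with signed char)
}
```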

The only possible issue with this implementation is that the requirement on X::to_char_type(e) is that it equals

if for some c, X::eq_int_type(e,X::to_int_type(c)) is true, c; else some unspecified value.

This could be read as implying that if any such c exists, then it is unique. That would be violated when e is -2, because c here could be either -1 or -2.

If we assume that uniqueness is required, then I don't think there's any possible solution.

Brian Bi