It seems everyone assumes wint_t is at least as large as wchar_t. However, the C standard allows the range of wchar_t to include values that do not correspond to any character in the extended character set:

The values WCHAR_MIN and WCHAR_MAX do not necessarily correspond to members of the extended character set.

and:

wchar_t , which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales; the null character shall have the code value zero.

and wint_t is only required to be able to hold the values of members of the extended character set, plus at least one additional value for WEOF:

wint_t , which is an integer type unchanged by default argument promotions that can hold any value corresponding to members of the extended character set, as well as at least one value that does not correspond to any member of the extended character set (see WEOF below);

The requirement that wint_t be unchanged by default argument promotions also does not imply that wint_t is larger than wchar_t, since wchar_t may itself be large enough to be unchanged by default argument promotions.

So in some imaginary implementation, wchar_t might be defined large enough to hold many unnecessary values that are not in the extended character set, and large enough to be unaffected by default argument promotions. That implementation may then choose not to include those values in wint_t. This would allow wchar_t to be larger than wint_t.

According to the standard, wchar_t must be at least 1 byte and wint_t at least 2 bytes (assuming 8-bit bytes).

Also, in Microsoft Visual Studio wint_t is a typedef for unsigned short. How does this satisfy the requirement of being unchanged by default argument promotions? I thought C allowed a 2-byte wint_t only because int may be 2 bytes in some implementations.

phuclv
cryptain
  • This [note](https://port70.net/~nsz/c/c11/n1570.html#note327) seems relevant. – KamilCuk Mar 30 '20 at 23:22
  • @KamilCuk, wchar_t and wint_t may be same integer type but that does not mean they have to be. though in usual implementation which use UTF encodings they are. my concern is about c in general specially unusual implementations. – cryptain Mar 30 '20 at 23:30
  • https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html `in the GNU C Library wchar_t is always 32 bits wide` – KamilCuk Mar 30 '20 at 23:33
  • gcc uses UTF-32 so wchar_t is 4 bytes. – cryptain Mar 30 '20 at 23:35
  • In the GNU C library, `wchar_t` is a `typedef int` and `wint_t` is a `typedef unsigned int`. – Aplet123 Mar 31 '20 at 00:33
  • @Aplet123 , so they are different types in GNU. GNU's unsigned int will remain unchanged by default argument promotions however Visual studio's unsigned short does not ! i wonder then is Visual studio violates c standard ? – cryptain Mar 31 '20 at 00:45
  • `wint_t` cannot be a typedef for `unsigned short` in Standard C , if that is narrower than `int` – M.M Mar 31 '20 at 01:56
  • MS has never attempted to conform to any C standard beyond C90 though – M.M Mar 31 '20 at 01:57
  • This question seems to answer itself – M.M Mar 31 '20 at 02:01
  • @M.M thanks. what about first part do wchar_t can be larger than wint_t ? – cryptain Mar 31 '20 at 02:08
  • well you quoted the requirements yourself, there is nothing specifying one must be larger than the other – M.M Mar 31 '20 at 02:13
  • @M.M actually yes, because 0xFFFF is not a valid Unicode code point so it's outside the extended character set and can be used for WEOF – phuclv Aug 25 '20 at 03:35
  • @KamilCuk `in the GNU C Library wchar_t is always 32 bits wide` that's not true. If you set [`-fwide-exec-charset=UTF-16`](https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html) then wchar_t will be a 2-byte type in GCC – phuclv Aug 25 '20 at 12:04
  • @phuciv `-fwide-exec-charset` doesn't change the size of `wchar_t`. It changes the representation, not size of type. [aaaand godbolt link](https://godbolt.org/z/hca585) – KamilCuk Aug 25 '20 at 12:05

1 Answer

wint_t is to wchar_t what int is to char, so an implementation where sizeof(wchar_t) == sizeof(wint_t) is completely legal, just as implementations where sizeof(int) == sizeof(char) are allowed. In fact the char case is even worse, because getc, fgetc, etc. are required to return int, whereas wint_t can simply be typedef'd as a wider type if necessary. The standard even permits it explicitly:

Footnote 327) wchar_t and wint_t can be the same integer type.

http://www.iso-9899.info/n1570.html#7.29.1

The standard also says that "The values WCHAR_MIN and WCHAR_MAX do not necessarily correspond to members of the extended character set", and there's nothing wrong with that. The extended character set range can be smaller than the wchar_t range, because the same happens with char. For example, if the basic character set is ASCII then it uses only half of the available range (or much less if CHAR_BIT > 8). wint_t is

... an integer type unchanged by default argument promotions that can hold any value corresponding to members of the extended character set, as well as at least one value that does not correspond to any member of the extended character set (see WEOF below);

http://www.iso-9899.info/n1570.html#6.3.1.3

so presumably it may even be smaller than wchar_t if the extended character set is much smaller than the range of wchar_t. Since 0xFFFF is guaranteed not to be a Unicode character at all, using it for WEOF is completely valid, although it's a little bit weird IMHO and I don't know why MS did that.

If sizeof(wchar_t) == sizeof(wint_t) or sizeof(int) == sizeof(char), then there are also values that char and wchar_t can represent but int and wint_t can't, in case char/wchar_t is unsigned. In that case the conversion between them is implementation-defined. That won't be an issue if you're working on text files, although it'll cause problems if you're reading binary files. Anyway, in that case, for portability you need to explicitly test for EOF and errors yourself:

    int c;
    /* An EOF return value only ends the loop if feof() or ferror() confirms it */
    while ((c = /* fgetwc(in) */ fgetc(in)) != EOF || (!feof(in) && !ferror(in)))
        fputc(c, out);

This is the same as what TI suggests:

On targets where sizeof(char)==sizeof(int) (C2700, C2800, C5400, C5500), you still can't reliably use the return value of getc() to check for end of file, because 0xffff will be mistaken for the end of file. Use feof() instead.

CMU's FIO34-C. Distinguish between characters read from a file and EOF or WEOF also says:

Because EOF is negative, it should not match any unsigned character value. However, this is only true for implementations where the int type is wider than char. On an implementation where int and char have the same width, a character-reading function can read and return a valid character that has the same bit-pattern as EOF. This could occur, for example, if an attacker inserted a value that looked like EOF into the file or data stream to alter the behavior of the program.

The C Standard requires only that the int type be able to represent a maximum value of +32767 and that a char type be no larger than an int. Although uncommon, this situation can result in the integer constant expression EOF being indistinguishable from a valid character; that is, (int)(unsigned char)65535 == -1. Consequently, failing to use feof() and ferror() to detect end-of-file and file errors can result in incorrectly identifying the EOF character on rare implementations where sizeof(int) == sizeof(char).

This problem is much more common when reading wide characters. The fgetwc(), getwc(), and getwchar() functions return a value of type wint_t. This value can represent the next wide character read, or it can represent WEOF, which indicates end-of-file for wide character streams. On most implementations, the wchar_t type has the same width as wint_t, and these functions can return a character indistinguishable from WEOF.

In the UTF-16 character set, 0xFFFF is guaranteed not to be a character, which allows WEOF to be represented as the value -1. Similarly, all UTF-32 characters are positive when viewed as a signed 32-bit integer. All widely used character sets are designed with at least one value that does not represent a character. Consequently, it would require a custom character set designed without consideration of the C programming language for this problem to occur with wide characters or with ordinary characters that are as wide as int.

phuclv