Do the standards guarantee that the casts to wint_t and to wchar_t in the following two programs are correct?

#include <locale.h>
#include <wchar.h>
int main(void)
{
  setlocale(LC_CTYPE, "");
  wint_t wc;
  wc = getwchar();
  putwchar((wchar_t) wc);
}

--

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "");
  wchar_t wc;
  wc = L'ÿ';
  if (iswlower((wint_t) wc)) return 0;
  return 1;
}

Consider the case where wchar_t is signed short (this hypothetical implementation is limited to the BMP), wint_t is signed int, and WEOF == ((wint_t)-1). Then (wint_t)U+FFFF is indistinguishable from WEOF. Yes, U+FFFF is a noncharacter, but it is still wrong for it to collide with WEOF.

I would not want to swear that this never happens in real life without an exhaustive audit of existing implementations.

See also "May wchar_t be promoted to wint_t?"

Igor Liferenko

1 Answer

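In the environment you describe, wchar_t cannot accurately represent the BMP: the value of L'\uFEFF' (0xFEFF) exceeds the range of a 16-bit signed wchar_t. It fits only in the unsigned type corresponding to wchar_t, which is the range C11 6.4.4.4p9 (Character constants) allows for octal and hexadecimal escape sequences in wide character constants. Storing it in a wchar_t defined as signed short, assuming 16-bit shorts, changes its value.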

On the other hand, if the charset used for the source code is Unicode and the compiler is properly configured to parse its encoding correctly, L'ÿ' has the value 255, which is representable in wchar_t whether that type is signed or unsigned, so the code in the second example is perfectly defined and unambiguous.

If int is 32 bits wide and short is 16 bits wide, it seems much more consistent to define wchar_t as either int or unsigned short. WEOF can then be defined as (-1), a value different from all values of wchar_t or at least from all values representing Unicode code points.

chqrlie
  • But converting `(-1)` to an `unsigned short` gives 0xFFFF, which would be a valid character in Unicode, or am I forgetting something? – Kami Kaze Nov 23 '16 at 08:41
  • You seem to have confused UTF-8 and code points in your middle paragraph: in UTF-8 the octet 255 is not valid, and `L'ÿ'` is encoded in UTF-8 as two octets, C3 BF. – Pete Kirkham Nov 23 '16 at 08:55
  • @PeteKirkham: I rephrased the answer for clarity. The charset and the file encoding are two different matters; we are not concerned about the encoding, as long as it is correct and properly configured. – chqrlie Nov 23 '16 at 11:35
  • @KamiKaze: `0xFFFF` is actually an invalid Unicode code-point, but if `wchar_t` is unsigned, it would be a valid `wchar_t` value, and it would make sense that it be different from `WEOF`, which can be achieved by defining `WEOF` as `(-1)` and `wint_t` as `int`. – chqrlie Nov 23 '16 at 11:38