
wchar_t is defined in wchar.h

Currently, developers who want to use only wchar_t cannot do so without getting type conversion warnings from the compiler. If wchar_t were made the same type as wint_t, both groups would benefit. Developers who want both wint_t and wchar_t in their programs (for example, so that the code also compiles outside glibc) could keep doing so without compiler warnings. Developers who want to use only wchar_t (to avoid the hassle of wint_t and explicit casts) could also do so without warnings. This would not introduce any incompatibility or portability problems, with one exception: if code that uses only wchar_t is compiled on a machine with the original wchar.h, the compiler will print those pesky warnings (if -Wconversion is enabled), but the compiled program will behave exactly the same.
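
To make the warning concrete, here is a minimal sketch, assuming glibc on Linux, where wchar_t is a signed 32-bit int and wint_t is unsigned int. The helper name to_wchar is hypothetical and not part of any library; it only shows the explicit cast that code mixing the two types currently needs.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper (not part of glibc): split a wint_t result, such as
   the return value of fgetwc(), into "end of input" and "a character".
   Returns 0 for WEOF; otherwise stores the character in *out and returns 1. */
int to_wchar(wint_t r, wchar_t *out)
{
    assert(out != NULL);
    if (r == WEOF)
        return 0;
    /* Without this cast, gcc -Wconversion may warn on glibc, because
       wint_t is unsigned int there while wchar_t is int. */
    *out = (wchar_t)r;
    return 1;
}
```

A caller would typically pass the result of fgetwc() or getwchar() as r and stop reading when the function returns 0.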

The C standard (9899:201x 7.29) mentions:

wchar_t and wint_t can be the same integer type.

Also, in glibc wide characters are always ISO10646/Unicode/UCS-4, so they always use 4 bytes. Thus, nothing prevents making wchar_t the same type as wint_t in glibc.

However, the glibc developers apparently do not want to make wint_t and wchar_t the same type, for some reason. So I want to change my local copy of wchar.h.

ISO10646/Unicode/UCS-4 uses 2^31 values for the extended character set (MSB being unused):

0xxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx

Notice that a 4-byte type can hold 2^31 extra values (MSB being "1"):

1xxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx

Any of those extra values can be used to represent WEOF, thus one 4-byte type can be used to hold all the character set and WEOF.

Notice that no recompilation of glibc is necessary to use the modified wchar.h, because wint_t can be either signed or unsigned: both -1 and 0xffffffff have the MSB set to "1" in any representation, and the MSB is not used in ISO10646/Unicode/UCS-4.
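
The bit argument above can be checked with a small sketch. It assumes a 32-bit wide character type, as in glibc; the function names and the UCS4_MAX macro are made up for this illustration.

```c
#include <stdint.h>

/* Largest value of the 31-bit UCS-4 range described above
   (illustrative name, not a standard macro). */
#define UCS4_MAX UINT32_C(0x7FFFFFFF)

/* Nonzero if the most significant bit of a 32-bit pattern is set. */
int msb_is_set(uint32_t bits)
{
    return (bits >> 31) != 0;
}

/* Nonzero if a 32-bit pattern lies outside the 31-bit character range,
   which leaves it free to serve as an end-of-file marker. */
int usable_as_weof(uint32_t bits)
{
    return bits > UCS4_MAX;
}

/* View a signed 32-bit value as an unsigned bit pattern.
   Conversion to unsigned is well defined in C (modulo 2^32). */
uint32_t as_unsigned(int32_t v)
{
    return (uint32_t)v;
}
```

With these helpers, both as_unsigned(-1) and 0xffffffff report the MSB as set, while no value up to UCS4_MAX does.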

The definition of wchar_t is made somewhere in the following excerpt from wchar.h. How can I change it to make wchar_t the same type as wint_t?

#ifndef _WCHAR_H

#if !defined __need_mbstate_t && !defined __need_wint_t
# define _WCHAR_H 1
# include <features.h>
#endif

#ifdef _WCHAR_H
/* Get FILE definition.  */
# define __need___FILE
# if defined __USE_UNIX98 || defined __USE_XOPEN2K
#  define __need_FILE
# endif
# include <stdio.h>
/* Get va_list definition.  */
# define __need___va_list
# include <stdarg.h>

# include <bits/wchar.h>

/* Get size_t, wchar_t, wint_t and NULL from <stddef.h>.  */
# define __need_size_t
# define __need_wchar_t
# define __need_NULL
#endif
#if defined _WCHAR_H || defined __need_wint_t || !defined __WINT_TYPE__
# undef __need_wint_t
# define __need_wint_t
# include <stddef.h>

/* We try to get wint_t from <stddef.h>, but not all GCC versions define it
   there.  So define it ourselves if it remains undefined.  */
# ifndef _WINT_T
/* Integral type unchanged by default argument promotions that can
   hold any value corresponding to members of the extended character
   set, as well as at least one value that does not correspond to any
   member of the extended character set.  */
#  define _WINT_T
typedef unsigned int wint_t;
# else
/* Work around problems with the <stddef.h> file which doesn't put
   wint_t in the std namespace.  */
#  if defined __cplusplus && defined _GLIBCPP_USE_NAMESPACES \
      && defined __WINT_TYPE__
__BEGIN_NAMESPACE_STD
typedef __WINT_TYPE__ wint_t;
__END_NAMESPACE_STD
#  endif
# endif

/* Tell the caller that we provide correct C++ prototypes.  */
# if defined __cplusplus && __GNUC_PREREQ (4, 4)
#  define __CORRECT_ISO_CPP_WCHAR_H_PROTO
# endif
#endif

#if (defined _WCHAR_H || defined __need_mbstate_t) && !defined ____mbstate_t_defined
# define ____mbstate_t_defined  1
/* Conversion state information.  */
typedef struct
{
  int __count;
  union
  {
# ifdef __WINT_TYPE__
    __WINT_TYPE__ __wch;
# else
    wint_t __wch;
# endif
    char __wchb[4];
  } __value;        /* Value so far.  */
} __mbstate_t;
#endif
#undef __need_mbstate_t


/* The rest of the file is only used if used if __need_mbstate_t is not
   defined.  */
#ifdef _WCHAR_H

# ifndef __mbstate_t_defined
__BEGIN_NAMESPACE_C99
/* Public type.  */
typedef __mbstate_t mbstate_t;
__END_NAMESPACE_C99
#  define __mbstate_t_defined 1
# endif

#ifdef __USE_GNU
__USING_NAMESPACE_C99(mbstate_t)
#endif

#ifndef WCHAR_MIN
/* These constants might also be defined in <inttypes.h>.  */
# define WCHAR_MIN __WCHAR_MIN
# define WCHAR_MAX __WCHAR_MAX
#endif

#ifndef WEOF
# define WEOF (0xffffffffu)
#endif

/* For XPG4 compliance we have to define the stuff from <wctype.h> here
   as well.  */
#if defined __USE_XOPEN && !defined __USE_UNIX98
# include <wctype.h>
#endif


__BEGIN_DECLS

__BEGIN_NAMESPACE_STD
/* This incomplete type is defined in <time.h> but needed here because
   of `wcsftime'.  */
struct tm;
__END_NAMESPACE_STD
/* XXX We have to clean this up at some point.  Since tm is in the std
   namespace but wcsftime is in __c99 the type wouldn't be found
   without inserting it in the global namespace.  */
__USING_NAMESPACE_STD(tm)
Igor Liferenko
  • BTW, it would be nice if the glibc developers documented what `__need_wint_t`, `__need_mbstate_t`, `_WINT_T`, `__WINT_TYPE__`, etc. are used for. I can't make head or tail of this cryptic code. – Igor Liferenko Nov 21 '16 at 05:29

2 Answers


Note that wint_t was introduced because wchar_t might be a type subject to 'default promotion' rules when passed to printf() et al. This matters, for example, when calling printf():

wchar_t wc = …;
printf("%lc", wc);

The value of wc might be converted to wint_t. If you're writing a function like printf() which needs to use the va_arg() macro from <stdarg.h>, then you should use the type wint_t to get the value.
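
As a rough sketch of that advice, the following variadic function reads its wide-character arguments with va_arg(ap, wint_t). The function name collect_wide is hypothetical, and the sketch assumes a wint_t that really is unchanged by the default argument promotions (as on glibc); callers cast each argument to wint_t so the types match exactly.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: copy `count` wide characters passed as variadic
   arguments into buf and terminate it. buf must have room for
   count + 1 elements. Returns the number of characters stored. */
size_t collect_wide(wchar_t *buf, size_t count, ...)
{
    va_list ap;
    size_t i;

    assert(buf != NULL);
    va_start(ap, count);
    for (i = 0; i < count; i++) {
        /* Fetch as wint_t, the type unchanged by default promotions,
           rather than as wchar_t, which might be promoted. */
        wint_t w = va_arg(ap, wint_t);
        buf[i] = (wchar_t)w;
    }
    va_end(ap);
    buf[count] = L'\0';
    return count;
}
```

A call would look like collect_wide(buf, 2, (wint_t)L'a', (wint_t)L'b').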

The standard notes that wint_t might be the same type as wchar_t, but if wchar_t is a (16-bit) short (or unsigned short), wint_t might be (32-bit) int. To a first approximation, wint_t only matters when wchar_t is a 16-bit type. The full rules are, of course, more complex. For example, int could be a 16-bit type — but this is rarely a problem.

ISO/IEC 9899:2011

7.29 Extended multibyte and wide character utilities <wchar.h>

7.29.1 Introduction

¶1 The header <wchar.h> defines four macros, and declares four data types, one tag, and many functions.326)

¶2 The types declared are wchar_t and size_t (both described in 7.19);

mbstate_t

which is a complete object type other than an array type that can hold the conversion state information necessary to convert between sequences of multibyte characters and wide characters;

wint_t

which is an integer type unchanged by default argument promotions that can hold any value corresponding to members of the extended character set, as well as at least one value that does not correspond to any member of the extended character set (see WEOF below);327)

326) See ‘‘future library directions’’ (7.31.16).
327) wchar_t and wint_t can be the same integer type.

§7.19 Common definitions <stddef.h>

¶2 … and

wchar_t

which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales; the null character shall have the code value zero. Each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant if an implementation does not define __STDC_MB_MIGHT_NEQ_WC__.

See "Why the argument type of putchar(), fputc(), and putc() is not char" for one place where the 'default promotion' rules from the C standard are quoted. There are probably other questions where the information is available too.

Jonathan Leffler
  • So, to sum it up, `wint_t` is **only** relevant when `wchar_t` is 16-bit data type? Can we use some pragma directly in the code if we do not want to maintain this backward compatibility with OS Windows (which is the only one to use 16-bit `wint_t`) and always use `wchar_t`? (but avoid type conversion warnings) – Igor Liferenko Nov 21 '16 at 06:01
  • Regarding "`wint_t` is only relevant when `wchar_t` is a 16-bit data type", the answer is more or less — I've mildly updated the answer to discuss that point. I'm sorry, but I don't have enough experience with (current) Windows and (current) Windows compilers to know whether there's a way to deal with the problems you outline in your comment. – Jonathan Leffler Nov 21 '16 at 06:14

If we need to avoid type conversion warnings when the -Wconversion compiler option is used, we need to change wint_t to wchar_t in the prototypes of all library functions, and put '#define WEOF (-1)' at the beginning of wchar.h and wctype.h.

For wchar.h the command is:

sudo perl -i -pe 'print qq(#define WEOF (-1)\n) if $.==1; next unless /Copy SRC to DEST\./..eof; s/\bwint_t\b/wchar_t/g' /usr/include/wchar.h

For wctype.h the command is:

sudo perl -i -pe 'print qq(#define WEOF (-1)\n) if $.==1; next unless /Wide-character classification functions/..eof; s/\bwint_t\b/wchar_t/g' /usr/include/wctype.h

Similarly, if you use other header files which use wint_t, simply change wint_t to wchar_t in the prototypes in those header files.

Explanation follows.

Some Unix systems define wchar_t as a 16-bit type and thereby follow Unicode very strictly. This definition is perfectly fine with the standard, but it also means that to represent all characters from Unicode and ISO 10646 one has to use UTF-16 surrogate characters, which is in fact a multi-wide-character encoding. But resorting to multi-wide-character encoding contradicts the purpose of the wchar_t type.

Now, the only encoding to survive for data exchange is UTF-8, and in its original six-byte form the maximum number of data bits that it can hold is 31:

1111110x    10xxxxxx    10xxxxxx    10xxxxxx    10xxxxxx    10xxxxxx
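
The bit count can be worked out per sequence length. The sketch below assumes the original six-byte UTF-8 scheme shown above (current Unicode and RFC 3629 stop at four bytes and U+10FFFF); the function name utf8_len_31bit is made up for this example.

```c
#include <stdint.h>

/* Number of bytes the original six-byte UTF-8 scheme needs for a code
   value, or 0 if the value does not fit in 31 bits. */
int utf8_len_31bit(uint32_t cp)
{
    if (cp < UINT32_C(0x80))       return 1;  /*  7 data bits */
    if (cp < UINT32_C(0x800))      return 2;  /* 11 data bits */
    if (cp < UINT32_C(0x10000))    return 3;  /* 16 data bits */
    if (cp < UINT32_C(0x200000))   return 4;  /* 21 data bits */
    if (cp < UINT32_C(0x4000000))  return 5;  /* 26 data bits */
    if (cp < UINT32_C(0x80000000)) return 6;  /* 31 data bits */
    return 0;                                 /* MSB set: not encodable */
}
```

The six-byte case carries 1 bit in the lead byte plus 5 × 6 bits in the continuation bytes, which gives the 31 bits mentioned above.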

So, you see that in practice it is not necessary to have wint_t as a separate type, because 4-byte (i.e., 32-bit) data types are used to store Unicode code points anyway. It may have some use for "backward compatibility", but in new code it is pointless. Once again, a 16-bit wchar_t defeats the purpose of having wide characters at all, and wide characters that cannot hold everything UTF-8 can encode make little sense nowadays.

Notice that, de facto, wint_t is not used anyway. For example, see the example in man mbstowcs: there a variable of type wchar_t is passed to iswlower() and other functions from wctype.h, which take wint_t.
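
The pattern looks roughly like the sketch below, which is not the man page's code; count_lower is a hypothetical name. With the unmodified glibc headers, the explicit cast to wint_t is what keeps -Wconversion quiet; with the modified headers described above, the cast would be unnecessary.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the lowercase characters in a wide string. */
size_t count_lower(const wchar_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != L'\0'; s++) {
        /* iswlower() takes wint_t; each element of s is a wchar_t. */
        if (iswlower((wint_t)*s))
            n++;
    }
    return n;
}
```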

Igor Liferenko
  • Note that although the coding scheme underlying UTF-8 could be extended to handle up to 31 bits (or even further under some not unreasonable assumptions), in practice, the Unicode Consortium has defined that Unicode is a 21-bit code set with the maximum code point U+10FFFF, which in turn means that 4 bytes are sufficient to encode any Unicode code point in UTF-8. The reason for the choice is that the UTF-16 surrogate pairs can encode up to U+10FFFF and no further. See the [UTF-8, UTF-16, UTF-32 and BOM FAQ](http://www.unicode.org/faq/utf_bom.html) for more information. – Jonathan Leffler Nov 21 '16 at 06:22
  • @JonathanLeffler All right, but 4 bytes are needed for the data type to store that 21-bit code point internally, because a 3-byte data type is not practical. Even if it were, there would still be space for WEOF. So, anyway, having `wint_t` gains nothing, but creates inconveniences. There *should* be some compiler directive to make `wint_t` the same type as `wchar_t`, and I'm trying to find out what it is. But I cannot comprehend the wchar.h code... – Igor Liferenko Nov 21 '16 at 06:29
  • Yes, `wint_t` is a nuisance, but it is logically necessary to allow `printf()` to be written using Standard C. You're right that no modern machines have 24-bit integer types — historically, I believe there were some, but they pre-date my use of computers, so to store Unicode code points, you use a 32-bit integer and UTF-32 (and it doesn't matter much whether that's signed or not since the values will always be positive). – Jonathan Leffler Nov 21 '16 at 06:31
  • @JonathanLeffler The thing which completely escapes me is why the glibc developers gave `wint_t` and `wchar_t` different signedness. If they were both signed or both unsigned, there would not be any warnings and everyone would be happy. Also, your point about `printf()` is still not clear to me. I feel that I'm close to understanding this mess... Maybe it is worth asking a separate question like "how to implement printf() from scratch?" – Igor Liferenko Nov 21 '16 at 06:38
  • For the rationale of `wint_t` and `wchar_t` having different signedness, you'd have to consult with the designers of the GNU C Library (glibc). While `EOF` is required to be negative (but not necessarily `-1`, though that is by far the most common value), there isn't a similar requirement on `WEOF`. I've not gone through all the steps needed to support wide characters in `printf()`, so I'm not completely clear on all the ins and outs myself. There are some denizens on SO who have implemented such code, I believe; maybe you can manage to get them to notice your question. – Jonathan Leffler Nov 21 '16 at 06:44
  • I'm not sure whether any of the information in [Prefix L changed the original bytes and doesn't follow UTF-8 any more](http://stackoverflow.com/questions/40477009/) will help. Probably not. Similar comments apply to [How does `printf()` work with `%lc`?](http://stackoverflow.com/questions/40700031/c-how-work-printf-with-lc) – Jonathan Leffler Nov 21 '16 at 06:47
  • @JonathanLeffler BTW, do you know why `EOF` is required to be `-1`? Why they did not make it, say, `65535`? – Igor Liferenko Nov 21 '16 at 07:05
  • `EOF` is not required by the C standard to be `-1`, but it is required to be negative. It's required to be negative so that you can tell the difference between a valid input from `getchar()` and friends by testing for the sign of the `int` (NB: `int`! I know you know) returned. As to why `-1`, that is a useful value for the `isalpha()` etc functions/macros in `<ctype.h>`. Those are required to accept any `unsigned char` value plus `EOF`. If `EOF` is `-1`, then you can use an array of 257 entries to characterize 256 valid characters and EOF, which simplifies the macro implementation. – Jonathan Leffler Nov 21 '16 at 07:11
  • @JonathanLeffler then why "testing for the sign of the `int`" is called "sloppy code" in glibc reference? (http://www.sbin.org/doc/glibc/libc_6.html) I mean, what is wrong with `xxx != EOF`? As for the macro implementation part, this makes sense. – Igor Liferenko Nov 21 '16 at 07:55
  • @JonathanLeffler It is still strange that `islower()` and friends are required to handle `EOF`, but `putchar()`, `putc()` and `fputc()` are not. Why do `islower()` and friends not treat `int` as `unsigned char` the same way as is done in `putchar()` and friends? This would make total sense, because we have to check for `EOF` first anyway. – Igor Liferenko Nov 21 '16 at 08:23