C++11 introduces char16_t and char32_t to facilitate working with UTF-16- and UTF-32-encoded text strings. But the <iostream> library still only supports the implementation-defined wchar_t for wide-character I/O.

Why has support for char16_t and char32_t not been added to the <iostream> library to complement the wchar_t support?

Marc Mutz - mmutz
oz1cz
  • Have you tried `std::basic_iostream`? Just because there's no predefined types (like `std::iostream` for `char`) doesn't mean there is no support. – Some programmer dude Nov 17 '11 at 14:49
  • I've just tested `basic_istringstream` in GCC version 4.7.0. It compiles, but crashes during execution. This, of course, does not prove that support _could_ be present in another environment, but I still find it strange that the standardization committee did not include support on an equal footing with wchar_t. – oz1cz Nov 17 '11 at 14:56
  • I mean, "... does not _disprove_ that ...". – oz1cz Nov 17 '11 at 15:04
  • `basic_istringstream<char16_t>` and `basic_istringstream<char32_t>` should work fine. If they don't in GCC, then it's just a bug or they haven't gotten to that yet. – bames53 Nov 17 '11 at 15:07
  • @bames53 : The standard doesn't require support beyond `char` and `wchar_t` -- all other character types are strictly implementation-defined, so not supporting them isn't necessarily a "bug". – ildjarn Nov 17 '11 at 17:44
  • @Mooing : §27.2.2/2 says otherwise. This is specific to streams, not `char_traits` or containers. – ildjarn Nov 17 '11 at 18:00
  • @ClausTøndering: `basic_istringstream` (and similar) all default the second argument to `std::char_traits`. You'll have to give it _both_ template arguments. – Mooing Duck Nov 17 '11 at 18:06
  • @ildjarn: Well I'll be... That's bizarre. It clearly states `char, wchar_t, and any other implementation-defined character types...` – Mooing Duck Nov 17 '11 at 18:08
  • @ildjarn I read §27.2.2/2 as saying not that support beyond char and wchar_t is implementation defined, but instead that if there are other character types that satisfy the requirements for a character on which any of the iostream components can be instantiated, then those types are supported. char16_t and char32_t seem to fit that or at least I don't see any requirements they don't fulfill for iostreams. I would be curious to find out why those types aren't listed explicitly in §27.2.2/2 though. Just an oversight? – bames53 Nov 17 '11 at 18:59
  • @bames53 : It's ambiguous certainly, and I read it just the opposite way -- that support for any character types beyond `char` and `wchar_t` is implementation-defined. Also, I'm not sure that streams could be expected to work directly with `char16_t` in particular, because that data type implies the possibility of multi-byte character sequences (surrogate pairs in this case), and I'm not aware that streams can use multi-byte sequences without a non-default facet. That said, std iostreams are certainly not my area of expertise. – ildjarn Nov 17 '11 at 19:09
  • @ildjarn The standard does specify that codecvt does UTF-16 (§ 22.3.1.1.1, Table 81) at least. There is a footnote in § 22.4.1.4.2/3: "Informally, this means that basic_filebuf assumes that the mappings from internal to external characters is 1 to N: a codecvt facet that is used by basic_filebuf must be able to translate characters one internal character at a time." I think that requirement can be managed by using a shift state, and there's a note right above that which seems to explicitly support it. Anyway, I'm still working on becoming an expert myself :) – bames53 Nov 17 '11 at 19:24
  • basic_istringstream compiles with errors under gcc 4.6.2 – Jim Michaels Jan 21 '12 at 21:32
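
To make the disagreement in the comments above concrete, here is a minimal sketch of what instantiating a string stream on `char32_t` looks like. It only uses unformatted `write`/`read`, which go straight to the stream buffer and do not need a `ctype` or `num_get` facet. The helper name `round_trip` is made up for illustration, and whether this runs at all is an assumption about the standard library in use: the standard only lists `char` and `wchar_t`, so an implementation is free to behave differently, and formatted operations such as `>>` on numbers may still fail.

```cpp
#include <cassert>
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: pushes UTF-32 text through a char32_t string stream
// using unformatted I/O only, and returns what was read back.
// Assumption: the library lets basic_stringstream<char32_t> be constructed
// without locale facets for char32_t (not guaranteed by the standard).
std::u32string round_trip(const std::u32string& text) {
    std::basic_stringstream<char32_t> ss;  // traits default to std::char_traits<char32_t>
    const std::streamsize n = static_cast<std::streamsize>(text.size());

    ss.write(text.data(), n);              // unformatted output: no facets involved

    std::u32string out(text.size(), U'\0');
    ss.read(&out[0], n);                   // unformatted input: no whitespace skipping
    assert(ss.gcount() == n);              // everything written should come back
    return out;
}
```

Calling `round_trip(U"h\u00E9llo")` is expected to hand back the same five code points where the library cooperates. If it crashes or throws on your toolchain, that is consistent with the reports in this thread rather than a bug in the sketch.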

1 Answer

In the proposal Minimal Unicode support for the standard library (revision 2) it is indicated that the Library Working Group only supported adding the new character types to strings and codecvt facets. Apparently the majority was opposed to supporting iostream, fstream, facets other than codecvt, and regex.

According to minutes from the Portland meeting in 2006, "the LWG is committed to full support of Unicode, but does not intend to duplicate the library with Unicode character variants of existing library facilities." I haven't found any details; however, I would guess that the committee feels the current library interface is inappropriate for Unicode. One possible complaint is that it was designed with fixed-size characters in mind. Unicode makes that assumption obsolete: while Unicode data can use fixed-size code points, a character is not limited to a single code point.
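
As a small illustration of that last point (my own sketch, not something from the committee papers), the snippet below shows that neither `char16_t` nor `char32_t` gives you "one element per character". The helper `count_code_points` is a hypothetical name and assumes well-formed UTF-16 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string by skipping
// low (trailing) surrogates. Assumes the input is well-formed UTF-16.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++n;  // not a trailing surrogate, so it starts a new code point
        }
    }
    return n;
}

// Illustrative checks only.
inline void code_point_examples() {
    // U+1F600 lies outside the BMP: one code point, two UTF-16 code units.
    const std::u16string emoji = u"\U0001F600";
    assert(emoji.size() == 2);
    assert(count_code_points(emoji) == 1);

    // 'e' followed by U+0301 (combining acute accent): one user-perceived
    // character, but two code points even in UTF-32.
    const std::u32string e_acute = U"e\u0301";
    assert(e_acute.size() == 2);
}
```

So even a fixed-width `char32_t` stream would still be working on code points, not on characters as a user sees them.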

Personally I think there's no reason not to standardize the minimal support that's already provided on various platforms (Windows uses UTF-16 for wchar_t, most Unix platforms use UTF-32). More advanced Unicode support will require new library facilities, but supporting char16_t and char32_t in iostreams and facets wouldn't get in the way and would enable basic Unicode I/O.
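
In the meantime, one way to get basic output without any `char32_t` stream support is to transcode by hand and write bytes to an ordinary narrow stream. The following is only a sketch under stated assumptions: `to_utf8` and `write_utf8` are hypothetical helpers written for this answer (not library functions), the input is assumed to contain valid Unicode scalar values, and whatever consumes the stream is assumed to expect UTF-8.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: encodes UTF-32 code points as UTF-8 bytes.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        // Assumed precondition: a valid scalar value (no surrogates, <= U+10FFFF).
        assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
        if (cp < 0x80) {                                   // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                           // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                         // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                           // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical convenience wrapper: writes UTF-32 text to a narrow stream as UTF-8.
std::ostream& write_utf8(std::ostream& os, const std::u32string& text) {
    return os << to_utf8(text);
}
```

With this, `write_utf8(std::cout, U"h\u00E9llo")` sends plain bytes through the `char` stream that already exists everywhere. It sidesteps the missing facets rather than solving the problem the question asks about.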

bames53
  • @bames53 there is no `<codecvt>` in the libstdc++ source tree: http://gcc.gnu.org/git/?p=gcc.git;a=tree;f=libstdc%2B%2B-v3/include/std;hb=HEAD – rubenvb Jun 19 '12 at 14:13
  • @rubenvb yeah, libstdc++ doesn't have it yet. As far as I know only [libc++](http://llvm.org/svn/llvm-project/libcxx/trunk/include/codecvt) and Dinkumware have it. – bames53 Jun 19 '12 at 14:40
  • But note Dinkumware does not mean MSVC... because last I checked, they didn't have any `charNN_t` support. – rubenvb Jun 19 '12 at 14:41
  • @rubenvb I know that MSVC provided the most minimalistic possible support for `charX_t` types since at least 2010 (defining `char16_t` and `char32_t` as typedefs of `unsigned short` and `unsigned int`), but that didn't work properly everywhere. It's at least semi-functional, though, which is useful when trying to port code back to older versions. – Justin Time - Reinstate Monica Apr 01 '17 at 22:46
  • On the plus side, at least they outright admitted that they didn't provide any _actual_ support for the types. On the minus side, not documenting the typedefs likely led people to use `wchar_t` where they didn't actually need to, and it'd be a miracle if it _didn't_ force people to rewrite code that might possibly have functioned as is. – Justin Time - Reinstate Monica Apr 01 '17 at 22:49