6

Today I discovered that the C++ standards committee dropped Unicode stream support from C++0x in the second revision. For more information, see this question.

According to this document:

The rationale for leaving out stream specializations of the two new types was that streams of non-char types have not attracted wide usage, so it is not clear that there is a real need for doubling the number of specializations of this very complicated machinery.

From this interview with Stroustrup:

Obviously, we ought to have Unicode streams and other much extended Unicode support in the standard library. The committee knew that but didn't have anyone with the skills and time to do the work, so unfortunately, this is one of the many areas where you have to look for "third party" support.

I'm not an expert in Unicode, and I'm wondering why implementing Unicode streams is so difficult. What is so problematic about it?

UmmaGumma

2 Answers

5

The first paragraph you cited tells you: it's not that Unicode streams in particular are more difficult than other streams, it's that iostreams in general are extremely complicated. Thus, implementing Unicode iostreams is difficult not because they are Unicode, but because they are iostreams.

Ben Voigt
  • Notably, implementing Unicode streams implies implementing a large part of locale for those character types. – AProgrammer Apr 14 '11 at 17:35
  • Also, Unicode is an abstract method for connecting characters with numbers called "code points". There are different ways of actually doing the encoding, like UTF-8 and UTF-16 (to name probably the two most popular). In an actual Standards document or implementation, you can't just say "Unicode" and get away with it; you have to pick one or more encodings. – David Thornley Apr 14 '11 at 17:42
  • There is nothing new to fundamentally add to iostreams. `codecvt` was designed to support multibyte encodings from the beginning, and UTF-8 support has long been around. I'm not sure what Stroustrup is saying there, Unicode iostreams are alive and well. – Potatoswatter Apr 14 '11 at 18:34
3

The paper N2238 is from 2007 and has no relevance. I'm not sure what Stroustrup is specifically referring to in the interview, but that isn't breaking news.

N3242 §22.5 still requires codecvt_utf8 and codecvt_utf16, which are all you need for Unicode file I/O. Imbue the proper facet onto wcout and you should be good to go, assuming you have a compliant library. In practice, GCC and MSVC already supply UTF-8 support, and I would expect every serious C++ platform to keep parity between mbstowcs and codecvt.

There may be confusion because N3242 §22.5/5 says

— The multibyte sequences may be written only as a binary file. Attempting to write to a text file produces undefined behavior.

This is because text-mode I/O converts line endings, so a 0x0A byte that is half of a 16-bit UTF-16 code unit could be converted to 0x0D, 0x0A, corrupting the stream. This has nothing to do with poor support; just be sure to open the file in binary mode, as you must with any library providing such functionality.
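
As a minimal sketch of the approach described above (not a definitive implementation), the following imbues a codecvt_utf8 facet onto a wide file stream opened in binary mode and reads the raw bytes back. It assumes a standard library that ships the `<codecvt>` header, which later standards deprecated; the helper name `write_utf8_and_read_back` and its parameters are hypothetical, chosen only for illustration.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` to `path` as UTF-8 through a wide stream,
// then returns the raw bytes that ended up in the file.
std::string write_utf8_and_read_back(const std::string& path,
                                     const std::wstring& text) {
    {
        std::wofstream out;
        // The locale takes ownership of the facet; imbue before opening the file.
        out.imbue(std::locale(std::locale::classic(),
                              new std::codecvt_utf8<wchar_t>));
        // Binary mode: no line-ending translation is applied to the encoded bytes.
        out.open(path, std::ios::binary | std::ios::trunc);
        out << text;
    }  // the stream is flushed and closed here

    // Read the file back byte by byte, again in binary mode.
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

For example, writing the wide string L"\u00e9\n" this way should leave the two-byte UTF-8 sequence for é followed by a single line-feed byte in the file. Swapping in codecvt_utf16 follows the same pattern, and that is where binary mode matters most, since a 0x0A byte can then occur inside a code unit.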

Potatoswatter
  • But with `codecvt_utf16` you can't save a file as Unicode, so it's not a complete solution. – UmmaGumma Apr 14 '11 at 18:34
  • @Ashot: What are you talking about? It exists to save files in UTF-16 format. What else would it do? – Potatoswatter Apr 14 '11 at 18:35
  • I still don't get it. How can I write something to file and save it in unicode mode with `codecvt_utf8`? Thanks – UmmaGumma Apr 14 '11 at 18:40
  • @Ashot: *If* your platform implements `codecvt_utf8`, use `cout.imbue( new codecvt_utf8< wchar_t > )` (or replace `cout` with `my_stream`). If it doesn't, then you should open a named locale. That's a separate question, please start a new page for that. – Potatoswatter Apr 14 '11 at 18:45
  • Oops, that should be `stream.imbue( locale( locale(), new codecvt…`, or there are many other approaches. Yes, highly specific tasks are a pain. No, the support is not being reduced. – Potatoswatter Apr 14 '11 at 19:01