2

Does GCC's standard library, Boost, or any other library implement iostream-compliant versions of ifstream or ofstream that support conversion between UTF-8-encoded (file) streams and a std::vector<wchar_t> or std::wstring?

Nordlöw

2 Answers

4

The C++11 solution is to wrap the UTF-8 stream in an appropriate std::wbuffer_convert:

#include <codecvt>  // std::codecvt_utf8
#include <fstream>
#include <locale>   // std::wbuffer_convert
#include <string>

int main()
{
    std::ifstream utf8file("test.txt"); // assumes the file holds UTF-8 data
    // Decode UTF-8 bytes from the file's buffer into wchar_t on the fly.
    std::wbuffer_convert<std::codecvt_utf8<wchar_t>> conv(utf8file.rdbuf());
    std::wistream ucsbuf(&conv);
    std::wstring line;
    std::getline(ucsbuf, line); // line now holds UCS2 or UCS4, depending on the OS
}

This works with Visual Studio 2010 and with clang++/libc++, but, unfortunately, not with GCC.

Until this becomes widespread, third-party libraries are indeed the best solution.
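Where neither <codecvt> nor a third-party library is at hand, the decoding step itself is small enough to sketch by hand. The following is only an illustration, not part of the original answer: decode_utf8 is a hypothetical helper name, it targets std::u32string rather than wchar_t so the result is the same on every platform, and it rejects truncated or malformed sequences but does not check for overlong forms or surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Throws std::runtime_error on truncated or malformed input.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes that must follow
        if      (lead < 0x80)           { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        // The lead byte plus its continuation bytes must fit in the input.
        if (in.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F); // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

To use it with a file, read the whole file into a std::string through a plain std::ifstream and pass that string to decode_utf8.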

Cubbi
  • Does this convert UTF8 to WCHAR, or only to other UTFs? It'd be very nice if UTF<->WCHAR conversion were part of the standard! – Kerrek SB Oct 25 '11 at 15:58
  • @KerrekSB: Indeed, locale-independent codecvt's produce only UTF-16, UCS2, or UTF-32. Not the wcstombs-compatible opaque wide character encoding. – Cubbi Oct 25 '11 at 16:05
2

Your question doesn't quite work. UTF-8 is a specific encoding, while wchar_t is a data type. Moreover, the standard intends wchar_t to represent the system's character set, but the details are left entirely to the platform, and the standard makes no requirements.

Therefore, the correct thing to ask for is, first of all, conversion between the system's narrow, multibyte encoding and the system's fixed-width wide-character encoding. This functionality is provided by std::mbstowcs and std::wcstombs. There may also be a locale facet somewhere that wraps this, but that's a bit of a niche area of the library.
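As a rough sketch of what that looks like in practice (narrow_to_wide and wide_to_narrow are made-up helper names, not library functions): both directions depend on the current C locale, so a program would normally call std::setlocale(LC_ALL, "") first, and whether the narrow side is UTF-8 is then up to the environment. Embedded null characters end the conversion early.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: narrow multibyte string (current C locale) -> wide string.
// Returns an empty string if the input is not valid in the current locale.
std::wstring narrow_to_wide(const std::string& s)
{
    // A multibyte character occupies at least one byte, so s.size() wide
    // characters plus a terminator is always enough room.
    std::vector<wchar_t> buf(s.size() + 1);
    const std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    return std::wstring(buf.data(), n);
}

// Hypothetical helper: wide string -> narrow multibyte string (current C locale).
// Returns an empty string if a character cannot be represented.
std::string wide_to_narrow(const std::wstring& w)
{
    // MB_CUR_MAX is the maximum number of bytes per character in this locale.
    std::vector<char> buf(w.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(buf.data(), w.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::string();
    return std::string(buf.data(), n);
}
```

Note that this only ever speaks the locale's encoding; it gives no way to say "this input is UTF-8" regardless of the environment.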

If you want to convert between the opaque "system's encoding" prescribed by the standard and a definite encoding prescribed by your serialized data source/sink, you need an extra library. I'd recommend POSIX's iconv(), which is widely available. (The Windows API has a different approach and offers special functions for conversion.)
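A minimal sketch of the iconv() route, under a few assumptions: utf8_to_wide is a hypothetical helper name; the "WCHAR_T" encoding name is understood by glibc and GNU libiconv but is not guaranteed by POSIX; and some older platforms declare the input-buffer parameter as const char**, in which case the cast needs adjusting. On systems where iconv is a separate library you may also need to link with -liconv.

```cpp
#include <cassert>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert a UTF-8 string to the platform's wchar_t encoding.
std::wstring utf8_to_wide(const std::string& utf8)
{
    // iconv_open takes the target encoding first, then the source encoding.
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed");

    // Each wide character comes from at least one input byte.
    std::vector<wchar_t> out(utf8.size() + 1);
    char* in_ptr = const_cast<char*>(utf8.data());
    std::size_t in_left = utf8.size();
    char* out_ptr = reinterpret_cast<char*>(out.data());
    std::size_t out_left = out.size() * sizeof(wchar_t);

    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid or incomplete UTF-8 input");

    // out_left counts the unused bytes; the rest holds converted characters.
    assert(out_left % sizeof(wchar_t) == 0);
    const std::size_t written = out.size() - out_left / sizeof(wchar_t);
    return std::wstring(out.data(), written);
}
```

For large files you would call iconv() repeatedly on chunks instead of converting everything in one go; the one-shot form is just the shortest way to show the call sequence.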

C++11 alleviates the issue slightly by adding an explicit family of UTF-encoded string types and literals, and presumably also transcoding facilities among those (though I've never seen them implemented by anyone).
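For illustration, here is roughly what those facilities look like where they are implemented: a sketch that pairs the new std::u32string type with std::wstring_convert and std::codecvt_utf8 from <codecvt>. The function names utf8_to_utf32 and utf32_to_utf8 are made up for this example. Bear in mind that these converters were deprecated in C++17, so newer compilers may warn about them.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes -> UTF-32 code points.
// from_bytes throws std::range_error on malformed input.
std::u32string utf8_to_utf32(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: UTF-32 code points -> UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& utf32)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}
```

Unlike the mbstowcs family, nothing here depends on the current locale: both ends of the conversion are definite encodings.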

Here's my standard response from past posts on the subject: Q1, Q2, Q3. C++11 will be a joy once it's fully available :-)

Kerrek SB
  • C++11's explicit UTF8/UTF16/UCS2/UCS4 conversions have been implemented in clang++/libc++ and in Visual Studio 2010 for quite a while now. The following runs on both: https://ideone.com/hywz6 – Cubbi Oct 25 '11 at 13:08
  • @Cubbi: Thanks! What about `` (see my Q3)? – Kerrek SB Oct 25 '11 at 13:12
  • Those I haven't had a chance to try out. [Dinkumware says](http://www.dinkumware.com/manuals/?manual=compleat&page=uchar.html#mbrtoc16) it has them. – Cubbi Oct 25 '11 at 13:29