
I have been exploring C++11's new Unicode functionality, and while other C++11 encoding questions have been very helpful, I have a question about the following code snippet from cppreference. The code writes and then immediately reads a text file saved with UTF-8 encoding.

// Write
std::ofstream("text.txt") << u8"z\u6c34\U0001d10b";

// Read
std::wifstream file1("text.txt");
file1.imbue(std::locale("en_US.UTF8"));
std::cout << "Normal read from file (using default UTF-8/UTF-32 codecvt)\n";
for(wchar_t c; file1 >> c; ) // ?
   std::cout << std::hex << std::showbase << c << '\n';

My question is quite simple: why is a `wchar_t` needed in the for loop? A `u8` string literal can be declared with a plain `char *`, and the bit layout of UTF-8 itself tells the system each character's width. There appears to be some automatic conversion from UTF-8 to UTF-32 (hence the `wchar_t`), but if that is the case, why is the conversion necessary?
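
For concreteness, here is a quick sketch of my own (not from cppreference) confirming that the literal above is just eight `char`-sized UTF-8 code units:

#include <cstring>
#include <iostream>

int main()
{
    // in C++11 a u8 literal has type const char[]:
    // 'z' is 1 byte, U+6C34 is 3 bytes, U+1D10B is 4 bytes -- 8 code units total
    const char* s = u8"z\u6c34\U0001d10b";
    std::cout << std::strlen(s) << " bytes:" << std::hex;
    for (const char* p = s; *p; ++p)
        std::cout << ' ' << static_cast<unsigned>(static_cast<unsigned char>(*p));
    std::cout << '\n';
}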

Ephemera
  • It depends on a lot of things. Notably, correct UTF-8 behaviour is extremely hard if not impossible using Windows in a console application (requiring _at least_ a good number of non-standard API calls IIRC) – sehe Mar 18 '13 at 10:57
  • `wchar_t` is used because `wifstream` is used, and `wifstream` performs that "some automatic conversion" you mention. My point was to show the difference between that automatic conversion (as implemented for one particular platform) and the explicit, portable, locale-independent, Unicode conversion provided by `codecvt_utf8_utf16`. – Cubbi Mar 18 '13 at 14:29

2 Answers


You use `wchar_t` because you're reading the file with a `wifstream`; if you were reading with an `ifstream` you'd use `char`, and similarly `char16_t` with `basic_ifstream<char16_t>` and `char32_t` with `basic_ifstream<char32_t>`.

Assuming (as the example does) that `wchar_t` is 32 bits wide and that the native wide character set it represents is UTF-32 (UCS-4), this is the simplest way to read a file as UTF-32; the example presents it that way for contrast with reading a file as UTF-16. A more portable method is to use `basic_ifstream<char32_t>` together with `std::codecvt_utf8<char32_t>` explicitly, as that combination is guaranteed to convert from a UTF-8 input stream to UTF-32 elements.
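
A minimal sketch of that portable approach (my own illustration, not from the cppreference page; note that formatted extraction with `>>` would require a `ctype<char32_t>` facet that default locales don't provide, so unformatted `get()` is used instead):

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>

int main()
{
    // UTF-8 on disk -> UTF-32 (char32_t) in memory, regardless of wchar_t's width
    std::basic_ifstream<char32_t> file("text.txt");
    file.imbue(std::locale(file.getloc(), new std::codecvt_utf8<char32_t>));

    // unformatted get() avoids needing a ctype<char32_t> facet in the locale
    for (std::char_traits<char32_t>::int_type c;
         (c = file.get()) != std::char_traits<char32_t>::eof(); )
        std::cout << std::hex << std::showbase << c << '\n';
}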

ecatmur
  • +1, I wrote that example and contrast was what I was going for. – Cubbi Mar 18 '13 at 13:54
  • Ah I see! So is it therefore better practice to always explicitly convert UTF-8 to a wider `wchar_t` or is it still acceptable to just extract the raw UTF-8 bytes into a native `char` array using an `ifstream`? I'm not sure whether to infer from @Cubbi's example that the latter is bad practice, or whether it is just outside the scope of the example. – Ephemera Mar 19 '13 at 00:47
  • @PLPiper yes, you can always read whatever multibyte encoding the file has into a `char` array, without engaging any of the conversions. There isn't a lot that can be done with such an array within standard C++ (other than converting to wide first), but plenty of libraries take UTF-8 input. – Cubbi Mar 19 '13 at 02:26
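
For reference, the raw-byte approach Cubbi describes might look like this (a minimal sketch; the file name is reused from the question, and the resulting string would be handed to whatever UTF-8-aware library you use):

#include <fstream>
#include <iterator>
#include <string>

int main()
{
    // slurp the file's bytes verbatim; no locale/codecvt conversion is engaged
    std::ifstream file("text.txt", std::ios::binary);
    std::string utf8((std::istreambuf_iterator<char>(file)),
                     std::istreambuf_iterator<char>());
    // utf8 now holds the exact UTF-8 code units that were on disk,
    // ready to pass to a UTF-8-aware library
}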

The idea of the cppreference snippet you quoted is to show how to read a UTF-8 file into a UTF-16 string; that's why it writes the file with an `ofstream` but reads it back with a `wifstream` (hence the `wchar_t`).
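
A minimal sketch of that explicit UTF-8-to-UTF-16 read (using the `codecvt_utf8_utf16` facet Cubbi mentioned above; my own illustration, assuming `wchar_t` can hold 16-bit code units):

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>

int main()
{
    std::wifstream file("text.txt");
    // codecvt_utf8_utf16 always produces UTF-16 code units, whatever wchar_t's width
    file.imbue(std::locale(file.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
    for (wchar_t c; file >> c; )  // U+1D10B comes out as a surrogate pair
        std::cout << std::hex << std::showbase << c << '\n';
}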

rlods