
I want to read lines of Unicode text (UTF-16 LE, line feed delimited) from a file. I'm using Visual Studio 2012 and targeting a 32-bit console application.

I was not able to find a ReadLine function in the Win32 API, so I turned to Google. It is clear I am not the first to seek such a function. The most commonly recommended solution involves using std::wifstream.

I wrote code similar to the following:

#include <fstream>

wchar_t buffer[1024];
std::wifstream input(L"input.txt");

while (input.good())
{
    input.getline(buffer, 1024);
    // ... do stuff...
}

input.close();

For the sake of explanation, assume that input.txt contains two UTF-16 LE lines, each less than 200 wchar_t characters long.

Prior to calling getline the first time, Visual Studio correctly identifies that buffer is an array of wchar_t. You can mouse over the variable in the debugger and see that the array is comprised of 16-bit values. However, after the call to getline returns, the debugger displays buffer as if it were a byte array.

After the first call to getline, the contents of buffer are correct (aside from buffer being treated like a byte array). If the first line of input.txt contains the UTF-16 string L"123", this is correctly stored in buffer as (hex) "31 00 32 00 33 00".

My first thought was to reinterpret_cast<wchar_t *>(buffer), which does produce the desired result: the debugger treats buffer as a wchar_t array again, and it contains the values I expect.

However, after the second call to getline (the second line of input.txt contains the string L"456"), buffer contains (hex) "00 34 00 35 00 36 00". Note that this is incorrect; it should be (hex) "34 00 35 00 36 00".

The fact that the byte ordering gets messed up prevents me from using reinterpret_cast to work around this. More importantly, why is std::wifstream::getline converting my wchar_t buffer into a char buffer in the first place? I was under the impression that if one wants chars one uses ifstream, and if one wants wchar_t one uses wifstream...

I am terrible at making sense of the STL headers, but it almost looks as if wifstream is intentionally converting my wchar_t to char... why?

I would appreciate any insights or explanations that would help me understand what is going on here.

charunnera

1 Answer


wifstream reads bytes from the file and converts them to wide characters using the codecvt facet installed in the stream's locale. The default facet assumes the system-default code page and effectively calls mbstowcs on those bytes.
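
To see why that mangles UTF-16 input, consider the file bytes for L"123" fed through a byte-at-a-time narrow-to-wide conversion. This is a hypothetical illustration; mbstowcs stands in here for what the default facet effectively does:

#include <cstdlib>
#include <cwchar>

int main()
{
    // UTF-16 LE bytes for "123" as they sit in the file.
    const char raw[] = { 0x31, 0x00, 0x32, 0x00, 0x33, 0x00 };
    wchar_t out[8] = {};

    // Each byte is treated as one narrow character; the first 0x00
    // already looks like a string terminator, so only L"1" survives.
    std::mbstowcs(out, raw, 8);
    std::wprintf(L"%ls\n", out); // prints "1", not "123"
    return 0;
}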

To treat your file as UTF-16, you need to use codecvt_utf16, like this:

#include <codecvt>
#include <fstream>
#include <locale>

std::wifstream fin("text.txt", std::ios::binary);
// apply the UTF-16 LE facet
fin.imbue(std::locale(fin.getloc(),
          new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));

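For completeness, a minimal sketch of how this might look in a full read loop. The std::getline overload that reads into a std::wstring avoids the fixed-size buffer; note that a leading BOM, if the file has one, would appear as U+FEFF in the first line under this facet configuration:

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wifstream fin("text.txt", std::ios::binary);
    fin.imbue(std::locale(fin.getloc(),
              new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));

    std::wstring line;
    while (std::getline(fin, line)) // reads whole lines, grows as needed
    {
        // ... do stuff with line ...
    }
    return 0;
}
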
Igor Tandetnik
  • I had never heard of imbuing a stream, but indeed that provided the solution to the issues I was having. It has, however, sparked another question which I will post on SO. Thank you! – charunnera Nov 01 '13 at 01:06
  • Why do you open the file in the binary mode? – hkBattousai Apr 01 '16 at 09:20
  • @hkBattousai So that, say, U+0D0A "MALAYALAM LETTER UU" doesn't get helpfully converted to `\n`. – Igor Tandetnik Apr 01 '16 at 13:05
  • Do you mean that the 0x000A character will remain as the first character of the next line? Can't we check for it after reading each line, and erase it if it exists? Reading the file in binary mode and parsing the lines is really a pain in the neck. I would like to use the text mode if possible. – hkBattousai Apr 01 '16 at 18:12
  • @hkBattousai In a stream of bytes that represent UTF-16 encoded text, bytes 0x0D and 0x0A may occasionally end up next to each other, either in the form of U+0D0A or U+0A0D characters (depending on endianness), or on the boundary between two characters. I suspect that a stream opened in text mode would, upon encountering such a pair of bytes, convert them to a single 0x0A byte **before** handing the bytes over to the codec. That would mess everything up. (continued) – Igor Tandetnik Apr 01 '16 at 20:23
  • @hkBattousai Even if that doesn't happen, I'm pretty sure that a sequence of Unicode codepoints `U+000D U+000A` will **not** be automatically converted to a single codepoint `U+000A` by a stream opened in text mode. Therefore, trying to use the text mode would be pointless at best and dangerous at worst. Of course, if you care sufficiently strongly, you can always test it and see if my concerns turn out to be unfounded. – Igor Tandetnik Apr 01 '16 at 20:25
  • @hkBattousai In any case, I'm not sure what you mean by "parsing the lines". The only difference (in the "normal", ASCII-only case) between text mode and binary mode is that the sequence `\r\n` gets automatically converted to a single byte `\n` on input, and vice versa on output (this is on Windows; on Linux, say, there's no difference between these two modes at all). There's nothing preventing you from calling, say, `getline` on a file opened in binary mode - but the line may have a trailing `\r` at the end. – Igor Tandetnik Apr 01 '16 at 20:31
  • I get it now. Thank you for the information. It helped a lot. – hkBattousai Apr 02 '16 at 03:33
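
Following up on the binary-mode discussion in the comments above, a minimal sketch of the trailing-carriage-return trimming Igor describes (the helper name is illustrative):

#include <string>

// After getline on a stream opened in binary mode, a CRLF-terminated
// line keeps its trailing L'\r'; strip it by hand.
inline void trim_trailing_cr(std::wstring& line)
{
    if (!line.empty() && line[line.size() - 1] == L'\r')
        line.erase(line.size() - 1);
}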