How to read a UCS-2 file?

Question

I'm writing a program to extract information from a *.rc file encoded in UCS-2 Little Endian.

int _tmain(int argc, _TCHAR* argv[]) {
  wstring csvLine(wstring sLine);
  wifstream fin("en.rc");
  wofstream fout("table.csv");
  wofstream fout_rm("temp.txt");
  wstring sLine;
  fout << "en\n";
  while(getline(fin,sLine)) {
    if (sLine.find(L"IDS") == -1)
      fout_rm << sLine << endl;
    else
      fout << csvLine(sLine);
  }
  fout << flush;
  system("pause");
  return 0;
}

The first line of "en.rc" is #include <windows.h>, but sLine contains the following:

[0]     255 L'ÿ'
[1]     254 L'þ'
[2]     35  L'#'
[3]     0
[4]     105 L'i'
[5]     0
[6]     110 L'n'
[7]     0
[8]     99  L'c'
.       .
.       .
.       .

This program works correctly for UTF-8. How can I make it work for UCS-2?

Your example code won't even compile since it's using a variable `fout_rm` which is not declared. — Some programmer dude, Jul 25 '12 at 06:39
I missed the declaration line when I pasted it. The code has been updated. — goss.beta, Jul 25 '12 at 06:48
Read this: http://www.codeproject.com/Articles/38242/Reading-UTF-8-with-C-streams — Remus Rusanu, Jul 25 '12 at 06:58
Incidentally, this code does *not* work for UTF-8 input (nor does similar code). You probably just got lucky because you only used characters within the first 127 code points. C++ streams cannot decode different encodings, they are completely encoding-agnostic. — Konrad Rudolph, Jul 25 '12 at 07:00
It looks like your "UCS-2 file" is actually a UTF-16 file with [a byte-order marker](https://en.wikipedia.org/wiki/Byte_order_mark). — Joachim Sauer, Jul 25 '12 at 07:55

score 10 · Accepted Answer · edited May 23 '17 at 11:51

Wide streams use a wide stream buffer to access the file. The wide stream buffer reads bytes from the file and uses its codecvt facet to convert these bytes to wide characters. The default codecvt facet is std::codecvt<wchar_t, char, std::mbstate_t>, which converts between the native character sets for wchar_t and char (i.e., like mbstowcs() does).
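To see why the dump in the question looks the way it does, here is a minimal sketch that assumes the default facet ends up widening each byte on its own (which is what the question's output suggests). `widen_bytes` is a hypothetical helper written for this illustration, not part of the standard library.

```cpp
#include <string>

// Hypothetical helper: mimic a byte-for-byte widening conversion,
// i.e. every input byte becomes exactly one wchar_t.
std::wstring widen_bytes(const std::string& bytes) {
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(static_cast<wchar_t>(b)); // no decoding, just widening
    }
    return out;
}
```

Feeding it the UTF-16LE bytes FF FE 23 00 69 00 (a BOM followed by "#i") yields six wide characters: 255, 254, 35, 0, 105, 0, the same pattern as in the question.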

You're not using the native char character set, so what you want is a codecvt facet that reads UCS-2 as a multibyte sequence and converts it to wide characters.

#include <fstream>
#include <string>
#include <codecvt>
#include <iostream>
#include <locale>

int main(int argc, char *argv[])
{
    std::wifstream fin("en.rc", std::ios::binary); // You need to open the file in binary mode

    // Imbue the file stream with a codecvt facet that uses UTF-16 as the external multibyte encoding
    fin.imbue(std::locale(fin.getloc(),
              new std::codecvt_utf16<wchar_t, 0xffff, std::consume_header>));

    // ^ We set 0xFFFF as the maxcode because that's the largest that will fit in a single wchar_t
    //   We use consume_header to detect and use the UTF-16 'BOM'

    // The following is not really the correct way to write Unicode output, but it's easy
    std::wstring sLine;
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    while (getline(fin, sLine))
    {
        std::cout << convert.to_bytes(sLine) << '\n';
    }
}

Note that there's an issue with UTF-16 here. The purpose of wchar_t is for one wchar_t to represent one codepoint. However, Windows uses UTF-16, which represents some codepoints as two wchar_ts. This means that the standard API doesn't work very well on Windows.

The consequence here is that when the file contains a surrogate pair, codecvt_utf16 will read that pair, convert it to a single codepoint value greater than 16 bits and have to truncate the value to 16 bits to stick it in a wchar_t. This means this code really is limited to UCS-2. I've set the maxcode template parameter to 0xFFFF to reflect this.
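As a rough illustration of why a surrogate pair cannot fit in a 16-bit wchar_t, the sketch below combines a pair into its code point using the standard UTF-16 arithmetic. `combine_surrogates` is a hypothetical name chosen for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
std::uint32_t combine_surrogates(char16_t high, char16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF); // high (lead) surrogate range
    assert(low >= 0xDC00 && low <= 0xDFFF);   // low (trail) surrogate range
    // Each surrogate carries 10 payload bits; the result starts at U+10000.
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

For example, the pair 0xD83D 0xDE00 combines to U+1F600, which needs 17 bits. Every result of this function is at least 0x10000, so none of them fits in 16 bits.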

There are a number of other problems with wchar_t, and you might want to just avoid it entirely: What's “wrong” with C++ wchar_t?
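If you do want to sidestep wchar_t, one possible approach (a sketch, not the only way) is to read the raw bytes and assemble char16_t code units yourself. The helper names `decode_utf16_bytes` and `read_utf16_file` are hypothetical, and treating a file without a BOM as little-endian is an assumption made for this example.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: decode raw UTF-16 bytes into char16_t code units.
// Assumption for this sketch: little-endian when no BOM is present.
std::u16string decode_utf16_bytes(const std::string& bytes) {
    std::size_t pos = 0;
    bool little_endian = true;
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) {
            pos = 2;                 // UTF-16LE BOM
        } else if (b0 == 0xFE && b1 == 0xFF) {
            pos = 2;                 // UTF-16BE BOM
            little_endian = false;
        }
    }
    std::u16string out;
    out.reserve((bytes.size() - pos) / 2);
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[little_endian ? pos : pos + 1]);
        const unsigned hi = static_cast<unsigned char>(bytes[little_endian ? pos + 1 : pos]);
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out; // a trailing odd byte, if any, is ignored
}

// Hypothetical helper: read a whole file in binary mode and decode it.
std::u16string read_utf16_file(const char* path) {
    std::ifstream in(path, std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decode_utf16_bytes(bytes);
}
```

The result keeps surrogate pairs as two code units instead of truncating them, and it behaves the same regardless of how wide wchar_t is on the platform. Splitting the text into lines is then a matter of searching the std::u16string for u'\n'.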

score 0 · Answer 2 · answered Nov 24 '20 at 08:56

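#include <cstdio>
#include <filesystem>
#include <string>
namespace fs = std::filesystem;

    // Windows-only: assumes a 2-byte wchar_t and a little-endian file that starts with a BOM.
    const fs::path filename = L"myfile.txt";
    FILE* f = _wfopen(filename.c_str(), L"rb");
    auto file_size = fs::file_size(filename);
    std::wstring buf;
    if (f != nullptr && file_size >= 2) {
        // buf in my code is a template object, so I use decltype(buf) to decide its type.
        buf.resize((size_t)(file_size - 2) / sizeof(decltype(buf)::value_type));
        fseek(f, 2, SEEK_SET); // skip the 2-byte UCS-2 BOM
        fread(&buf[0], sizeof(decltype(buf)::value_type), buf.size(), f);
    }
    if (f != nullptr) fclose(f);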
