
I have a file in UTF-16 (or UCS-2, doesn't really matter since it is UTF-16 LE as far as I know) encoding which I have downloaded from here: http://www.humancomp.org

I'd like to read the contents of that file into a std::wstring, which is my first problem: I haven't been able to read the file correctly yet. The data I read always seems to be garbled.

Secondly, I'd like to compare the read std::wstring to a const wchar_t* string literal. And here, I am experiencing my second problem: How do I specify the wchar_t content via hex values?

The file that I want to turn into a const wchar_t* string literal contains the following bytes (copied out of a hex editor):

FE FF 05 31 05 65 05 81 05 65 05 70 05 6B 00 20 05 6B 05 74 00 20 05 6C 05 61 05 7E 00 20 00 3F 05 82 05 72 05 6B 05 65 00 20 05 6C 05 61 05 7E 05 61 05 80 05 61 05 80 00 2C 00 0D 00 0A 05 3F 05 75 05 61 05 65 05 62 05 7D 00 20 05 79 05 7F 05 61 05 75 05 6B 00 20 05 6F 05 61 05 7D 05 6F 05 61 05 6E 05 6B 00 20 05 74 05 70 05 63 05 6B 05 65 00 2E 00 2E 00 2E 00 0D 00 0A 05 31 05 75 05 65 05 7A 05 70 05 7D 00 20 05 6F 00 3F 05 82 05 66 05 70 05 6B 00 20 05 74 05 70 05 6F 05 65 00 20 05 6B 05 65 05 6E 00 20 00 3F 05 61 05 7E 05 61 05 7F 05 80 00 2C 00 0D 00 0A 05 31 05 75 05 65 05 7A 05 70 05 7D 00 20 05 6F 00 3F 05 82 05 66 05 70 05 6B 00 20 00 3F 05 61 05 7E 05 61 05 7F 05 61 05 6C 00 20 05 74 05 70 05 6F 05 6B 05 65 05 89

Of course, I can't initialize the string literal with that directly. I tried writing the bytes as hex escapes in a narrow string literal and applying a reinterpret_cast to get a const wchar_t*:

reinterpret_cast<const wchar_t*>("\xFE\xFF\x05\x31\x05\x65\x05\x81\x05\x65\x05\x70\x05\x6B\x00\x20\x05\x6B\x05\x74\x00\x20\x05\x6C\x05\x61\x05\x7E\x00\x20\x00\x3F\x05\x82\x05\x72\x05\x6B\x05\x65\x00\x20\x05\x6C\x05\x61\x05\x7E\x05\x61\x05\x80\x05\x61\x05\x80\x00\x2C\x00\x0D\x00\x0A\x05\x3F\x05\x75\x05\x61\x05\x65\x05\x62\x05\x7D\x00\x20\x05\x79\x05\x7F\x05\x61\x05\x75\x05\x6B\x00\x20\x05\x6F\x05\x61\x05\x7D\x05\x6F\x05\x61\x05\x6E\x05\x6B\x00\x20\x05\x74\x05\x70\x05\x63\x05\x6B\x05\x65\x00\x2E\x00\x2E\x00\x2E\x00\x0D\x00\x0A\x05\x31\x05\x75\x05\x65\x05\x7A\x05\x70\x05\x7D\x00\x20\x05\x6F\x00\x3F\x05\x82\x05\x66\x05\x70\x05\x6B\x00\x20\x05\x74\x05\x70\x05\x6F\x05\x65\x00\x20\x05\x6B\x05\x65\x05\x6E\x00\x20\x00\x3F\x05\x61\x05\x7E\x05\x61\x05\x7F\x05\x80\x00\x2C\x00\x0D\x00\x0A\x05\x31\x05\x75\x05\x65\x05\x7A\x05\x70\x05\x7D\x00\x20\x05\x6F\x00\x3F\x05\x82\x05\x66\x05\x70\x05\x6B\x00\x20\x00\x3F\x05\x61\x05\x7E\x05\x61\x05\x7F\x05\x61\x05\x6C\x00\x20\x05\x74\x05\x70\x05\x6F\x05\x6B\x05\x65\x05\x89");

but this doesn't work. It gives me bogus data.

I've also tried to create a wchar_t string literal directly:

L"\xFEFF\x0531\x0565\x0581\x0565\x0570\x056B\x0020\x056B\x0574\x0020\x056C\x0561\x057E\x0020\x003F\x0582\x0572\x056B\x0565\x0020\x056C\x0561\x057E\x0561\x0580\x0561\x0580\x002C\x000D\x000A\x053F\x0575\x0561\x0565\x0562\x057D\x0020\x0579\x057F\x0561\x0575\x056B\x0020\x056F\x0561\x057D\x056F\x0561\x056E\x056B\x0020\x0574\x0570\x0563\x056B\x0565\x002E\x002E\x002E\x000D\x000A\x0531\x0575\x0565\x057A\x0570\x057D\x0020\x056F\x003F\x0582\x0566\x0570\x056B\x0020\x0574\x0570\x056F\x0565\x0020\x056B\x0565\x056E\x0020\x003F\x0561\x057E\x0561\x057F\x0580\x002C\x000D\x000A\x0531\x0575\x0565\x057A\x0570\x057D\x0020\x056F\x003F\x0582\x0566\x0570\x056B\x0020\x003F\x0561\x057E\x0561\x057F\x0561\x056C\x0020\x0574\x0570\x056F\x056B\x0565\x0589"

This, again, results in bogus data. I'm not even sure whether combining two bytes into one escape sequence is the correct way of specifying wchar_t data.

08Dc91wk
j00hi
  • The bytes of your file are in **UTF-16BE** (as evident by the presence of the UTF-16BE BOM). If your string literal is in **UTF-16LE** instead, you will have to do a conversion before you can compare them. Your `reinterpret_cast` for the raw literal bytes is fine, except that you get garbage at the end because you are not including a null terminator in UTF-16: `\x00\x00`. Your `L"..."` literal is null terminated correctly. – Remy Lebeau Jul 21 '15 at 23:47
  • To read a UTF-16BE encoded file into a `std::wstring`, use a `std::wifstream` that has been `imbue()`'ed with a `std::locale` object that represents UTF-16BE. If you are using C++11, you can create a `std::locale` that uses the `std::codecvt_utf16` class with its `std::consume_header` flag enabled so it will account for the BOM. – Remy Lebeau Jul 21 '15 at 23:54
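
The approach described in the comment above can be sketched roughly as follows. This is a minimal sketch, not a tested solution from the thread: the helper name `read_utf16_file` is made up for illustration, and `std::codecvt_utf16` is deprecated since C++17, so newer compilers may warn about it. It assumes a small file that starts with a BOM, or that is big-endian if it has none, which matches the bytes shown in the question.

```cpp
#include <cassert>   // only needed for the usage check mentioned below
#include <codecvt>   // std::codecvt_utf16, std::consume_header (deprecated since C++17)
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original post): reads a whole UTF-16 file
// into a std::wstring. Without a BOM, std::codecvt_utf16 assumes big-endian.
std::wstring read_utf16_file(const std::string& path) {
    std::wifstream in;
    // The locale takes ownership of the facet; consume_header skips the BOM.
    in.imbue(std::locale(std::locale(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    // Binary mode keeps the stream from translating 0x0D 0x0A byte pairs.
    in.open(path, std::ios::binary);
    std::wstringstream buffer;
    buffer << in.rdbuf();   // leaves buffer empty if the file could not be opened
    return buffer.str();
}
```

For the file in the question, whose first bytes are FE FF 05 31, the returned string should start with U+0531, so a check such as `assert(text[0] == 0x0531)` is a quick way to confirm that the byte order was handled.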

1 Answer


Here is the solution I arrived at with the help of Remy Lebeau's comment:

// BOM: \xFEFF
auto utf16raw = L"\x0531\x0565\x0581\x0565\x0570\x056B\x0020\x056B\x0574\x0020\x056C\x0561\x057E\x0020\x003F\x0582\x0572\x056B";
std::wstring utf16str{utf16raw};

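The BOM must be left out of the string literal. Note that a wchar_t literal holds UTF-16 code units only on platforms where wchar_t is 16 bits wide, such as Windows. The UTF-16 string utf16str can then be converted into a UTF-8 encoded string (and vice versa), for instance with the UTF-8 CPP library available on SourceForge.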
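
If the raw bytes from the hex dump need to be compared against a literal without depending on the width of wchar_t, one option is to decode them by hand. The following is only a sketch under assumptions: `decode_utf16_bytes` is a hypothetical helper, it uses `std::u16string` rather than the `std::wstring` from the answer, it treats input without a BOM as big-endian, and it silently drops a trailing odd byte.

```cpp
#include <cassert>   // only needed for the usage check mentioned below
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: turns raw UTF-16 bytes into UTF-16 code units,
// using a leading BOM (FE FF or FF FE) to pick the byte order and skipping it.
std::u16string decode_utf16_bytes(const std::vector<unsigned char>& bytes) {
    bool big_endian = true;   // assumption: no BOM means big-endian
    std::size_t pos = 0;
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            pos = 2;                      // UTF-16BE BOM
        } else if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            big_endian = false;           // UTF-16LE BOM
            pos = 2;
        }
    }
    std::u16string out;
    out.reserve((bytes.size() - pos) / 2);
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned hi = big_endian ? bytes[pos] : bytes[pos + 1];
        const unsigned lo = big_endian ? bytes[pos + 1] : bytes[pos];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

With the first bytes of the file from the question, `decode_utf16_bytes({0xFE, 0xFF, 0x05, 0x31, 0x05, 0x65})` can be compared directly against `u"\u0531\u0565"`, for example inside an `assert`.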

j00hi
  • If C++11 or above is allowed, the standard library alone is all that is needed to convert between UTF-8 and UTF-16. See http://stackoverflow.com/a/18597384/6345 for reference. – Johann Gerell Aug 03 '15 at 12:45
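
A standard-library-only conversion along the lines of that comment might look like the sketch below. It is not taken from the linked answer: the function names are made up for illustration, it works on `std::u16string` rather than `std::wstring`, and `std::wstring_convert` and `std::codecvt_utf8_utf16` are deprecated since C++17, so treat it as one possible shape rather than the recommended one.

```cpp
#include <cassert>   // only needed for the usage check mentioned below
#include <codecvt>   // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>    // std::wstring_convert (deprecated since C++17)
#include <string>

// Hypothetical helpers: convert between UTF-16 code units and UTF-8 bytes.
// Both throw std::range_error if the input is not valid in its encoding.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}

std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}
```

As a quick check, U+0531 (the first character of the file in the question) is encoded in UTF-8 as the two bytes D4 B1, so `assert(utf16_to_utf8(u"\u0531") == "\xD4\xB1")` should hold.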