I have a UTF-16 encoded file (or UCS-2; the difference doesn't really matter here, since it is UTF-16 LE as far as I know), which I downloaded from here: http://www.humancomp.org
I'd like to read the contents of that file into a std::wstring, which is my first problem: I haven't been able to read the file correctly yet. The data I read always seems to be messed up.
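For reference, here is a minimal sketch of one way to do such a read: open the file in binary mode, look at the byte order mark (BOM), and assemble the 16-bit code units by hand. The names decode_utf16 and read_utf16_file are hypothetical helpers, not part of any library. The sketch assumes big-endian when no BOM is present. Note that a file whose first two bytes are FE FF (as in the hex dump further down) carries the big-endian BOM, so it would be UTF-16 BE rather than LE.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: decode raw UTF-16 bytes into a std::wstring.
// A leading BOM selects the byte order and is dropped from the result;
// without a BOM, big-endian is assumed (an assumption of this sketch).
std::wstring decode_utf16(const std::string& bytes) {
    std::size_t pos = 0;
    bool big_endian = true;
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF) { big_endian = true;  pos = 2; }
        else if (b0 == 0xFF && b1 == 0xFE) { big_endian = false; pos = 2; }
    }

    std::wstring out;
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned hi = static_cast<unsigned char>(bytes[big_endian ? pos : pos + 1]);
        const unsigned lo = static_cast<unsigned char>(bytes[big_endian ? pos + 1 : pos]);
        const unsigned unit = (hi << 8) | lo;

        // Where wchar_t is wider than 16 bits, fold a surrogate pair
        // into a single code point; on 16-bit wchar_t keep both units.
        if (sizeof(wchar_t) > 2 && !out.empty() && unit >= 0xDC00 && unit <= 0xDFFF) {
            const unsigned long prev = static_cast<unsigned long>(out.back());
            if (prev >= 0xD800 && prev <= 0xDBFF) {
                out.back() = static_cast<wchar_t>(
                    0x10000 + ((prev - 0xD800) << 10) + (unit - 0xDC00));
                continue;
            }
        }
        out.push_back(static_cast<wchar_t>(unit));
    }
    return out;
}

// Hypothetical helper: read the whole file as bytes, then decode.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf16_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decode_utf16(bytes);
}
```

Reading in binary mode matters here: a text-mode stream may translate the 0D 0A byte pairs and corrupt the 16-bit units.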
Secondly, I'd like to compare the resulting std::wstring to a const wchar_t* string literal. And here I run into my second problem: how do I specify the wchar_t content via hex values?
The file which I want to turn into a const wchar_t* string literal has the following bytes (copied out of a hex editor):
FE FF 05 31 05 65 05 81 05 65 05 70 05 6B 00 20 05 6B 05 74 00 20 05 6C 05 61 05 7E 00 20 00 3F 05 82 05 72 05 6B 05 65 00 20 05 6C 05 61 05 7E 05 61 05 80 05 61 05 80 00 2C 00 0D 00 0A 05 3F 05 75 05 61 05 65 05 62 05 7D 00 20 05 79 05 7F 05 61 05 75 05 6B 00 20 05 6F 05 61 05 7D 05 6F 05 61 05 6E 05 6B 00 20 05 74 05 70 05 63 05 6B 05 65 00 2E 00 2E 00 2E 00 0D 00 0A 05 31 05 75 05 65 05 7A 05 70 05 7D 00 20 05 6F 00 3F 05 82 05 66 05 70 05 6B 00 20 05 74 05 70 05 6F 05 65 00 20 05 6B 05 65 05 6E 00 20 00 3F 05 61 05 7E 05 61 05 7F 05 80 00 2C 00 0D 00 0A 05 31 05 75 05 65 05 7A 05 70 05 7D 00 20 05 6F 00 3F 05 82 05 66 05 70 05 6B 00 20 00 3F 05 61 05 7E 05 61 05 7F 05 61 05 6C 00 20 05 74 05 70 05 6F 05 6B 05 65 05 89
Of course, I can't initialize the string literal with that. I tried to turn it into hex values and apply a reinterpret_cast to get a const wchar_t*:
reinterpret_cast<const wchar_t*>("\xFE\xFF\x05\x31\x05\x65\x05\x81\x05\x65\x05\x70\x05\x6B\x00\x20\x05\x6B\x05\x74\x00\x20\x05\x6C\x05\x61\x05\x7E\x00\x20\x00\x3F\x05\x82\x05\x72\x05\x6B\x05\x65\x00\x20\x05\x6C\x05\x61\x05\x7E\x05\x61\x05\x80\x05\x61\x05\x80\x00\x2C\x00\x0D\x00\x0A\x05\x3F\x05\x75\x05\x61\x05\x65\x05\x62\x05\x7D\x00\x20\x05\x79\x05\x7F\x05\x61\x05\x75\x05\x6B\x00\x20\x05\x6F\x05\x61\x05\x7D\x05\x6F\x05\x61\x05\x6E\x05\x6B\x00\x20\x05\x74\x05\x70\x05\x63\x05\x6B\x05\x65\x00\x2E\x00\x2E\x00\x2E\x00\x0D\x00\x0A\x05\x31\x05\x75\x05\x65\x05\x7A\x05\x70\x05\x7D\x00\x20\x05\x6F\x00\x3F\x05\x82\x05\x66\x05\x70\x05\x6B\x00\x20\x05\x74\x05\x70\x05\x6F\x05\x65\x00\x20\x05\x6B\x05\x65\x05\x6E\x00\x20\x00\x3F\x05\x61\x05\x7E\x05\x61\x05\x7F\x05\x80\x00\x2C\x00\x0D\x00\x0A\x05\x31\x05\x75\x05\x65\x05\x7A\x05\x70\x05\x7D\x00\x20\x05\x6F\x00\x3F\x05\x82\x05\x66\x05\x70\x05\x6B\x00\x20\x00\x3F\x05\x61\x05\x7E\x05\x61\x05\x7F\x05\x61\x05\x6C\x00\x20\x05\x74\x05\x70\x05\x6F\x05\x6B\x05\x65\x05\x89");
but this doesn't work. It gives me bogus data.
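One likely reason for the bogus data, sketched below: the cast makes the program read each byte pair in the host's byte order, while the dump stores the high byte first. On a little-endian machine the pair 05 31 is then seen as 0x3105 instead of 0x0531, and where wchar_t is 4 bytes wide two pairs are merged into one element. The function names are hypothetical and only illustrate the difference.

```cpp
#include <cstdint>
#include <cstring>

// What a reinterpret_cast-style read effectively sees:
// the two bytes interpreted in the host's native byte order.
std::uint16_t load_native(const unsigned char* p) {
    std::uint16_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// What the bytes in the dump mean: the first byte is the
// high half of the code unit (big-endian).
std::uint16_t load_big_endian(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}
```

Composing the value with shifts gives the same result on every host, which is why it is usually preferred over casting the buffer.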
I've also tried to create a wchar_t string literal directly:
L"\xFEFF\x0531\x0565\x0581\x0565\x0570\x056B\x0020\x056B\x0574\x0020\x056C\x0561\x057E\x0020\x003F\x0582\x0572\x056B\x0565\x0020\x056C\x0561\x057E\x0561\x0580\x0561\x0580\x002C\x000D\x000A\x053F\x0575\x0561\x0565\x0562\x057D\x0020\x0579\x057F\x0561\x0575\x056B\x0020\x056F\x0561\x057D\x056F\x0561\x056E\x056B\x0020\x0574\x0570\x0563\x056B\x0565\x002E\x002E\x002E\x000D\x000A\x0531\x0575\x0565\x057A\x0570\x057D\x0020\x056F\x003F\x0582\x0566\x0570\x056B\x0020\x0574\x0570\x056F\x0565\x0020\x056B\x0565\x056E\x0020\x003F\x0561\x057E\x0561\x057F\x0580\x002C\x000D\x000A\x0531\x0575\x0565\x057A\x0570\x057D\x0020\x056F\x003F\x0582\x0566\x0570\x056B\x0020\x003F\x0561\x057E\x0561\x057F\x0561\x056C\x0020\x0574\x0570\x056F\x056B\x0565\x0589"
This, again, results in bogus data. I'm not even sure whether this is the correct way of specifying wchar_t data - combining 2 bytes?
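As a hedged illustration of the literal syntax: one escape per code unit is a valid way to write a wide literal, and \u universal character names are an alternative that does not depend on how wide wchar_t is. The sketch below uses only the first six code units from the dump; the names kFirstWordHex, kFirstWordUcn and same_text are hypothetical. It assumes the BOM (FE FF) is left out of the literal, since it marks the byte order of the file rather than being part of the text.

```cpp
#include <string>

// First six code units from the hex dump, written with hex escapes.
// A \x escape consumes every hex digit that follows it, so it must not
// be followed directly by a literal character that is a hex digit.
const wchar_t* const kFirstWordHex = L"\x0531\x0565\x0581\x0565\x0570\x056B";

// The same text written with universal character names.
const wchar_t* const kFirstWordUcn = L"\u0531\u0565\u0581\u0565\u0570\u056B";

// Hypothetical helper: compare a std::wstring against a wide literal.
bool same_text(const std::wstring& text, const wchar_t* literal) {
    return text == literal;
}
```

If the string read from the file still starts with U+FEFF, the comparison fails even when the rest matches, so either strip the BOM when reading or include it on both sides.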