My goal is to convert external input sources to a common UTF-8 internal encoding, since UTF-8 is compact and compatible with many libraries I use (such as RE2). Since I never need to slice strings except when they are pure ASCII, UTF-8 is an ideal format for me. One of the external input formats I need to decode is UTF-16.
In order to test UTF-16 (either big-endian or little-endian) reading in C++, I converted a test UTF-8 file to both UTF-16LE and UTF-16BE. The file is simple gibberish in CSV format, mixing many source languages (English, French, Japanese, Korean, Arabic, emoji, Thai) so that it is reasonably complex:
"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,",""
UTF-8 Example
Now, parsing this file encoded in UTF-8 with the following code produces the expected output (I understand this example is somewhat artificial: since my system encoding is already UTF-8, no actual conversion to wide characters and then back to bytes is required):
#include <sstream>
#include <locale>
#include <iostream>
#include <fstream>
#include <codecvt>
#include <string>

// Read the whole file into a wide string, decoding the bytes as UTF-8.
std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename, std::ios::binary);
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf8<wchar_t, 0x10ffff>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}

int main()
{
    std::wstring read = readFile("utf-8.csv");
    std::cout << read.size() << std::endl;

    // Convert the wide string back to UTF-8 bytes for printing.
    using convert_type = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_type, wchar_t> converter;
    std::string converted_str = converter.to_bytes(read);
    std::cout << converted_str;
    return 0;
}
When this program is compiled and run (on Linux, so the system encoding is UTF-8), I get the following output:
$ g++ utf8.cpp -o utf8 -std=c++14
$ ./utf8
73
"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,",""
UTF-16 Example
However, when I attempt a similar example with UTF-16, the result is truncated, despite the file loading properly in text editors, Python, etc.
#include <fstream>
#include <sstream>
#include <iostream>
#include <locale>
#include <codecvt>
#include <string>

// Read the whole file into a wide string, decoding the bytes as UTF-16.
std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename, std::ios::binary);
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}

int main()
{
    std::wstring read = readFile("utf-16.csv");
    std::cout << read.size() << std::endl;

    // Convert the wide string to UTF-8 bytes for printing.
    using convert_type = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_type, wchar_t> converter;
    std::string converted_str = converter.to_bytes(read);
    std::cout << converted_str;
    return 0;
}
When this program is compiled and run (again on Linux, so the system encoding is UTF-8), I get the following output for the little-endian file:
$ g++ utf16.cpp -o utf16 -std=c++14
$ ./utf16
19
"This","PO
For the big-endian format, I get the following:
$ g++ utf16.cpp -o utf16 -std=c++14
$ ./utf16
19
"This","OP
Interestingly, the first CJK character, 佐 (U+4F50), lies in the Basic Multilingual Plane, yet it is clearly not converted properly, and the output is truncated early. The same issue occurs with a line-by-line approach.
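For reference, the line-by-line variant is essentially the following fragment (same headers and stream setup as readFile above; only the reading loop differs):

// A sketch of the line-by-line variant; the stream setup matches readFile above.
std::wifstream wif("utf-16.csv", std::ios::binary);
wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff>));

std::wstring line;
while (std::getline(wif, line))
{
    // Each line comes back truncated in the same way as the bulk read above.
    std::cout << line.size() << std::endl;
}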
Other Resources
I checked the following resources beforehand, most notably this answer, as well as this answer. None of their solutions have proven fruitful for me.
Other Specifics
LANG = en_US.UTF-8
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2)
If any other details would be helpful, I will be happy to provide them. Thank you.
EDITS
Adrian mentioned in the comments that I should provide a hexdump, which is shown below for "utf-16le", the little-endian UTF-16-encoded file:
0000000 0022 0054 0068 0069 0073 0022 002c 0022
0000010 4f50 85e4 0020 5e79 592b 0022 002c 0022
0000020 004d 00ea 006d 0065 0073 0022 002c 0022
0000030 ce5c ad6c 0022 000a 0022 0e20 0e04 0e27
0000040 0e32 0022 002c 0022 0020 0643 064a 0628
0000050 0648 0631 062f 0020 0644 0644 0643 062a
0000060 0627 0628 0629 0020 0628 0627 0644 0639
0000070 0631 0628 064a 0022 002c 0022 30a6 30a5
0000080 30ad 30e5 002c 0022 002c 0022 d83d dec2
0000090 0022 000a
0000094
qexyn mentioned removing the std::ios::binary flag, which I attempted, but it changed nothing.
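Concretely, that attempt just dropped the flag when constructing the stream; everything else stayed the same:

std::wifstream wif(filename);  // no std::ios::binary; the output was identical
wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff>));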
Finally, I tried iconv to check whether these were valid files, using both the command-line utility and the C library.
$ iconv -f="UTF-16BE" -t="UTF-8" utf-16be.csv
"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,",""
Clearly, iconv has no issue with the source files. This is pushing me toward iconv, since it is cross-platform, easy to use, and well tested; but if anyone has an answer using the standard library, I will gladly accept it.
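For completeness, my test with the iconv C library looked roughly like the sketch below (the buffer sizing and error handling are simplified placeholders, not my exact code):

#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Convert a buffer of UTF-16LE bytes to UTF-8 using the iconv C API.
std::string utf16le_to_utf8(const std::string& input)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Each 2-byte UTF-16 unit expands to at most 3 UTF-8 bytes, so 2x is ample.
    std::vector<char> output(input.size() * 2 + 4);
    char* inbuf = const_cast<char*>(input.data());
    size_t inleft = input.size();
    char* outbuf = output.data();
    size_t outleft = output.size();

    // iconv advances both pointers and decrements both counts as it converts.
    size_t result = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    if (result == (size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    return std::string(output.data(), output.size() - outleft);
}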