Below I will try to print the string XЯ𐤈 (Latin "ex", Cyrillic "ya", and Phoenician "teth") to terminals with various encodings, namely utf8, cp1251 and C (POSIX). I expect to see XЯ𐤈 in the utf8 terminal, XЯ? in the cp1251 terminal, and X?? in the C (POSIX) terminal. The question marks appear because the C++ output library replaces characters it cannot represent with ?. This is correct and expected behavior.
(1) My first, naive attempt was to just print the wide-character string to std::wcout:
wchar_t str[] = L"\U00000058\U0000042f\U00010908";
std::wcout << str << std::endl;
// utf8: X??
// cp1251: X??
// C: X??
In all terminals, only the first, plain-ASCII character was printed correctly; the other characters were replaced with '?' marks. It turned out that this happens because a program always starts in the plain "C" locale, no matter what the environment says.
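This is easy to verify with nothing beyond the standard library: passing NULL to std::setlocale() queries the current locale without changing it. A minimal check:

#include <clocale>
#include <cstdio>

int main() {
    // NULL means "query only": at startup this prints "C",
    // regardless of LANG/LC_ALL in the environment
    std::printf("startup locale: %s\n", std::setlocale(LC_ALL, NULL));
}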
(2) The second attempt was to manually call std::setlocale() with a UTF-8 locale:
wchar_t str[] = L"\U00000058\U0000042f\U00010908";
std::setlocale(LC_ALL, "en_US.UTF-8");
std::wcout << str << std::endl;
// utf8: XЯ𐤈
// cp1251: XРЇрђ¤€
// C: XÐ¯ð¤
Obviously, this worked correctly in the utf8 terminal, but it produced garbage in the other two terminals.
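The garbage is simply the UTF-8 byte sequence for XЯ𐤈 being drawn byte-by-byte in the terminal's own charset. A minimal sketch to see the raw bytes (the values are the standard UTF-8 encodings of the three code points):

#include <cstdio>

int main() {
    // UTF-8 encoding of X (U+0058), Я (U+042F), 𐤈 (U+10908):
    // one, two and four bytes respectively
    const unsigned char bytes[] = { 0x58, 0xd0, 0xaf, 0xf0, 0x90, 0xa4, 0x88 };
    for (unsigned i = 0; i < sizeof(bytes); ++i)
        std::printf("%02x ", bytes[i]);   // prints: 58 d0 af f0 90 a4 88
    std::printf("\n");
}

A cp1251 terminal renders each of those bytes as a character of its own, hence the mojibake.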
(3) The third attempt was to parse the $LANG environment variable for the actual encoding used by the terminal (and to hope that all pieces of the terminal use the same encoding):
const char* lang = std::getenv("LANG");
if (!lang) {
    std::cerr << "Couldn't get LANG" << std::endl;
    exit(1);
}

wchar_t str[] = L"\U00000058\U0000042f\U00010908";
std::setlocale(LC_ALL, lang);
std::wcout << str << std::endl;
// utf8: XЯ𐤈
// cp1251: XЯ?
// C: X??
Now the output in all three terminals was as I expected. However, mixing std::cout and std::wcout is a bad idea, and std::cout is definitely used by some third-party libraries in my program. This makes std::wcout unusable.
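The root of the problem is the orientation of the underlying C stream: the first I/O operation on stdout fixes it as either byte-oriented or wide-oriented, and mixing the two families on one stream is undefined by the C standard (in practice, the latecomer typically just fails). A minimal sketch of the conflict, using std::fwide() only to query the orientation:

#include <cstdio>
#include <cwchar>
#include <iostream>

int main() {
    std::cout << "narrow" << std::endl;   // stdout becomes byte-oriented here
    std::wcout << L"wide" << std::endl;   // wide output on a byte-oriented stream is undefined
    // fwide(stream, 0) queries without changing:
    // negative = byte-oriented, positive = wide-oriented, 0 = undecided
    std::printf("orientation: %d\n", std::fwide(stdout, 0));
}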
(4) So, the fourth attempt (or, actually, idea) was to detect the terminal encoding from $LANG, use std::codecvt to convert the wchar_t[] string into the terminal encoding, and print it with ordinary std::cout.write(). Unfortunately, I couldn't find a way to explicitly set the target encoding for a std::codecvt facet.
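For reference, the closest standard facility I know of is C++11's std::wstring_convert, but its ready-made facets are hard-wired to the UTF encodings; there is no standard facet for something like cp1251, so it doesn't solve this either. A sketch of the UTF-8-only variant (deprecated since C++17):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // the target encoding is baked into the facet type: UTF-8 and nothing else
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    std::string utf8 = conv.to_bytes(L"\U00000058\U0000042f\U00010908");
    std::cout.write(utf8.data(), utf8.size());
    std::cout << std::endl;
}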
(5) The fifth, and so far the best, attempt was to use iconv() manually:
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>

#include <iconv.h>

// get $LANG env var
const char* lang = std::getenv("LANG");
if (!lang) {
    std::cerr << "Couldn't get $LANG" << std::endl;
    exit(1);
}

// find out the encoding from $LANG, e.g. "UTF-8", "CP1251", etc.
std::string enc(lang);
size_t pos = enc.rfind('.');
if (pos != std::string::npos) {
    enc = enc.substr(pos + 1);
}
// the C/POSIX locale doesn't name a usable charset; fall back to a
// single-byte encoding so that at least ASCII passes through
if (enc == "C" || enc == "POSIX") {
    enc = "iso8859-1";
}

// convert the wchar_t[] string into the terminal encoding;
// on Linux wchar_t holds UTF-32 in host byte order, assumed little-endian here
wchar_t str[] = L"\U00000058\U0000042f\U00010908";
iconv_t handler = iconv_open(enc.c_str(), "UTF-32LE");
if (handler == (iconv_t)-1) {
    std::cerr << "Couldn't create iconv handler: " << strerror(errno) << std::endl;
    exit(1);
}

char buf[1024];
char* inbuf = (char*)str;
size_t inbytes = sizeof(str) - sizeof(wchar_t); // don't convert the trailing L'\0'
char* outbuf = buf;
size_t outbytes = sizeof(buf);
while (true) {
    size_t res = iconv(handler, &inbuf, &inbytes, &outbuf, &outbytes);
    if (res != (size_t)-1) {
        break;
    }
    if (errno == EILSEQ) {
        // replace the non-convertible code point with a question mark
        // (encoded as little-endian UTF-32) and retry iconv()
        inbuf[0] = '\x3f';
        inbuf[1] = '\x00';
        inbuf[2] = '\x00';
        inbuf[3] = '\x00';
    } else {
        std::cerr << "iconv() failed: " << strerror(errno) << std::endl;
        exit(1);
    }
}
iconv_close(handler);

// write the converted string to std::cout
std::cout.write(buf, sizeof(buf) - outbytes);
std::cout << std::endl;

// utf8: XЯ𐤈
// cp1251: XЯ?
// C: X??
This worked correctly in all three terminals. And now I also don't have to worry that std::cout is used in other parts of the program. However, this solution doesn't feel like the C++ way to me.
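At the very least, the raw handle can be hidden behind a small RAII guard so that iconv_close() runs on every exit path (the class and its names here are my own, just a sketch):

#include <iconv.h>

class IconvHandle {
public:
    IconvHandle(const char* to, const char* from)
        : h_(iconv_open(to, from)) {}
    ~IconvHandle() {
        if (h_ != (iconv_t)-1) {
            iconv_close(h_);
        }
    }
    bool valid() const { return h_ != (iconv_t)-1; }
    iconv_t get() const { return h_; }

private:
    // non-copyable (C++03 style)
    IconvHandle(const IconvHandle&);
    IconvHandle& operator=(const IconvHandle&);

    iconv_t h_;
};

But the conversion loop itself stays plain C either way.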
So, the question is: what is the correct way to print wide strings in C++? I would be fine with a platform-specific solution (Linux + glibc + GCC).