Below I try to print the string XЯ𐤈 (Latin "ex", Cyrillic "ya", and Phoenician "teth") to terminals with various encodings, namely utf8, cp1251 and C (POSIX). I expect to see XЯ𐤈 in the utf8 terminal, XЯ? in the cp1251 terminal, and X?? in the C (POSIX) terminal. The question marks appear because the C++ output library replaces characters it cannot represent with ?. This is correct and expected behavior.

(1) My first, naive attempt was to just print a wide character string to std::wcout:

wchar_t str[] = L"\U00000058\U0000042f\U00010908";
std::wcout << str << std::endl;
// utf8 terminal output: X??
// cp1251: X??
// C: X??

In all terminals, only the first character, which is 7-bit ASCII, was printed correctly. The other characters were replaced with '?' marks. It turned out that this happens because every program starts in the "C" locale, regardless of the environment, until it calls setlocale().
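The startup locale can be checked directly. This is a minimal sketch; the helper name current_locale_name is made up for illustration. Passing a null pointer to std::setlocale() only queries the current locale and does not change it.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: name of the C library's current locale.
// A null second argument makes setlocale() a pure query.
std::string current_locale_name() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}
```

Called before any other setlocale() call, this should report "C" whatever $LANG or $LC_ALL contain, which is why std::wcout falls back to '?' for everything outside ASCII.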

(2) The second attempt was to manually call std::setlocale() with a utf8 locale:

wchar_t str[] = L"\U00000058\U0000042f\U00010908";
std::setlocale(LC_ALL, "en_US.UTF-8");
std::wcout << str << std::endl;
// utf8: XЯ𐤈
// cp1251: XЯ𐤈
// C: XЯð¤

Obviously, this worked correctly in the utf8 terminal, but produced garbage in the other two terminals.

(3) The third attempt was to read the $LANG environment variable to get the encoding actually used by the terminal (and hope that all pieces of the terminal use the same encoding):

const char* lang = std::getenv("LANG");
if (!lang) {
  std::cerr << "Couldn't get LANG" << std::endl;
  exit(1);
}

wchar_t str[] = L"\U00000058\U0000042f\U00010908";
std::setlocale(LC_ALL, lang);
std::wcout << str << std::endl;
// utf8: XЯ𐤈
// cp1251: XЯ?
// C: X??

Now the output in all three terminals was as I expected. However, mixing std::cout and std::wcout on the same underlying stream is a bad idea, and std::cout is definitely used by some third-party libraries in my program. This makes std::wcout unusable.

(4) So, the fourth attempt (or, actually, idea) was to detect the terminal encoding from $LANG, use codecvt to convert the wchar_t[] string into the terminal encoding, and print it with ordinary std::cout.write(). Unfortunately, I couldn't find a way to explicitly set the target encoding for codecvt.
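For reference, here is a rough sketch of what idea (4) might look like, assuming the target encoding is picked implicitly by taking the std::codecvt<wchar_t, char, std::mbstate_t> facet from a std::locale object rather than set explicitly. The helper name narrow_for_locale is hypothetical. The sketch also assumes that on error the facet leaves from_next at the offending character, which is what libstdc++ on glibc appears to do.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: convert `ws` to the narrow encoding of `loc`,
// emitting '?' for every wide character that encoding cannot represent.
std::string narrow_for_locale(const std::wstring& ws, const std::locale& loc) {
    using Facet = std::codecvt<wchar_t, char, std::mbstate_t>;
    const Facet& cvt = std::use_facet<Facet>(loc);

    std::string out;
    std::mbstate_t state{};
    const wchar_t* from = ws.data();
    const wchar_t* const from_end = from + ws.size();
    char buf[64];  // converted in chunks; large enough for any single character

    while (from != from_end) {
        const wchar_t* from_next = from;
        char* to_next = buf;
        const auto res = cvt.out(state, from, from_end, from_next,
                                 buf, buf + sizeof buf, to_next);
        out.append(buf, static_cast<std::size_t>(to_next - buf));
        if (res == std::codecvt_base::error) {
            // Assumption: from_next points at the unconvertible character.
            out += '?';
            from = from_next + 1;
            state = std::mbstate_t{};
        } else if (from_next == from) {
            break;  // no progress (e.g. noconv); avoid looping forever
        } else {
            from = from_next;
        }
    }
    return out;
}
```

The result could then be written with std::cout << narrow_for_locale(str, std::locale("")), where std::locale("") takes the locale from the environment and throws std::runtime_error if that locale is not installed.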

(5) The fifth and, so far, best attempt was to call iconv() manually:

// get $LANG env var
const char* lang = std::getenv("LANG");
if (!lang) {
  std::cerr << "Couldn't get $LANG" << std::endl;
  exit(1);
}

// find out encoding from $LANG, e.g. "utf8", "cp1251", etc
std::string enc(lang);
size_t pos = enc.rfind('.');
if (pos != std::string::npos) {
  enc = enc.substr(pos + 1);
}
if (enc == "C" || enc == "POSIX") {
  enc = "iso8859-1";
}

// convert wchar_t[] string into terminal encoding
wchar_t str[] = L"\U00000058\U0000042f\U00010908";
iconv_t handler = iconv_open(enc.c_str(), "UTF32LE");
if (handler == (iconv_t)-1) {
  std::cerr << "Couldn't create iconv handler: " << strerror(errno) << std::endl;
  exit(1);
}

char buf[1024];

char* inbuf = (char*)str;
size_t inbytes = sizeof(str);
char* outbuf = buf;
size_t outbytes = sizeof(buf);

while (true) {
  size_t res = iconv(handler, &inbuf, &inbytes, &outbuf, &outbytes);
  if (res != (size_t)-1) {
    break;
  }
  if (errno == EILSEQ) {
    // replace non-convertible code point with question mark and retry iconv()
    inbuf[0] = '\x3f';
    inbuf[1] = '\x00';
    inbuf[2] = '\x00';
    inbuf[3] = '\x00';
  } else {
    std::cerr << "iconv() failed: " << strerror(errno) << std::endl;
    exit(1);
  }
}
iconv_close(handler);

// write converted string to std::cout
std::cout.write(buf, sizeof(buf) - outbytes);
std::cout << std::endl;
// utf8: XЯ𐤈
// cp1251: XЯ?
// C: X??

This worked correctly in all three terminals, and I no longer have to worry about std::cout being used in other parts of the program. However, this solution does not feel like idiomatic C++.
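As a side note on the $LANG parsing in (5): on POSIX systems the encoding name can probably be obtained without string manipulation by asking the C library with nl_langinfo(CODESET) once the locale is set. This is a sketch with a hypothetical helper name, codeset_of. Note that it changes the process-wide LC_CTYPE category as a side effect.

```cpp
#include <clocale>
#include <langinfo.h>
#include <string>

// Hypothetical helper: character encoding name used by `locale_name`.
// The default "" means "take the locale from the environment"
// (LC_ALL, then LC_CTYPE, then LANG).
// Side effect: switches the global LC_CTYPE category to that locale.
std::string codeset_of(const char* locale_name = "") {
    if (!std::setlocale(LC_CTYPE, locale_name)) {
        return {};  // the requested locale is not available
    }
    return nl_langinfo(CODESET);
}
```

On glibc the returned name (for example the one reported for the "C" locale) should be usable as the first argument of iconv_open(), which would also make the special case for "C" and "POSIX" unnecessary, though I have not verified this for every locale.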

So, the question is: what is the correct way to print wide strings in C++? I would be fine with a platform-specific solution (Linux + glibc + GCC).

gudok
  • You’re supposed to call `setlocale(LC_ALL,"")` to initialize the locale from the environment (not just `$LANG`). That doesn’t fix the `wcout` issue, of course. – Davis Herring Jan 05 '19 at 14:26
  • Thanks, by the way: I’ll add this to my list of reasons why libraries must not write to the standard streams. (It would be easy for you to provide a `char`-based wrapper for `std::wcout` if the library relied on a stream you provided.) – Davis Herring Jan 05 '19 at 15:02
