
I have very little understanding of C++ streams and their handling of Unicode, and I am trying to understand why code someone else wrote behaves this way. I'd be very grateful if someone could explain what is going on.


MCVE:

#include <string>
#include <iostream>
#include <locale>

int main() {
  std::basic_string<wchar_t> line;
  std::locale::global(std::locale("")); // This
  std::wcout.imbue(std::locale(""));    // This
  std::wcin.imbue(std::locale(""));     // This
  for (;;) {
    std::getline(std::wcin, line);
    if (std::wcin.eof()) {
      std::wcout << L"EOF" << std::endl;
      break;
    }
    std::wcout << line << std::endl;
  }
}

Sample input test.txt:

( ) ライン
second line

EDIT: Hexdump of test.txt:

$ xxd test.txt
00000000: 2820 2920 e383 a9e3 82a4 e383 b30a 7365  ( ) ..........se
00000010: 636f 6e64 206c 696e 650a                 cond line.

Results

On a CentOS server, this is the result (1):

$ ./a.out < test.txt
( ) ライン
second line
EOF

On my Mac though (2):

$ ./a.out < test.txt
( )  EOF

If I comment out the three marked locale lines, the CentOS server outputs (3):

$ ./a.out < test.txt
EOF

while Mac outputs (4):

$ ./a.out < test.txt
( ) ライン
second line
EOF

Questions

  • Why does the second (2) result detect EOF mid-line? Where does the second space before EOF come from? (This result baffles me the most.)
  • Why does the third (3) result detect EOF immediately?
  • Most importantly: What to do to always consistently get the first (1) or last result (4)?

Environment

Here is the environment for both machines:

CentOS Linux release 7.5.1804 (Core):

$ c++ --version
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

macOS Big Sur (version 11.6):

$ c++ --version
Apple clang version 12.0.5 (clang-1205.0.22.11)
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

Bonus

One additional puzzle. If I change the input to this (i.e. just add two more spaces inside the parentheses):

(   ) ライン
second line

the original (uncommented) code outputs this on Mac:

$ ./a.out < test.txt
(   )   ララライ翕翕ン
second line
EOF

Those are not artifacts of a messed-up terminal; all those extra characters are actually there:

$ ./a.out < test.txt | xxd
00000000: 2820 2020 2920 2020 e383 a9e3 83a9 e383  (   )   ........
00000010: a9e3 82a4 e7bf b7e7 bfb7 e383 b30a 7365  ..............se
00000020: 636f 6e64 206c 696e 650a 454f 460a       cond line.EOF.

Like... what?

EDIT: In response to Giacomo Catenazzi's comment, I changed the EOF printing from a narrow (char) literal to a wide one, which did fix one weirdness regarding input. My core issue is with reading from wcin, though, which turns out to be unrelated.


EDIT: Difference between std::getline and std::wcin.get

Here is the data obtained by getline. In this case, I don't get EOF, but the data is still weird:

std::wcout.imbue(std::locale("C")); // prevent commas
for (;;) {
  std::getline(std::wcin, line);
  if (std::wcin.eof()) {
    std::wcout << L"EOF" << std::endl;
    break;
  }
  int i, l = line.length();
  for (i = 0; i < l; i++) {
    wchar_t ch = line.at(i);
    std::wcout << std::hex << (int) ch << L" ";
  }
  std::wcout << std::endl;
}

Output:

28 20 29 20 20 0 30e9 30e9 30e9 30a4 30a4 30a4 30f3 
73 65 63 6f 6e 64 20 6c 69 6e 65 
EOF

Where does the 0 come from? What's with the repeated characters? The characters following the 0 translate to ララライイイン. (Note that here I do not try to output the received characters to wcout, only the numeric values, in order to eliminate any possible effects of output encoding.)

The data obtained by get is different, but no less strange:

// ...
std::wcout.imbue(std::locale("C")); // prevent commas
for (;;) {
  wchar_t ch = std::wcin.get();
  if (std::wcin.eof()) {
    std::wcout << L"EOF" << std::endl;
    break;
  }
  std::wcout << std::hex << (int) ch << L" ";
  if (std::char_traits<wchar_t>::eq(ch, std::wcin.widen('\n'))) {
    std::wcout << std::endl;
  }
}

Output:

28 20 29 20 7ffe 7ffe 30e9 7ffe 7ffe 30a4 7ffe 7ffe 30f3 a 
73 65 63 6f 6e 64 20 6c 69 6e 65 a 
EOF

This translates to 翾翾ラ翾翾イ翾翾ン. Where do those 7ffe characters come from?

Amadan
  • subtle duplicate of https://stackoverflow.com/questions/8947949/mixing-cout-and-wcout-in-same-program ? – Giacomo Catenazzi Oct 22 '21 at 06:58
  • @GiacomoCatenazzi Thanks. Changing to `std::wcout << L"EOF" << std::endl;` fixes (1). However it has no effect on (3), or on (2) which is my core issue. I will edit it in for sharper focus. – Amadan Oct 22 '21 at 07:08
  • Are you using a UTF-8 capable locale in (2) and (3)? (It is usually the default, except on minimal images such as those used by Docker.) This https://unix.stackexchange.com/questions/303712/how-can-i-enable-utf-8-support-in-the-linux-console may give you some tests (note: UTF-8 terminals are not compatible with ANSI terminals, which expect C1 control codes as a single byte). – Giacomo Catenazzi Oct 22 '21 at 07:39
  • @GiacomoCatenazzi For case (2), I am under the impression that `std::locale("")` uses the `LC_*` environment variables, which is why I provided `locale` output in both settings. Case (3)... I don't know what locale it uses, since I don't explicitly set it, but I hoped it would do the same (but obviously doesn't). Terminal should not matter, since I tested by piping input in and piping output out (`./a.out < test.txt | xxd`). – Amadan Oct 22 '21 at 07:47
  • It all seems OK. Maybe just verify with `locale -a` that the locales are installed, and maybe use the "system name" rather than the alias. On Debian I get the same output as (1), so it is difficult for me to help more. – Giacomo Catenazzi Oct 22 '21 at 08:59
  • Could you also post `xxd < test.txt`? `Why does the second (2) result detect EOF mid-line?` Is the macOS C library closed source? I am not sure we can know the answer. `Redhat outputs (3):` So, what is the locale on Red Hat? This is getting too broad pretty quickly. `What to do to always consistently get the first (1) or last result (4)?` If you want to _get the result_, use normal strings, not wide strings. Normal strings will print the input as it is. With wide strings, you _want_ to convert the input to the output encoding, which does not always work. – KamilCuk Oct 24 '21 at 19:38
  • @KamilCuk I posted hexdump of input; there should not be any weirdness there. Input should be UTF-8, output should be UTF-8, and I believe the conversion between UTF-8 and UTF-32 and vice versa (which is AFAIK what macOS uses for `wchar_t`) is deterministic and pretty simple. Re: locale, I posted both environments in the question; if there is more specific data I can (and know how to) provide, let me know. Re: using regular strings, it is not an option for this use case. – Amadan Oct 24 '21 at 21:38
  • The issue is with `libc++`. It does not handle UTF-8 in `wchar_t` streams correctly. Never use `wchar_t` streams (or any `wchar_t` facilities for that matter). If you need Unicode-specific functionality, use a third-party Unicode library. – n. m. could be an AI Oct 25 '21 at 04:48
  • Same compiler, same OS, same code, different standard libraries: [one](https://coliru.stacked-crooked.com/a/c328dc0fab9c1899) [two](https://coliru.stacked-crooked.com/a/c328dc0fab9c1899). – n. m. could be an AI Oct 25 '21 at 04:50
  • @n.1.8e9-where's-my-sharem. You posted the same link twice, but I get what you are saying. Thank you very much for the information. Do you have any link where I can learn more? Alas, the original code is not mine, so I'm not sure I can substitute `wchar_t`... – Amadan Oct 25 '21 at 04:55
  • Sorry, the correct links [one](https://coliru.stacked-crooked.com/a/c328dc0fab9c1899) [two](https://coliru.stacked-crooked.com/a/1a621f0526455cca) – n. m. could be an AI Oct 25 '21 at 05:01
  • You can modify just the I/O, leaving the rest of the wchar_t usage intact (read characters as UTF-8 with no in-stream conversion and recode them yourself). Otherwise there are not a lot of options. Either don't use libc++ (use libstdc++ instead; you need to install it and rebuild the whole project with it), or fix libc++ and send a pull request to the maintainers. There is a [bug open since 2015](https://bugs.llvm.org/show_bug.cgi?id=24929), so don't expect it to fix itself any time soon. – n. m. could be an AI Oct 25 '21 at 06:40
  • @n.1.8e9-where's-my-sharem. Thank you very much, at least I know I'm not crazy. If you put that into an answer, the bounty is yours. – Amadan Oct 25 '21 at 06:59

1 Answer


This is a libc++ bug.

Note that the bug report says it affects only std::wcin and not file streams, but in my experiments this is not the case: all wchar_t streams seem to be affected.

The other major open source implementation, libstdc++, doesn't have this bug. It is possible to sidestep the libc++ bug by building the entire application (including all dynamic libraries, if any) against libstdc++.

If this is not an option, then one way to cope with the bug is to use narrow char streams, and then, when needed, recode the characters (presumably arriving encoded as UTF-8) to wchar_t (presumably UCS-4) separately. Another way is to get rid of wchar_t altogether and work in UTF-8 throughout the program, which is probably better in the long run.
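The narrow-stream workaround could look roughly like the sketch below. The helper names utf8_to_wide and wide_to_utf8 are hypothetical (they are not part of any library or of the original code), and the sketch assumes that wchar_t is at least 32 bits wide and holds one Unicode code point per element, as on Linux and macOS. It is deliberately minimal: it rejects malformed lead and continuation bytes but does not check for overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into wchar_t code points.
// Assumption: wchar_t is 32 bits wide (true on Linux and macOS, not on Windows).
std::wstring utf8_to_wide(const std::string& in) {
  static_assert(sizeof(wchar_t) >= 4, "this sketch assumes a 32-bit wchar_t");
  std::wstring out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    std::uint32_t cp = 0;
    std::size_t extra = 0;  // number of continuation bytes that must follow
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else throw std::runtime_error("invalid UTF-8 lead byte");
    if (i + extra >= in.size()) throw std::runtime_error("truncated UTF-8 sequence");
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char cont = static_cast<unsigned char>(in[i + k]);
      if ((cont & 0xC0) != 0x80) throw std::runtime_error("invalid UTF-8 continuation byte");
      cp = (cp << 6) | (cont & 0x3F);  // append the low six payload bits
    }
    out.push_back(static_cast<wchar_t>(cp));
    i += extra + 1;
  }
  return out;
}

// Hypothetical helper: encode wchar_t code points back into UTF-8 bytes.
std::string wide_to_utf8(const std::wstring& in) {
  std::string out;
  for (wchar_t wc : in) {
    const std::uint32_t cp = static_cast<std::uint32_t>(wc);
    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      throw std::runtime_error("code point out of Unicode range");
    }
  }
  return out;
}
```

With helpers like these, the read loop from the question would call std::getline on std::cin with a std::string, pass each line through utf8_to_wide before handing it to the existing wchar_t code, and write results with std::cout after wide_to_utf8. No locale needs to be imbued on the streams, because the narrow streams pass the bytes through unconverted.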

n. m. could be an AI