
What are the proper facilities to be using for full unicode in C++?

For example, I have tried:

#include <iostream>
#include <string>

int main()
{
    std::wstring name;
    std::wcout << "Enter unicode: " << std::endl;
    std::getline(std::wcin, name);

    std::wcout << name << std::endl;

    return 0;
}

And it doesn't work as I would expect when entering characters that are not in the Unicode BMP: I get an empty line printed out.

A plain std::string works for any code points up to 16 bits, but wstring, wcin, and wcout just don't work as I'd expect, and some Googling hasn't helped me see what could be wrong here.

EDIT (file I/O also has issues!):

I wondered if this could have something to do with the console I/O itself and wanted to try the same with file I/O as an experiment. I looked into the APIs and came up with this, which compiles and runs fine:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::string filename;
    std::cout << "Enter file to append to: " << std::endl;
    std::getline(std::cin, filename);

    std::wifstream file;
    std::wstringstream buff;
    file.open(filename);
    std::wstring txt;
    buff << file.rdbuf();
    file.close();
    txt = buff.str();

    std::wcout << txt << std::endl;

    return 0;
}

But when I point it to my file with mostly lorem ipsum and a few non-BMP characters, it prints the file up to the first non-BMP character and then stops early. Can the Unicode facilities in modern C++ really be this bad?

I'm sure someone knows something basic I am missing here...

Thomas
  • possible duplicate of https://stackoverflow.com/questions/3207704/how-can-i-cin-and-cout-some-unicode-text – Sahib Yar Aug 22 '17 at 05:10
  • 1
    Thanks Sahib. I am on debian and looking for the platform neutral solution, plain C++ solution. This link appears to provide a very windows-specific answer. – Thomas Aug 22 '17 at 05:34
  • 1
    You should use L prefix for Unicode string literals: std::wcout << L"Enter unicode: " << std::endl; – Asesh Aug 22 '17 at 06:00
  • Okay, good tip. Doesn't affect my bug in this case. – Thomas Aug 22 '17 at 06:21
  • Related reading: https://stackoverflow.com/q/17103925/1025391 and https://stackoverflow.com/q/402283/1025391 – moooeeeep Aug 22 '17 at 07:21
  • `std::wstring` is not Unicode. Not any more than `std::string`. Or `char*`. Or `int*`. – rubenvb Aug 22 '17 at 07:23
  • 1
    It indeed depends ultimately on the output stream. Windows, for example, by default, doesn't support the full set of unicode characters in the standard console. https://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how/388500#388500 so you'll have to add some platform dependent tweaks. – Simon Mourier Aug 22 '17 at 07:45
  • How did you examine the file to determine it does not work? What is `sizeof(wchar_t)`? – Yakk - Adam Nevraumont Aug 22 '17 at 07:59
  • I was using the console. In the second case, cat and emacs24 both showed the issue. – Thomas Aug 26 '17 at 04:39

1 Answer


You are in the gray zone of C++ Unicode. Unicode started as an extension of the 7-bit ASCII character set (and of various multi-byte character sets) to plain 16-bit characters, which later became the BMP. Those 16-bit characters were adopted natively by languages like Java and systems like Windows. C and C++, being more conservative from a standards point of view, decided that wchar_t would be an implementation-dependent wide character type that could be 16 or 32 bits wide (or even more) depending on requirements. The good side was that it was extensible; the dark side was that it was never made clear how non-BMP Unicode characters should be represented when wchar_t is only 16 bits.
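A quick way to see which choice your implementation made is to print the sizes of the character types (typical values shown in the comments; only char16_t and char32_t have guaranteed widths):

#include <iostream>

int main()
{
    // wchar_t is 2 bytes (UTF-16 code units) on Windows,
    // 4 bytes (UTF-32) on most Unix-like systems such as Linux/glibc.
    std::cout << "sizeof(wchar_t)  = " << sizeof(wchar_t)  << '\n';
    std::cout << "sizeof(char16_t) = " << sizeof(char16_t) << '\n'; // 16-bit code units
    std::cout << "sizeof(char32_t) = " << sizeof(char32_t) << '\n'; // 32-bit code units
}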

UTF-16 was then created to allow a standard representation of those non-BMP characters, with the downside that each of them needs two 16-bit code units, and that std::char_traits<wchar_t>::length is again wrong if any of them are present in a wstring.
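The length mismatch is easy to demonstrate with a u16string (U+1F600 is used here as an arbitrary non-BMP example character):

#include <cassert>
#include <string>

int main()
{
    // U+1F600 lies outside the BMP, so UTF-16 encodes it as a surrogate pair:
    // one character, but two 16-bit code units.
    std::u16string emoji = u"\U0001F600";
    assert(emoji.size() == 2);

    // A BMP character such as U+00E9 ('é') is a single code unit.
    std::u16string e = u"\u00E9";
    assert(e.size() == 1);
}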

That is the reason why most C++ implementations chose to have wchar_t basic I/O correctly process only BMP Unicode characters, so that length returns a true number of characters.

The C++-ish way is to use char32_t-based strings when full Unicode support is required. In fact wstring and wchar_t (prefix L for literals) are implementation-dependent types, and since C++11 you also have char16_t and u16string (prefix u), which explicitly use UTF-16, and char32_t and u32string (prefix U), which give full Unicode support through UTF-32. The problem with storing characters outside the BMP in a u16string is that you lose the property size of string == number of characters, which was a key reason for using wide characters instead of multi-byte characters.
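A small sketch of the difference: in UTF-32 every code point is one char32_t element, so size() counts characters again, while the UTF-16 version of the same text needs an extra code unit for the non-BMP character:

#include <cassert>
#include <string>

int main()
{
    // Three characters: 'a', 'é' (U+00E9), and a non-BMP character (U+1F600).
    std::u32string s32 = U"a\u00E9\U0001F600";
    assert(s32.size() == 3); // one element per code point

    std::u16string s16 = u"a\u00E9\U0001F600";
    assert(s16.size() == 4); // the non-BMP character takes a surrogate pair
}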

One problem for u32string is that the I/O library still has no direct specialization for 32-bit characters, but as the converters have one, you can probably use them easily when you process files with a std::basic_fstream<char32_t> (untested, but according to the standard it should work). You will have no standard streams for cin, cout and cerr, however, and will probably have to process the native form in a string or u16string, and then convert everything to u32string, either with the help of the standard converters in <codecvt> (deprecated since C++17), or the hard way by decoding by hand.
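As a sketch of the converter route: std::wstring_convert with std::codecvt_utf8<char32_t> turns a UTF-8 byte sequence into a u32string whose length is the number of code points. Note that this facility works but has been deprecated since C++17, so expect compiler warnings on newer standards:

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // UTF-8 bytes for 'é' (U+00E9) followed by a non-BMP character (U+1F600).
    std::string utf8 = "\xC3\xA9\xF0\x9F\x98\x80";

    // Convert the byte stream to UTF-32: one char32_t per code point.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string utf32 = conv.from_bytes(utf8);

    return utf32.size() == 2 ? 0 : 1; // two code points from six bytes
}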

The really dark side is that this native part currently depends on the OS, so you will not be able to set up a fully portable way to process full Unicode - or at least I know of none.

Serge Ballesta
  • 1
    `std::u16string` is not limited to the BMP (any more than `std::string` is when using UTF-8, or `std::wstring` is when using UTF-16/32, depending on the size of `wchar_t`). `char16_t` and `std::u16string` are specifically designed for UTF-16 (the `u""` string prefix returns a full UTF-16 encoded `std::u16string`, not a UCS-2 encoded one like you are implying). UTF-16 handles the entire Unicode repertoire. `char32_t` and `std::u32string` are designed for UTF-32, which also handles the entire Unicode repertoire. – Remy Lebeau Aug 23 '17 at 02:35
  • @RemyLebeau: Thanks for commenting! What I meant is that when you use UTF-16 encoded strings in u16string, you lose the property *length == number of chars*, which was a key reason for using wide characters instead of multi-byte ones. Hope it is clearer now. – Serge Ballesta Aug 23 '17 at 07:48
  • the *length == number of chars* property was lost decades ago with the invention of MBCS charsets. Only code that deals exclusively in English and other Latin-based languages could ever rely on that property. True international apps haven't used that property in a long time. That being said, the majority of modern languages fit in the BMP, so the property *usually* holds up in most texts. Asian languages, and more recently emojis, and less-common uses of Unicode (ancient languages, math/music symbols etc), require the use of surrogates. – Remy Lebeau Aug 23 '17 at 08:13