
Given a string std::string str = "google谷歌", I traverse it and print each character:

for (uint32 i = 0; i <= str.length(); ++i)
    std::cout << str[i] << std::endl;

which prints:

g
o
o
g
l
e
�
�
�
�
�
�

This is obviously wrong, so I switched str to a std::wstring:

for (uint32 i = 0; i <= str.length(); ++i)
    std::cout << str[i] << std::endl;

which prints:

103
111
111
103
108
101
35895
27468
0

These are the raw integer values of each character, which are correct (the trailing 0 is the terminating null, read because the loop runs to i <= str.length()). I could use the utf8cpp library to convert them to UTF-8 and print them correctly.

The question is: is there an easy way to traverse a std::string that contains variable-length characters without using std::wstring?

I also have some ugly code here:

#include <iostream>
#include <iterator>
#include <string>
#include "utf8.h"  // utf8cpp

// Convert UTF-8 to UTF-16 code units stored in a std::wstring.
bool Utf8toWStr(const std::string& utf8str, std::wstring& wstr)
{
    wstr.clear();
    // Append through back_inserter: sizing the buffer by utf8::distance
    // (code points) is too small when a code point needs a surrogate pair.
    utf8::utf8to16(utf8str.begin(), utf8str.end(), std::back_inserter(wstr));
    return true;
}

// Convert UTF-16 code units back to UTF-8.
bool WStrToUtf8(const std::wstring& wstr, std::string& utf8str)
{
    utf8str.clear();
    utf8::utf16to8(wstr.begin(), wstr.end(), std::back_inserter(utf8str));
    return true;
}

// Print every prefix of m_text, one UTF-16 code unit longer each time.
// Note: a prefix that ends between the two halves of a surrogate pair
// is not valid UTF-16.
std::string m_text;
std::wstring textWStr;
Utf8toWStr(m_text, textWStr);
auto textLen = textWStr.length();
for (size_t i = 1; i <= textLen; ++i)
{
    std::wstring subWStr = textWStr.substr(0, i);
    std::string subStr;
    WStrToUtf8(subWStr, subStr);
    std::cout << "subStr = " << subStr << std::endl;
}
ryancheung

1 Answer


Don't use std::wstring and friends except to interface with broken libraries (for example, the Windows API). They only ever make the problem worse. UTF16 is still a variable-width encoding.
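
To make the variable-width point concrete, here is a minimal sketch using only the standard library. The helper name count_code_points_utf16 is hypothetical, and the code assumes well-formed UTF-16. It shows that the number of UTF-16 code units and the number of code points diverge as soon as a character outside the Basic Multilingual Plane appears.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-16 string by skipping low (trailing) surrogates.
// Hypothetical helper for illustration; assumes the input is well-formed UTF-16.
std::size_t count_code_points_utf16(const std::u16string& s)
{
    std::size_t n = 0;
    for (char16_t u : s) {
        // 0xDC00..0xDFFF is the second half of a surrogate pair, not a new code point.
        if (u < 0xDC00 || u > 0xDFFF)
            ++n;
    }
    return n;
}
```

For example, std::u16string(u"\U0001F600") has size() == 2 but holds a single code point, so indexing a UTF-16 string by position has the same problem as indexing a UTF-8 std::string by byte.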

The correct solution is to use UTF8 everywhere, as discussed here.

Iterating through 'characters' in a UTF8 string, where 'character' is either code-point or grapheme cluster, is not a feature of the standard library. ICU is a fairly common choice for that task. If you just want to output the string, just feed the entire string to std::cout, which should handle UTF8 correctly. If you're stuck with Windows, use a wrapper that forwards to std::cout in good standard libraries and forwards a converted std::string to std::wcout in bad ones.
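
If pulling in ICU is too heavy and code points (not grapheme clusters) are enough, the iteration can be sketched by hand. The following is an illustration only, not a replacement for a real Unicode library: the function names are hypothetical, and it assumes the input is valid UTF-8 and performs no validation beyond not reading past the end of the string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence that starts with lead byte `b`.
// Returns 1 for continuation or invalid lead bytes so the caller always advances.
std::size_t utf8_sequence_length(unsigned char b)
{
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Split a UTF-8 string into one std::string per code point.
// Assumes valid UTF-8; a truncated final sequence is returned as-is.
std::vector<std::string> split_code_points(const std::string& s)
{
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (len > s.size() - i)
            len = s.size() - i;  // truncated tail: take the remaining bytes
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}
```

Calling split_code_points on the UTF-8 bytes of "google谷歌" yields eight strings, and writing each one to std::cout on its own line sends complete byte sequences to the terminal instead of isolated bytes. Combining characters and other multi-code-point graphemes would still be split, which is where a library such as ICU is needed.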

James Picone
  • I also want to take substrings, like in the ugly code I posted. ICU seems a little complicated compared to Boost.Text (proposed). – ryancheung Apr 04 '18 at 05:45
  • Getting substrings will also require something that's encoding-aware (like ICU). The Boost.Text library you've been linked to might work as well; I'd just note that it's not as mature as ICU and isn't actually a Boost library yet. Documentation claims to have the requisite operations. – James Picone Apr 04 '18 at 06:17
  • Interesting part about `cout`. I think a terminal set up for utf-8 would display the proper glyphs for the byte stream without any previous processing, wouldn't it? What would a standard C++ stream do to it? What would be the result if I output it to a `stringstream` instead? A different string? – Peter - Reinstate Monica Apr 04 '18 at 07:29
  • @PeterA.Schneider it's implementation-defined behaviour, and the standard library provided with Visual Studio treats `std::cout` as displaying some ANSI codepage. Similarly there's no standard way to open a file with a filename that can't be represented in an ANSI codepage in Visual Studio's C++ standard library (although `std::filesystem` support might change that). – James Picone Apr 04 '18 at 08:44
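
For the substring case raised in the comments, a code-point-aware prefix can be sketched without std::wstring by skipping UTF-8 continuation bytes. This is a rough illustration under the assumption that the input is valid UTF-8; utf8_prefix_bytes and utf8_prefix are hypothetical names, and the code counts code points, not grapheme clusters.

```cpp
#include <cstddef>
#include <string>

// Byte offset just past the first `n` code points of the UTF-8 string `s`.
// If `s` holds fewer than `n` code points, returns s.size().
std::size_t utf8_prefix_bytes(const std::string& s, std::size_t n)
{
    std::size_t i = 0;
    while (i < s.size() && n > 0) {
        ++i;  // consume the lead byte
        // Consume the continuation bytes (10xxxxxx) that belong to it.
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;
        --n;
    }
    return i;
}

// The first `n` code points of `s` as a new UTF-8 string.
std::string utf8_prefix(const std::string& s, std::size_t n)
{
    return s.substr(0, utf8_prefix_bytes(s, n));
}
```

Looping n from 1 upward and printing utf8_prefix(m_text, n) reproduces the growing-prefix output of the question's code while staying in UTF-8 the whole time.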