Better way to traverse std::string with variable length characters?

Question

Given a string std::string str = "google谷歌", I traverse it and print each character:

for (uint32 i = 0; i < str.length(); ++i)
    std::cout << str[i] << std::endl;

which prints:

g
o
o
g
l
e
�
�
�
�
�
�

This is obviously wrong, so I switched to std::wstring:

for (uint32 i = 0; i < wstr.length(); ++i)   // wstr: the same text held in a std::wstring
    std::cout << wstr[i] << std::endl;

which prints:

The output is the raw integer value of each character, which is correct. I could use the utf8cpp library to convert them back to UTF-8 and print them correctly.

The question is: is there an easy way to traverse a std::string with variable-length characters without using std::wstring?

I also have some ugly code here:

// utf8:: functions come from the utf8cpp library; std::back_inserter needs <iterator>.
bool Utf8toWStr(const std::string& utf8str, std::wstring& wstr)
{
    wstr.clear();
    // Append through back_inserter instead of pre-sizing with utf8::distance():
    // a code point above U+FFFF becomes two UTF-16 code units, which would
    // overflow a buffer sized by code-point count.
    utf8::utf8to16(utf8str.begin(), utf8str.end(), std::back_inserter(wstr));
    return true;
}

bool WStrToUtf8(const std::wstring& wstr, std::string& utf8str)
{
    std::string utf8str2;
    utf8str2.resize(wstr.size() * 4);                   // allocate for the longest case

    char* oend = utf8::utf16to8(wstr.c_str(), wstr.c_str() + wstr.size(), &utf8str2[0]);
    utf8str2.resize(oend - (&utf8str2[0]));             // remove unused tail
    utf8str = utf8str2;

    return true;
}

std::string m_text;
std::wstring textWStr;
Utf8toWStr(m_text, textWStr);
auto textLen = textWStr.length();
// Print every prefix of the text, one UTF-16 code unit longer each time.
for (uint32 i = 1; i <= textLen; ++i)
{
    std::wstring subWStr = textWStr.substr(0, i);
    std::string subStr;
    WStrToUtf8(subWStr, subStr);
    std::cout << "subStr = " << subStr << std::endl;
}

Print “each character”… what is a character? Is it a grapheme cluster? (Don’t use wstring or wchar_t ever, by the way – they have all the same problems as string/char on Windows and then some.) — Ry-, Apr 04 '18 at 03:55
@Ry So the ultimate way is to use Boost.Text, like Henri Menke pointed? — ryancheung, Apr 04 '18 at 04:05
It's relatively easy to recognize the end points of UTF-8 sequences: If the next byte is not in the range `0x80` to `0xbf`, it's safe to cut off. — cmaster - reinstate monica, Apr 04 '18 at 05:10
`wstring` is not necessarily utf-16 encoded. You can use `u16string`. — xskxzr, Apr 04 '18 at 05:26
Please note that utf16 is also a variable-width encoding!!! Using `wchar_t` won't save you from splitting eg. 'PILE OF POO' (U+1F4A9) in half:( — el.pescado - нет войне, Apr 04 '18 at 05:37

score 2 · Accepted Answer · answered Apr 04 '18 at 04:56

Don't use std::wstring and friends except to interface with broken libraries (for example, the Windows API). They only ever make the problem worse. UTF16 is still a variable-width encoding.

The correct solution is to use UTF8 everywhere, as discussed here.

Iterating through 'characters' in a UTF8 string, where 'character' is either code-point or grapheme cluster, is not a feature of the standard library. ICU is a fairly common choice for that task. If you just want to output the string, just feed the entire string to std::cout, which should handle UTF8 correctly. If you're stuck with Windows, use a wrapper that forwards to std::cout in good standard libraries and forwards a converted std::string to std::wcout in bad ones.

James Picone

I also want to take substrings, as in the ugly code I posted. ICU seems a little complicated compared to Boost.Text (proposed). – ryancheung Apr 04 '18 at 05:45
Getting substrings will also require something that's encoding-aware (like ICU). The Boost.Text library you've been linked to might work as well; I'd just note that it's not as mature as ICU and isn't actually a Boost library yet. Documentation claims to have the requisite operations. – James Picone Apr 04 '18 at 06:17
Interesting part about `cout`. I think a terminal set up for utf-8 would display the proper glyphs for the byte stream without any previous processing, wouldn't it? What would a standard C++ stream do to it? What would be the result if I output it to a `stringstream` instead? A different string? – Peter - Reinstate Monica Apr 04 '18 at 07:29
@PeterA.Schneider it's implementation-defined behaviour, and the standard library provided with Visual Studio treats `std::cout` as displaying some ANSI codepage. Similarly there's no standard way to open a file with a filename that can't be represented in an ANSI codepage in Visual Studio's C++ standard library (although `std::filesystem` support might change that). – James Picone Apr 04 '18 at 08:44
