Any good solutions for C++ string code point and code unit?

Question

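In Java, String has methods for both code units and code points:

length()/charAt() for code units, codePointCount()/codePointAt() for code points

C++11 lets me write std::string a = u8"很烫烫的一锅汤";

but a.size() returns the length of the underlying char array (the number of UTF-8 code units), and indexing gives bytes, not Unicode characters.

Is there a good way to work with Unicode code points in a C++ string?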

Have you checked this answer?: http://stackoverflow.com/a/31475700/58129 — Anthony Kong, Apr 09 '17 at 02:06
I usually convert `utf-8` to `UTF-32/UCS-2` `std::wstring` so that each code point is one character. There is code to convert in this answer here: https://stackoverflow.com/questions/42791433/c-tolower-on-special-characters-such-as-%c3%bc/42793626#42793626 else use a library — Galik, Apr 09 '17 at 02:24
UTF-16 does not have room for all Chinese characters _in a single 'character'_. So `a.size()` will (I think) be incorrect. — Rick James, Apr 11 '17 at 05:22

Galik · Accepted Answer · 2018-12-03T13:37:14.727

8

I generally convert the UTF-8 string to a wide UTF-32/UCS-2 string before doing character operations. C++ does actually give us functions to do that but they are not very user friendly so I have written some nicer conversion functions here:

#include <codecvt>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// This should convert to whatever the system wide character encoding 
// is for the platform (UTF-32/Linux - UCS-2/Windows)
std::string ws_to_utf8(std::wstring const& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    if(cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

int main()
{
    // Compiles as written up to C++17; in C++20 u8"" literals have type char8_t.
    std::string s = u8"很烫烫的一锅汤";

    auto w = utf8_to_ws(s); // convert to wide (UTF-32/UCS-2)

    // now we can use code-point indexes on the wide string

    std::cout << s << " is " << w.size() << " characters long" << '\n';
}

Output:

很烫烫的一锅汤 is 7 characters long

If you want to convert to and from UTF-32 regardless of platform then you can use the following (not so well tested) conversion routines:

std::string utf32_to_utf8(std::u32string const& utf32)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::string utf8 = cnv.to_bytes(utf32);
    if(cnv.converted() < utf32.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::u32string utf8_to_utf32(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::u32string utf32 = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return utf32;
}

NOTE: As of C++17 std::wstring_convert is deprecated.

However I still prefer to use it over a third party library because it is portable, it avoids external dependencies, it won't be removed until a replacement is provided and in all cases it will be easy to replace the implementations of these functions without having to change all the code that uses them.

edited Dec 03 '18 at 13:37

answered Apr 09 '17 at 02:37

Galik


Cool, but I have seen some discussions saying that on some platforms wchar_t can be 16 bits wide rather than 32. That can cause errors when indexing characters in Unicode strings. – linrongbin Apr 09 '17 at 02:43
@zhaochenyou This should convert correctly for each platform. On `Windows` it will create `2-byte` `wchar_t` characters encoded in `UCS-2` and on `Linux` it will create `4-byte` `wchar_t` characters encoded with `UTF-32`. – Galik Apr 09 '17 at 02:45
This will work well until someone goes and gives you a string with a '' character in it. Then you'll get different lengths on different platforms. – Miles Budnek Apr 09 '17 at 02:53
@MilesBudnek I have added code to convert to `UTF-32` regardless of platform, which, I assume, should fix any problems a `2-byte` encoding may have (your character works fine on `Linux`; I can't test on `Windows`, unfortunately). – Galik Apr 09 '17 at 02:59
Yes, all currently existing Unicode code points will fit into a single UTF-32 unit. – Miles Budnek Apr 09 '17 at 03:03
@Galik: Note that Windows uses UTF-16, not UCS-2. UTF-16 is the successor to UCS-2. Both encodings can be represented using 16bit `wchar_t` on Windows. – Remy Lebeau Apr 10 '17 at 20:44
@RemyLebeau Okay. But as far as these conversion routines go, according to the standard, they convert between `UTF-8` and `UCS-2` on systems where `wchar_t` is `2` bytes wide. `UTF-16` doesn't have a `1` to `1` character to code point relation so you may want to use the functions that convert to `UTF-32` if `UCS-2` is insufficient. – Galik Apr 10 '17 at 22:25
`std::wstring_convert` converts between UTF-8 and UCS-2 if you use `std::codecvt_utf8`, but converts between UTF-8 and UTF-16 if you use `std::codecvt_utf8_utf16` instead. See the table on [cppreference.com](http://en.cppreference.com/w/cpp/locale/wstring_convert). In your examples, `utf8_to_ws()` and `ws_to_utf8()` should be using `std::codecvt_utf8_utf16` on Windows – Remy Lebeau Apr 10 '17 at 22:39
@RemyLebeau I understand that, but `UTF-16` can't be used to solve this particular question because it uses surrogate-pairs. Hence you can use `UTF-32` if `UCS-2` is insufficient. – Galik Apr 10 '17 at 22:46
Frightening to think that at some point in the future all Unicode code points may not fit into UTF-32. – vy32 Dec 01 '20 at 02:37
