
In Java, a String has methods:

length()/charAt(), codePointCount()/codePointAt()

C++11 has std::string a = u8"很烫烫的一锅汤";

but a.size() is the length of the underlying char array (the number of bytes), so it cannot be used to index Unicode characters.

Are there any solutions for handling Unicode in a C++ string?

Anthony Kong
linrongbin
  • Have you checked this answer?: http://stackoverflow.com/a/31475700/58129 – Anthony Kong Apr 09 '17 at 02:06
  • I usually convert `utf-8` to `UTF-32/UCS-2` `std::wstring` so that each code point is one character. There is code to convert in this answer here: https://stackoverflow.com/questions/42791433/c-tolower-on-special-characters-such-as-%c3%bc/42793626#42793626 else use a library – Galik Apr 09 '17 at 02:24
  • UCS-2 does not have room for all Chinese characters. – Rick James Apr 09 '17 at 03:52
  • @RickJames: Galik likely meant UTF-16 instead – Remy Lebeau Apr 10 '17 at 20:42
  • UTF-16 does not have room for all Chinese characters _in a single 'character'_. So `a.size()` will (I think) be incorrect. – Rick James Apr 11 '17 at 05:22

1 Answer


I generally convert the UTF-8 string to a wide UTF-32/UCS-2 string before doing character operations. C++ does actually give us functions to do that but they are not very user friendly so I have written some nicer conversion functions here:

#include <codecvt>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// This should convert to whatever the system wide character encoding 
// is for the platform (UTF-32/Linux - UCS-2/Windows)
std::string ws_to_utf8(std::wstring const& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    if(cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

int main()
{
    std::string s = u8"很烫烫的一锅汤";

    auto w = utf8_to_ws(s); // convert to wide (UTF-32/UCS-2)

    // now we can use code-point indexes on the wide string

    std::cout << s << " is " << w.size() << " characters long" << '\n';
}

Output:

很烫烫的一锅汤 is 7 characters long

If you want to convert to and from UTF-32 regardless of platform then you can use the following (not so well tested) conversion routines:

std::string utf32_to_utf8(std::u32string const& utf32)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::string utf8 = cnv.to_bytes(utf32);
    if(cnv.converted() < utf32.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::u32string utf8_to_utf32(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::u32string utf32 = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return utf32;
}

NOTE: As of C++17 std::wstring_convert is deprecated.

However I still prefer to use it over a third party library because it is portable, it avoids external dependencies, it won't be removed until a replacement is provided and in all cases it will be easy to replace the implementations of these functions without having to change all the code that uses them.

Galik
  • Cool, but I have seen some discussions saying that on some platforms wchar_t can be uint16_t rather than uint32_t. That can cause errors when indexing characters in Unicode strings. – linrongbin Apr 09 '17 at 02:43
  • @zhaochenyou This should convert correctly for each platform. On `Windows` it will create `2-byte` `wchar_t` characters encoded in `UCS-2` and on `Linux` it will create `4-byte` `wchar_t` characters encoded with `UTF-32`. – Galik Apr 09 '17 at 02:45
  • This will work well until someone goes and gives you a string with a '' character in it. Then you'll get different lengths on different platforms. – Miles Budnek Apr 09 '17 at 02:53
  • @MilesBudnek I have added code to convert to `UTF-32` regardless of platform which, I assume, should fix any problems `2 char` encoding may have (your character works fine on `Linux` I can't test on `Windows` unfortunately) – Galik Apr 09 '17 at 02:59
  • Yes, all currently existing Unicode code points will fit into a single UTF-32 unit. – Miles Budnek Apr 09 '17 at 03:03
  • @Galik: Note that Windows uses UTF-16, not UCS-2. UTF-16 is the successor to UCS-2. Both encodings can be represented using 16bit `wchar_t` on Windows. – Remy Lebeau Apr 10 '17 at 20:44
  • @RemyLebeau Okay. But as far as these conversion routines go, according to the standard, they convert between `UTF-8` and `UCS-2` on systems where `wchar_t` is `2` bytes wide. `UTF-16` doesn't have a `1` to `1` character to code point relation so you may want to use the functions that convert to `UTF-32` if `UCS-2` is insufficient. – Galik Apr 10 '17 at 22:25
  • `std::wstring_convert` converts between UTF-8 and UCS-2 if you use `std::codecvt_utf8`, but converts between UTF-8 and UTF-16 if you use `std::codecvt_utf8_utf16` instead. See the table on [cppreference.com](http://en.cppreference.com/w/cpp/locale/wstring_convert). In your examples, `utf8_to_ws()` and `ws_to_utf8()` should be using `std::codecvt_utf8_utf16` on Windows – Remy Lebeau Apr 10 '17 at 22:39
  • @RemyLebeau I understand that, but `UTF-16` can't be used to solve this particular question because it uses surrogate-pairs. Hence you can use `UTF-32` if `UCS-2` is insufficient. – Galik Apr 10 '17 at 22:46
  • Frightening to think that at some point in the future all Unicode code points may not fit into UTF-32. – vy32 Dec 01 '20 at 02:37