utf8mb4 encode/decode in c++

Question

A third-party server echoes a string to my client program. The string contains both UTF-8 data and Unicode emoji (listed here), for example:

I googled for a while and found that this is called utf8mb4 encoding, which is used in SQL applications.

I found some articles about utf8mb4 in MySQL/Python/Ruby/etc., but none for C++. Is there any C++ library that can encode/decode utf8mb4?

utf8 by definition can't be 5 bytes (see for example [this](https://dev.mysql.com/doc/refman/5.0/en/charset-unicode-utf8.html) that references various standards). So the right part of the image is wrong. MySQL calls [utf8mb4 what is in truth utf8](https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html), so any library that supports utf8 will give you utf8mb4. What MySQL calls utf8 is up-to-3-bytes utf8 (see the same page) — xanatos, Aug 16 '15 at 07:51
Sorry I've updated the image, the right part is two three-byte utf8 data. — aj3423, Aug 16 '15 at 08:00
In C++11 they added directly in the C++ libraries some standard way to do it: http://en.cppreference.com/w/cpp/locale/codecvt — xanatos, Aug 16 '15 at 08:07

score 1 · Accepted Answer · edited May 23 '17 at 11:50

MySQL calls utf8mb4 what is in truth utf8:

The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character supports supplemental characters:

so any library that supports utf8 will give you utf8mb4. Solutions in C++ for converting to/from utf8 were already discussed in these questions: How to work with UTF-8 in C++, Conversion from other Encodings to UTF-8. The three solutions given there are ICU (International Components for Unicode), Boost.Locale and C++11.
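
As a rough illustration of why ordinary UTF-8 handling already covers utf8mb4, here is a minimal hand-written sketch. It is not ICU, Boost.Locale, or the C++11 `codecvt` facilities, and the helper names `decode_utf8` and `append_utf8` are made up for this example. Validation is deliberately incomplete: overlong forms and surrogate code points are not rejected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into Unicode code points.
// Accepts 1- to 4-byte sequences, i.e. the full range MySQL calls utf8mb4.
// Sketch only: does not reject overlong encodings or surrogate code points.
std::vector<char32_t> decode_utf8(const std::string& in) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte tells us how long the sequence is.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper: append one code point to `out` as UTF-8 (1 to 4 bytes).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // Code points above the BMP (most emoji) take the 4-byte form.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("code point out of Unicode range");
    }
}
```

For instance, `decode_utf8("aaa\xF0\x9F\x98\x80")` should yield four code points, the last being U+1F600. That is the 4-byte case that MySQL's 3-byte utf8 cannot store but plain UTF-8 handles. In real code one of the libraries named above is probably the safer choice, since they also handle validation.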

edited May 23 '17 at 11:50

Community

answered Aug 16 '15 at 08:17

xanatos

I tried the C++11 version, here's my code: `http://pastebin.com/gkCUaS5U`, but after converting to std::wstring, the emoji was lost; the wstring only contains three 'a'. – aj3423 Aug 16 '15 at 08:57
@aj3423 Strange: https://ideone.com/5iAhzF prints 6. What compiler are you using? – xanatos Aug 16 '15 at 08:59
Sorry, my mistake, the length is 6: "\x61\x00\x61\x00\x61\x00". It only contains the three 'a', the rest are gone. – aj3423 Aug 16 '15 at 09:11
@aj3423 You are confusing `char` with `wchar_t`. `sizeof(wchar_t)` is 2 (Visual C++) or 4 bytes long (GCC). So it is 6x wchar_t. See the revised example: https://ideone.com/KAACJt – xanatos Aug 16 '15 at 09:58
@aj3423 Technically, in C++11, to solve the problem that `wchar_t` wasn't uniquely defined (let's say that in Visual C++ it is UTF-16, while in GCC it is UTF-32), they created a whole group of new types for UTF-16 and UTF-32: `char16_t`, `char32_t`, `u16string`, `u32string`, plus some new literals (as seen in http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c11) – xanatos Aug 16 '15 at 10:01
A fourth solution is [Ogonek](https://github.com/rmartinho/ogonek). Still not mature though. But by far (!!!) the best API of any of the alternatives. – Konrad Rudolph Aug 16 '15 at 13:03
@aj3423 Ok... It is a cesspit... Two (three) problems: Visual C++ has a 2-byte `wchar_t` that can't contain all the Unicode characters. Ideone uses GCC, which has a 4-byte `wchar_t`. Use this line: `std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> cvt_utf8;`. This will convert from utf8 to utf16, while the other line I had given you was converting to UCS2/UCS4. Second problem: you can't `wcout << L"ws: " << ws << endl`, because the `wcout` breaks and stops working (I don't know why). The `for` cycle works correctly. https://ideone.com/KAACJt Note that now the length is 7 UTF-16 code units – xanatos Aug 16 '15 at 16:34
`\x61\x00` is not `char` for `a`. Emoji need 4 bytes of utf8 or 2 words of utf16 or 1 word of utf32. One wchar_t (if it is only 16 bits) cannot hold an Emoji. – Rick James Mar 07 '16 at 19:06
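
To make the sizes in the last comment concrete, here is a small sketch of how a single code point maps to UTF-16 code units. The function name `to_utf16` is hypothetical and the code is only an illustration of the standard surrogate-pair arithmetic, not what `std::wstring_convert` does internally.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
// BMP code points take one 16-bit unit; everything above U+FFFF
// (including most emoji) takes two units, a surrogate pair.
std::u16string to_utf16(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        char32_t v = cp - 0x10000;  // 20 bits remain after the offset
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

With this, `to_utf16(U'a')` has length 1, while `to_utf16(0x1F600)` has length 2 (the units 0xD83D and 0xDE00). This is why a single 16-bit `wchar_t` cannot hold an emoji.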
