119

My platform is a Mac. I'm a C++ beginner and working on a personal project which processes Chinese and English. UTF-8 is the preferred encoding for this project.

I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoiding wchar_t, as there's currently no char8_t type for UTF-8.

However, none of them talk about how to properly deal with functions like str[i], std::string::size(), std::string::find_first_of() or std::regex, as these functions usually return unexpected results when facing UTF-8.
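
For instance, here is a minimal example of the kind of surprise I mean (assuming the source file is saved as UTF-8 and the compiler uses UTF-8 for narrow string literals, as clang on my Mac does by default):

```cpp
#include <iostream>
#include <string>

int main() {
    std::string str = "哈哈haha";  // 2 Chinese characters + 4 ASCII letters

    // size() counts bytes (code units), not characters: 10 (3 + 3 + 4), not 6.
    std::cout << str.size() << '\n';

    // str[0] is only the first byte of "哈", not the whole character.
    std::cout << static_cast<int>(static_cast<unsigned char>(str[0])) << '\n';

    // find_first_of() treats its argument as a set of bytes, so it reports a
    // "match" at position 0 even though "啊" never appears in the string.
    std::cout << str.find_first_of("啊") << '\n';
}
```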

Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what's the best practice for one to handle the above problems?

Enlico
Saddle Point
  • 14
    See also [utf8everywhere](http://utf8everywhere.org/) – Caleth May 18 '18 at 09:05
  • 3
    Why (and how?!) would you use `std::wstring` with UTF-8? – Jonathan Wakely May 18 '18 at 11:26
  • 6
    `std::string::size()` is only surprising if you expect it to do something other than return the length in **bytes**, i.e. code units (not the number of code points in the string). And `str[i]` returns the i-th **byte** in the string. But that would still be true even if C++ had a `char8_t` type specifically for UTF-8. – Jonathan Wakely May 18 '18 at 11:30
  • This might be a bit off-topic, but why C++? It's rather a second-class citizen on the Mac; Apple provides much better support for Objective-C and (more recently) Swift. On the basis that it sounds like you're writing a command-line app, you might like to take a look at [this](https://www.raywenderlich.com/163134/command-line-programs-macos-tutorial-2). Then you can stop worrying about C++'s crappy support for Unicode and get on with writing your program. Google `swift unicode` and `swift regex`, it's all done for you. – Paul Sanders May 22 '18 at 20:01
  • PS: what does the program actually _do_? – Paul Sanders May 22 '18 at 20:22
  • @PaulSanders Hi, I'm writing a Chinese word segmenter. The Python version is very slow. After a quick try I found that the C++ version is much faster than the Python one, and I'm going to wrap it using Cython. The main issue is the string manipulation in C++ when facing Unicode. I've also found that other segmentation libs implement their [own](https://github.com/yanyiwu/cppjieba/blob/master/include/cppjieba/Unicode.hpp) string-like class to do this. About the Mac, it's just my development environment :) – Saddle Point May 23 '18 at 01:56
  • OK, thanks. So the finished app has to run on other platforms? If so, that is a game-changer. Let me know either way and I'll try to post something useful. And maybe add a bit of detail to your question about how you plan to go about this task. I don't suppose anybody here has the faintest clue, we're all shooting in the dark. – Paul Sanders May 23 '18 at 06:14
  • I posted an answer [here](https://stackoverflow.com/questions/50413471/what-exactly-can-wchar-t-represent/50540643#50540643) that you might find interesting. (I am just after those 300 points :). Also, there are native API's on the Mac that you might want to get at that the approach outlined there would facilitate. If that aspect interests you then I can post more details as an answer here. If you want your code to run cross-platform then check out [ICU](http://site.icu-project.org/). – Paul Sanders May 28 '18 at 14:48
  • You don't actually need to use `char8_t` because the C++ standard guarantees that the standard `char` type is at least 1 byte, so on all modern machines with an 8 bit byte `char` could actually hold a unicode code point – txk2048 Jun 17 '20 at 08:04
  • 2
    re "why C++?". portability. apples are not the only fruit :) –  Nov 22 '20 at 09:28

5 Answers

173

Unicode Glossary

Unicode is a vast and complex topic. I do not wish to wade too deep into it here; however, a quick glossary is necessary:

  1. Code Points: Code Points are the basic building blocks of Unicode; a code point is just an integer mapped to a meaning. The integer portion fits into 32 bits (well, 21 bits really), and the meaning can be a letter, a diacritic, a white space, a sign, a smiley, half a flag, ... and it can even be "the next portion reads right to left".
  2. Grapheme Clusters: Grapheme Clusters are groups of semantically related Code Points; for example, a flag in Unicode is represented by associating two Code Points: each of those two, in isolation, has no meaning, but associated together in a Grapheme Cluster they represent a flag. Grapheme Clusters are also used to pair a letter with a diacritic in some scripts.

These are the basics of Unicode. The distinction between Code Point and Grapheme Cluster can mostly be glossed over, because for most modern languages each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture into smileys, flags, etc., then you may have to pay attention to the distinction.
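
For illustration, here is a small sketch of the distinction (assuming a compiler whose narrow execution encoding is UTF-8, the default for clang and GCC; the byte counts are simply the UTF-8 lengths of the literals):

```cpp
#include <iostream>
#include <string>

int main() {
    // One Grapheme Cluster made of two Code Points: "e" + combining acute accent.
    std::string decomposed = "e\u0301";          // 3 UTF-8 bytes

    // One Grapheme Cluster (a flag) made of two Code Points: regional
    // indicator symbols "F" (U+1F1EB) and "R" (U+1F1F7).
    std::string flag = "\U0001F1EB\U0001F1F7";   // 8 UTF-8 bytes

    std::cout << decomposed.size() << ' ' << flag.size() << '\n';  // 3 8
}
```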


UTF Primer

Then, a series of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.

In UTF-X, X is the size in bits of the Code Unit, each Code Point is represented as one or several Code Units, depending on its magnitude:

  • UTF-8: 1 to 4 Code Units,
  • UTF-16: 1 or 2 Code Units,
  • UTF-32: 1 Code Unit.
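
To make the Code Unit counts concrete, here is a small sketch comparing the three encodings (compiled as C++11/14/17; note that in C++20 the type of u8 literals changes to char8_t):

```cpp
#include <iostream>

int main() {
    // "哈" (U+54C8) lies in the Basic Multilingual Plane:
    const char     a8[]  = u8"哈";  // 3 UTF-8 code units
    const char16_t a16[] = u"哈";   // 1 UTF-16 code unit
    const char32_t a32[] = U"哈";   // 1 UTF-32 code unit

    // "😀" (U+1F600) lies outside the BMP:
    const char     b8[]  = u8"😀";  // 4 UTF-8 code units
    const char16_t b16[] = u"😀";   // 2 UTF-16 code units (a surrogate pair)
    const char32_t b32[] = U"😀";   // 1 UTF-32 code unit

    // Subtract 1 for the terminating null code unit.
    std::cout << sizeof(a8) - 1 << ' '
              << sizeof(a16) / sizeof(char16_t) - 1 << ' '
              << sizeof(a32) / sizeof(char32_t) - 1 << '\n';  // 3 1 1
    std::cout << sizeof(b8) - 1 << ' '
              << sizeof(b16) / sizeof(char16_t) - 1 << ' '
              << sizeof(b32) / sizeof(char32_t) - 1 << '\n';  // 4 2 1
}
```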

std::string and std::wstring.

  1. Do not use std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).
  2. The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).
  3. While a 32-bit wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.

If you are only reading or composing strings, you should have little to no trouble with std::string or std::wstring.

Trouble starts when you start slicing and dicing: then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Cluster boundaries. The former can be handled easily enough on your own; the latter requires using a Unicode-aware library.
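
As an illustration of point 2 (converting at the boundary) and of handling Code Point boundaries by hand, here is a minimal decoding sketch. It assumes its input is already valid UTF-8 and does no error checking; for real input, prefer a vetted Unicode library such as ICU:

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Decode a (valid) UTF-8 string into a string of Code Points.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char lead = s[i];
        char32_t cp;
        std::size_t len;
        if      (lead < 0x80) { cp = lead;        len = 1; }  // ASCII
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        for (std::size_t j = 1; j < len; ++j)  // append the continuation bytes
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
        out.push_back(cp);
        i += len;  // advance by one whole Code Point
    }
    return out;
}

int main() {
    std::u32string cps = decode_utf8("哈哈haha");  // UTF-8 source encoding assumed
    std::cout << cps.size() << '\n';  // 6 Code Points (10 bytes in UTF-8)
}
```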


Picking std::string or std::u32string?

If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint; though heavy use of Chinese may change the deal. As always, profile.

If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.

If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.


UTF-8 in std::string.

UTF-8 actually works quite well in std::string.

Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.

Due to the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:

  • str.find('\n') works,
  • str.find("...") works for matching byte by byte1,
  • str.find_first_of("\r\n") works if searching for ASCII characters.
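
A minimal demonstration of these points, and of the find_first_of caveat discussed in the comments (again assuming a UTF-8 narrow execution encoding):

```cpp
#include <iostream>
#include <string>

int main() {
    std::string str = "哈哈haha";

    std::cout << str.find("哈") << '\n';             // 0: byte-wise substring match
    std::cout << str.find("ha") << '\n';             // 6: first ASCII 'h'
    std::cout << str.find_first_of("\r\n") << '\n';  // npos: the ASCII set is safe

    // find_first_of with a non-ASCII argument is NOT safe: the argument is a
    // set of bytes, so it "matches" inside 哈 although 啊 is absent.
    std::cout << str.find_first_of("啊") << '\n';    // 0
}
```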

Similarly, regex should mostly work out of the box. Since a sequence of characters, whether ASCII ("haha") or not ("哈"), is ultimately just a sequence of bytes, basic search patterns should work out of the box.

Be wary, however, of character classes (such as [:alnum:]), as depending on the regex flavor and implementation they may or may not match Unicode characters.

Similarly, be wary of applying repeaters to non-ASCII "characters", "哈?" may only consider the last byte to be optional; use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
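
Here is a small sketch of the repeater pitfall (using the default ECMAScript grammar of std::regex and a UTF-8 narrow execution encoding):

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string str = "哈哈haha";

    // Literal patterns are matched byte by byte, so this finds "哈h":
    std::cout << std::regex_search(str, std::regex("哈h")) << '\n';   // 1

    // "哈?" applies '?' only to the last byte of 哈, so it does NOT mean
    // "zero or one 哈"; for instance it fails to match an empty string:
    std::cout << std::regex_match("", std::regex("哈?")) << '\n';     // 0

    // Grouping the whole character restores the intended meaning:
    std::cout << std::regex_match("", std::regex("(哈)?")) << '\n';   // 1
}
```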

1 The key concepts to look-up are normalization and collation; this affects all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.
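
To make that footnote concrete: the two common ways to spell "é" compare as different byte sequences, and std::string alone cannot tell you they are the same text (that requires normalization, e.g. with ICU):

```cpp
#include <iostream>
#include <string>

int main() {
    std::string precomposed = "\u00e9";   // "é" as one Code Point (U+00E9), 2 bytes
    std::string decomposed  = "e\u0301";  // "é" as base letter + combining acute, 3 bytes

    // Byte-wise comparison: not equal, and sorting/searching treats them as
    // unrelated strings. Unicode normalization is needed to unify them.
    std::cout << (precomposed == decomposed) << '\n';  // 0
}
```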

Matthieu M.
  • Thanks for the great details! I'm trying to take some time to figure all of this out! About the original question: besides `str.find_first_of`, `str.find` and `std::regex` seem not to work for non-ASCII inputs (e.g. "哈" or u8"哈") given `std::string str(u8"哈哈haha");` – Saddle Point May 18 '18 at 09:42
  • 6
    @Edityouprofile: `str.find("哈")` should work (see https://ideone.com/s9i1yf), but `str.find('哈')` will not because `'哈'` is a multi-byte character. `str.find_first_of("哈")` will not work (it only works for ASCII patterns). Regex should work fine for ASCII patterns; however, beware of character classes and "repeaters" (e.g. `"哈?"` may only make the last byte conditional). – Matthieu M. May 18 '18 at 11:06
  • My bad, `str.find` works. Is there any way to fix the string size/length issue and the string iteration issue? – Saddle Point May 18 '18 at 11:16
  • 1
    For portability, would `std::basic_string<char32_t>` work as expected on both *nix and Windows? – Quentin May 18 '18 at 11:22
  • 2
    @Quentin: Yes. I should add it to the list of alternatives! By the way, there's a nifty typedef: `std::u32string`. – Matthieu M. May 18 '18 at 11:23
  • 2
    `str.find("...")` works only if you only care about matching byte-for-byte - otherwise you'll need a proper normalisation-and-locale-aware comparison. Other than that this seems like a pretty good answer, and shows why I kind of hate the Unicode "support" which exists in languages like Python3. – Muzer May 18 '18 at 12:38
  • 1
    @Muzer: Ah yes indeed, only matching byte for byte works. I'll amend with concerns about normalization/collation/locales. – Matthieu M. May 18 '18 at 12:41
13

std::string and friends are encoding-agnostic. The only difference between std::wstring and std::string is that std::wstring uses wchar_t as the individual element, not char. For most compilers the latter is 8-bit. The former is supposed to be large enough to hold any Unicode character, but in practice on some systems it isn't (Microsoft's compiler, for example, uses a 16-bit type). You can't store UTF-8 in std::wstring; that's not what it's designed for. It's designed to be an equivalent of UTF-32 - a string where each element is a single Unicode codepoint.

If you want to index UTF-8 strings by Unicode codepoint or composed unicode glyph (or some other thing), count the length of a UTF-8 string in Unicode codepoints or some other unicode object, or find by Unicode codepoint, you're going to need to use something other than the standard library. ICU is one of the libraries in the field; there may be others.

Something that's probably worth noting is that if you're searching for ASCII characters, you can mostly treat a UTF-8 bytestream as if it were byte-by-byte. Each ASCII character encodes the same in UTF-8 as it does in ASCII, and every multi-byte unit in UTF-8 is guaranteed not to include any bytes in the ASCII range.
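
As a rough sketch of both points (ASCII search is byte-safe, while anything fancier needs real Unicode machinery), the following counts code points, not grapheme clusters, by skipping UTF-8 continuation bytes; it assumes well-formed UTF-8 input:

```cpp
#include <cstddef>
#include <iostream>
#include <string>

int main() {
    std::string s = "哈哈haha";  // UTF-8-encoded source assumed

    // Searching for an ASCII character byte by byte is safe, because no byte
    // of a multi-byte UTF-8 sequence falls in the ASCII range:
    std::cout << s.find('h') << '\n';  // 6

    // Counting code points (NOT graphemes): skip continuation bytes 10xxxxxx.
    std::size_t code_points = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++code_points;
    std::cout << code_points << '\n';  // 6 code points in 10 bytes
}
```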

James Picone
  • 1
    Also note that `sizeof(wchar_t)` is platform-dependent. For example, it's 2 bytes on my Windows machine, and 4 bytes on a Linux machine. Your platform may expose other values. – Sergey May 18 '18 at 03:41
  • @Sergey After I posted this I realised that it probably is more complicated than that and looked up the reference. Apparently it's "Required to be large enough to represent any supported character code point". But Windows has never really supported unicode properly, so Microsoft getting it wrong isn't a surprise :P – James Picone May 18 '18 at 03:47
  • If it doesn't specify what is the supported encoding, it'll be an uphill battle to show that Microsoft is doing anything wrong. – zneak May 18 '18 at 03:51
  • https://stackoverflow.com/a/38771118/5557309 quotes standard: "Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).". Implication: Microsoft's compiler doesn't support Unicode. – James Picone May 18 '18 at 03:57
  • 2
    I don't believe that this is the implication. UTF-16 is a perfectly fine way to encode Unicode. – zneak May 18 '18 at 04:01
  • 4
    "distinct codes for all members of the largest extended character set" means that a single wchar_t has to be able to represent any valid Unicode code point if your compiler supports Unicode. 16 bits isn't enough for that. UTF-16 is a multi-byte encoding; it's not relevant here. – James Picone May 18 '18 at 04:04
  • 1
    I believe that the distinction that you are making is arbitrary. If UTF-16 is able to represent all of the characters that UTF-32 can, then what is the harm? Moreover, why is UTF-32 considered compliant when it still needs multiple code points to encode several Unicode logical characters such as flags? – zneak May 18 '18 at 04:06
  • 6
    The harm is that `std::wstring` really shouldn't be a multi-byte encoding; that's the point of the type. Making it a multi-byte encoding (and a bad one at that) is just duplicating `std::string`, but in a really annoying way that tricks people into thinking their code does Unicode properly. – James Picone May 18 '18 at 04:13
  • 13
    @zneak it's actually Unicode's fault, not Microsoft's. They told Microsoft that characters were 16-bit, then Microsoft went and made them 16-bit, then they said "oops, no, they have to be 20.5-bit". The only reason \*nixes don't have the same problem is because they *didn't support Unicode at all* until after the 20.5-bit decision was made. – user253751 May 18 '18 at 04:15
  • UTF-8 predates the release of Windows NT, which was the first Windows to support Unicode. But to be fair, only by about 6 months, and Windows NT started development four years before UTF-8 was developed. I do think it should have been obvious that UCS-2 wasn't going to be a functional encoding from the beginning, though. – James Picone May 18 '18 at 04:32
  • 1
    Plan 9 came out before WinNT and supported UTF-8, for example. – James Picone May 18 '18 at 04:37
  • 2
    @JamesPicone, that doesn’t explain the free pass that UTF-32 gets. Every UTF encoding, including UTF-32, is a multi-byte encoding. – zneak May 18 '18 at 04:40
  • 6
    @zneak UTF-32 isn't a multi-byte encoding in the same way UTF-16 is. UTF-16 sometimes requires multiple values to represent single unicode codepoints. UTF-32 sometimes requires multiple unicode codepoints to represent single graphemes. They're both tricky, but they're tricky at different levels. – James Picone May 18 '18 at 04:47
  • 1
    The considerations are the same. In both cases, you can't guess the logical length of the string from its byte size, and you can break up logical characters by splitting at the wrong index. – zneak May 18 '18 at 04:50
  • 2
    "Size in codepoints" and "size in graphemes" are different problems, and you want the former more than the latter. "Filter out this codepoint" and "filter out this grapheme" are different problems, and you want the former more than the latter. Etc. – James Picone May 18 '18 at 04:54
  • 1
    Given that we've already established that the size in code units and the size in code points have the same weaknesses, I struggle to find a case where one is significantly better than the other. It also seems that we won't agree on that, however. – zneak May 18 '18 at 05:18
  • 9
    @JamesPicone: "Variable-width encoding" is probably a more appropriate term than "multi-byte encoding". – user2357112 May 18 '18 at 07:03
  • Ken Thompson was a developer of Plan 9 and also the inventor of UTF-8, so no doubt Plan 9 had early UTF-8 support. And it seems [16-bit `wchar_t` is valid](https://stackoverflow.com/q/39548465/995714) – phuclv May 18 '18 at 08:06
  • 2
    In fact ["In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF"](https://en.wikipedia.org/wiki/UTF-8#History) so Plan 9 had UTF-8 support even before the official release of UTF-8 – phuclv May 18 '18 at 08:15
  • 1
    @JamesPicone - Microsoft found a loophole in *"among the supported locales"*. So VC++ only formally supports locales where 16-bit characters work. Everything else is a non-standard extension, which the C++ standard says nothing about. – Bo Persson May 18 '18 at 08:41
  • as long as [`__STDC_ISO_10646__` is not defined](https://stackoverflow.com/a/11107667/995714) 16-bit and even 8-bit `wchar_t` are still valid – phuclv May 19 '18 at 13:56
11

Consider upgrading to C++20 and std::u8string, which is the best thing we have as of 2019 for holding UTF-8. There are no standard library facilities to access individual code points or grapheme clusters, but at least the type is strong enough to say that it holds true UTF-8.
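
A minimal sketch of what that looks like (requires -std=c++20; as noted in the comments, library support is still thin, so even printing needs a conversion):

```cpp
#include <iostream>
#include <string>

int main() {
    // In C++20, u8"..." literals have type const char8_t[] and
    // std::u8string is std::basic_string<char8_t>.
    std::u8string s = u8"哈哈haha";

    std::cout << s.size() << '\n';  // 10 code units (bytes), not 6 characters

    // There is no operator<< for char8_t strings, so convert byte-wise to print:
    std::string bytes(reinterpret_cast<const char*>(s.data()), s.size());
    std::cout << bytes << '\n';
}
```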

  • this u8string has to be the way to go - see https://stackoverflow.com/questions/56420790/how-stdu8string-will-be-different-from-stdstring ... Boost's Nowide is pretty interesting for streaming UTF-8 - https://www.boost.org/doc/libs/1_74_0/libs/nowide/doc/html/index.html –  Nov 22 '20 at 09:22
  • 7
    Definitely avoid u8string because it's poorly supported in the standard. You won't even be able to output it. – vitaut Dec 27 '20 at 18:02
8

Both std::string and std::wstring must use a UTF encoding to represent Unicode. On macOS specifically, std::string is UTF-8 (8-bit code units), and std::wstring is UTF-32 (32-bit code units); note that the size of wchar_t is platform-dependent.

For both, size tracks the number of code units instead of the number of code points or grapheme clusters. (A code point is one named Unicode entity, one or more of which form a grapheme cluster. Grapheme clusters are the visible characters that users interact with, like letters or emojis.)

Although I'm not familiar with the Unicode representation of Chinese, it's very possible that when you use UTF-32, the number of code units is often very close to the number of grapheme clusters. Obviously, however, this comes at the cost of using up to 4x more memory.

The most accurate solution would be to use a Unicode library, such as ICU, to calculate the Unicode properties that you are after.

Finally, UTF strings in human languages that don't use combining characters usually do pretty well with find/regex. I'm not sure about Chinese, but English is one of them.
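
A small sketch of the code-unit counts described above, assuming macOS defaults (UTF-8 narrow literals, 32-bit wchar_t with UTF-32 wide literals):

```cpp
#include <iostream>
#include <string>

int main() {
    std::string  s = "哈哈haha";   // UTF-8:  3 + 3 + 4 = 10 code units
    std::wstring w = L"哈哈haha";  // UTF-32: 1 + 1 + 4 =  6 code units

    std::cout << s.size() << ' ' << w.size() << '\n';  // 10 6
    // Neither number is the count of grapheme clusters in general; computing
    // that requires a Unicode library such as ICU.
}
```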

zneak
  • A number of Chinese characters are multi-byte strings in UTF-16. You can search a UTF-16 for anything in the basic multilingual plane safely though. – James Picone May 18 '18 at 04:02
  • 2
    Thanks for your answer. While `std::string str(u8"哈哈haha"); str.find_first_of(u8"haha");` seems to work, `str.find_first_of(u8"哈ha");` always returns 0. And regex doesn't seem to work either. – Saddle Point May 18 '18 at 04:08
  • 1
    @Edityouprofile, this is my mistake: I confused `find_first_of` with `find`. `find_first_of` cannot work with multi-byte characters. – zneak May 18 '18 at 04:09
  • 12
    "*For both, `size` tracks the number of code points*" - wrong, it represents **code units**, not **code points**. Big difference. "*instead of the number of logical characters. (Logical characters are one or more code points.)*" - also known more formally as a Grapheme Cluster. – Remy Lebeau May 18 '18 at 05:14
  • 2
    I don't think that the standard *requires* `std::string` to be in UTF8, even if we tend to have [UTF8 everywhere](http://utf8everywhere.org/). I guess that an EBCDIC mainframe might use EBCDIC for `std::string` – Basile Starynkevitch May 18 '18 at 11:30
  • 18
    `std::string` doesn't "use" any encoding, neither UTF-8 nor EBCDIC. `std::string` is just a container for bytes of types `char`. You can put UTF-8 strings in there, or ASCII strings, or EBCDIC strings, or even binary data. The encoding of those bytes (if any) is determined by the rest of your program and what you do with the string, not by `std::string` itself. – Jonathan Wakely May 18 '18 at 11:34
  • @JonathanWakely, you are correct. If you were to use Unicode in string or wstring, those are the encodings that you would end up with, by necessity, but you could be using mostly any other encoding that you prefer. Those are out of scope for this question. – zneak May 18 '18 at 14:04
  • 1
    `str.find_first_of(u8"哈ha")` is wrong, but by coincidence, it is showing the right result. Note that the input is a set of characters. `find_first_of("123")` does not look for `"123"`, it looks for either `'1'`, `'2'`, or `'3'`. Here `u8"哈ha"` is 5 bytes, and `find_first_of` will look for any of those bytes. It finds one at position zero and reports it. To use this function safely, make sure the argument is ASCII. `str` itself may contain any Unicode character. – Barmak Shemirani May 22 '18 at 20:17
3

Should I go ahead with std::string or switch to std::wstring?

I would recommend using std::string because wchar_t is non-portable and C++20 char8_t is poorly supported in the standard and not supported by any system APIs at all (and likely never will be, for compatibility reasons). On most platforms, including the macOS you are using, normal char strings are already UTF-8.

Most of the standard string operations work with UTF-8 but operate on code units. If you want a higher-level API you'll have to use something else such as the text library proposed to Boost.

vitaut