191

I've read and heard that C++11 supports Unicode. A few questions on that:

  • How well does the C++ standard library support Unicode?
  • Does std::string do what it should?
  • How do I use it?
  • Where are potential problems?
R. Martinho Fernandes
  • 228,013
  • 71
  • 433
  • 510
Ralph Tandetzky
  • 22,780
  • 11
  • 73
  • 120
  • 1
    Possibly useful: http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring - first result from googling "std:string unicode C++". – Dan Puzey Jun 14 '13 at 08:13
  • 22
    "Does std::string do what it should?" What do you think it should do? – R. Martinho Fernandes Jun 14 '13 at 10:18
  • 2
    I use http://utfcpp.sourceforge.net/ for my utf8 needs. Its a simple header file which provides iterators for unicode strings. – fscan Jun 14 '13 at 13:09
  • 2
    std::string should store bytes, i.e. code unit sequence of UTF-8 encoding. Yes, it does just that, since the beginning. http://utf8everywhere.org – Pavel Radzivilovsky Jun 15 '13 at 15:42
  • 4
    The biggest potential problems with Unicode support lie within Unicode and its use in information technology itself. Unicode is not suitable (and not designed) for what it's used for. Unicode is designed to reproduce every possible glyph that has been written somewhere by someone, at some time with every unlikely and pedantic nuance possible, including 3 or 4 different meanings and 3 or 4 different ways of composing the same glyph. It's not meant to be useful for being used for everyday language, and it's not meant to be applicable or to be easily or unambiguously processed. – Damon Jun 26 '13 at 11:04
  • 14
    Yes it is designed for being used for everyday language. Mine at least. And yours most probably too. It just turns out that processing human text in a general way is a very difficult task. It's not even possible to define unambiguously what a character is. General glyph reproduction is not even really part of the Unicode charter. – Jean-Denis Muys Aug 22 '13 at 14:17
  • 1
    @PavelRadzivilovsky - Very good point. It does indeed store bytes and they do not have to be text characters. string::data allows for access to them. –  Sep 26 '13 at 11:44
  • 1
    @R.MartinhoFernandes, I am going to add a question to "what std::string should do": JSONcpp uses std::string, so will it break when a utf-8 char contains 0x22 or 0x2c in the multiple byte content? – Splash Nov 07 '15 at 23:44
  • 3
    0x22 and 0x2c never appear in multiple byte sequences. UTF-8 was designed so that each byte is only ever one of { single byte sequence, start of multiple byte sequence, continuation of multiple byte sequence }. So 0x22 always means U+0022 and 0x2c always means U+002C. Regardless, I would expect any such library to handle this properly (i.e. if it didn't, I'd blame the library, not `std::string`; `std::string` does everything it should) – R. Martinho Fernandes Nov 09 '15 at 08:47

5 Answers

280

How well does the C++ standard library support Unicode?

Terribly.

A quick scan through the library facilities that might provide Unicode support gives me this list:

  • Strings library
  • Localization library
  • Input/output library
  • Regular expressions library

I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.

Does std::string do what it should?

Yes. According to the C++ standard, this is what std::string and its siblings should do:

The class template basic_string describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.

Well, std::string does that just fine. Does that provide any Unicode-specific functionality? No.

Should it? Probably not. std::string is fine as a sequence of char objects. That's useful; the only annoyance is that it is a very low-level view of text and standard C++ doesn't provide a higher-level one.

How do I use it?

Use it as a sequence of char objects; pretending it is something else is bound to end in pain.
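
To illustrate (a minimal sketch of my own, not from the answer): byte-level operations on a std::string are well-defined, but character-level assumptions break down immediately. Note the u8 literal is const char[] in C++11/14/17 (C++20 changes it to char8_t).

#include <iostream>
#include <string>

int main() {
    std::string s = u8"\u00F1";           // "ñ": two UTF-8 code units, 0xC3 0xB1
    std::cout << s.size() << '\n';        // 2 -- bytes, not characters
    // Pretending bytes are characters: this slices a multibyte
    // sequence in half and yields invalid UTF-8.
    std::string broken = s.substr(0, 1);  // { 0xC3 }
    std::cout << broken.size() << '\n';   // 1
}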

Where are potential problems?

All over the place? Let's see...

Strings library

The strings library provides us basic_string, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.

It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: c16rtomb/mbrtoc16 and c32rtomb/mbrtoc32.
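
For example, here is a minimal sketch of decoding a narrow multibyte string into code points with mbrtoc32. It assumes the environment provides a UTF-8 locale; the name "en_US.UTF-8" is an assumption, as locale names vary by platform.

#include <clocale>   // setlocale
#include <cstddef>
#include <cstdio>
#include <cstring>   // strlen
#include <cuchar>    // mbrtoc32
#include <cwchar>    // mbstate_t

int main() {
    // Assumes a UTF-8 locale is installed; otherwise the narrow
    // encoding is something else and the loop decodes that instead.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const char* p = u8"a\u00F1\U0001F34C";  // 'a', 'ñ', U+1F34C as UTF-8 bytes
    const char* end = p + std::strlen(p);
    std::mbstate_t state{};
    char32_t c32;
    while (p < end) {
        std::size_t rc = std::mbrtoc32(&c32, p, end - p, &state);
        if (rc == (std::size_t)-1 || rc == (std::size_t)-2)
            break;                           // invalid or incomplete sequence
        std::printf("U+%04X\n", (unsigned)c32);
        p += rc;                             // rc bytes were consumed
    }
}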

Localization library

The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.

Consider, for example, what the standard calls "convenience interfaces" in the <locale> header:

template <class charT> bool isspace (charT c, const locale& loc);
template <class charT> bool isprint (charT c, const locale& loc);
template <class charT> bool iscntrl (charT c, const locale& loc);
// ...
template <class charT> charT toupper(charT c, const locale& loc);
template <class charT> charT tolower(charT c, const locale& loc);
// ...

How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in u8"🍌" or u8"\U0001F34C"? There's no way it will ever work, because those functions take only one code unit as input.

This could work with an appropriate locale if you used char32_t only: U'\U0001F34C' is a single code unit in UTF-32.

However, that still means you only get the simple casing transformations with toupper and tolower, which, for example, are not good enough for some German locales: "ß" uppercases to "SS"☦, but toupper can return only one code unit.
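
A sketch of that limitation, assuming a de_DE.UTF-8 locale is installed (locale names and availability are platform-dependent):

#include <iostream>
#include <locale>
#include <stdexcept>

int main() {
    try {
        std::locale de("de_DE.UTF-8");  // availability is platform-dependent
        wchar_t ch = L'\u00DF';         // ß
        // One code unit in, one code unit out: this interface cannot
        // produce the correct two-character uppercase form "SS".
        wchar_t up = std::toupper(ch, de);
        std::wcout << std::boolalpha << (up == ch) << L'\n';  // true: unchanged
    } catch (const std::runtime_error&) {
        // The locale isn't installed on this system.
    }
}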

Next up, wstring_convert/wbuffer_convert and the standard code conversion facets.

wstring_convert is used to convert between strings in one given encoding into strings in another given encoding. There are two string types involved in this transformation, which the standard calls a byte string and a wide string. Since these terms are really misleading, I prefer to use "serialized" and "deserialized", respectively, instead†.

The encodings to convert between are decided by a codecvt (a code conversion facet) passed as a template type argument to wstring_convert.

wbuffer_convert performs a similar function but as a wide deserialized stream buffer that wraps a byte serialized stream buffer. Any I/O is performed through the underlying byte serialized stream buffer with conversions to and from the encodings given by the codecvt argument. Writing serializes into that buffer, and then writes from it, and reading reads into the buffer and then deserializes from it.
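
For instance, a minimal sketch of wbuffer_convert wrapping an in-memory byte buffer (note that these facilities were deprecated in C++17):

#include <codecvt>   // codecvt_utf8 (deprecated in C++17)
#include <iostream>
#include <locale>    // wbuffer_convert
#include <sstream>

int main() {
    std::stringstream bytes;  // the byte (serialized) end

    // Wrap the byte streambuf so the program writes wide (deserialized)
    // units; the facet serializes them to UTF-8 on the way through.
    std::wbuffer_convert<std::codecvt_utf8<wchar_t>> conv(bytes.rdbuf());
    std::wostream wide(&conv);

    wide << L"z\u00DF" << std::flush;         // 'z' and ß
    std::cout << bytes.str().size() << '\n';  // 3 bytes: 1 + 2
}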

The standard provides some codecvt class templates for use with these facilities: codecvt_utf8, codecvt_utf16, codecvt_utf8_utf16, and some codecvt specializations. Together these standard facets provide all the following conversions. (Note: in the following list, the encoding on the left is always the serialized string/streambuf, and the encoding on the right is always the deserialized string/streambuf; the standard allows conversions in both directions).

  • UTF-8 ↔ UCS-2 with codecvt_utf8<char16_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 2;
  • UTF-8 ↔ UTF-32 with codecvt_utf8<char32_t>, codecvt<char32_t, char, mbstate_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 4;
  • UTF-16 ↔ UCS-2 with codecvt_utf16<char16_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 2;
  • UTF-16 ↔ UTF-32 with codecvt_utf16<char32_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 4;
  • UTF-8 ↔ UTF-16 with codecvt_utf8_utf16<char16_t>, codecvt<char16_t, char, mbstate_t>, and codecvt_utf8_utf16<wchar_t> where sizeof(wchar_t) == 2;
  • narrow ↔ wide with codecvt<wchar_t, char, mbstate_t>;
  • no-op with codecvt<char, char, mbstate_t>.

Several of these are useful, but there is a lot of awkward stuff here.
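
As a concrete example, here is a minimal sketch of one conversion that does behave sensibly, UTF-8 ↔ UTF-32 via codecvt_utf8<char32_t> (again, deprecated in C++17):

#include <codecvt>   // codecvt_utf8 (deprecated in C++17)
#include <iostream>
#include <locale>    // wstring_convert
#include <string>

int main() {
    // Left end serialized (UTF-8 bytes), right end deserialized (UTF-32).
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;

    std::u32string text = conv.from_bytes(u8"z\u00DF\U0001F34C");
    std::cout << text.size() << '\n';   // 3 code points

    std::string bytes = conv.to_bytes(text);
    std::cout << bytes.size() << '\n';  // 7 bytes: 1 + 2 + 4
}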

First off—holy high surrogate! that naming scheme is messy.

Then, there's a lot of UCS-2 support. UCS-2 is an encoding from Unicode 1.0 that was superseded in 1996 because it only supports the basic multilingual plane. Why the committee thought it desirable to focus on an encoding that was superseded over 20 years ago, I don't know‡. It's not like support for more encodings is bad or anything, but UCS-2 shows up too often here.

I would say that char16_t is obviously meant for storing UTF-16 code units. However, this is one part of the standard that thinks otherwise. codecvt_utf8<char16_t> has nothing to do with UTF-16. For example, wstring_convert<codecvt_utf8<char16_t>>().to_bytes(u"\U0001F34C") will compile fine, but will fail unconditionally: the input will be treated as the UCS-2 string u"\xD83C\xDF4C", which cannot be converted to UTF-8 because UTF-8 cannot encode any value in the range 0xD800-0xDFFF.

Still on the UCS-2 front, there is no way to read from an UTF-16 byte stream into an UTF-16 string with these facets. If you have a sequence of UTF-16 bytes you can't deserialize it into a string of char16_t. This is surprising, because it is more or less an identity conversion. Even more surprising, though, is the fact that there is support for deserializing from an UTF-16 stream into an UCS-2 string with codecvt_utf16<char16_t>, which is actually a lossy conversion.

The UTF-16-as-bytes support is quite good, though: it supports detecting endianness from a BOM, or selecting it explicitly in code. It also supports producing output with and without a BOM.
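
A sketch of the BOM handling: deserializing little-endian UTF-16 bytes into UTF-32 with the consume_header mode flag (deprecated in C++17 like the rest of these facets):

#include <codecvt>   // codecvt_utf16, consume_header (deprecated in C++17)
#include <iostream>
#include <locale>    // wstring_convert
#include <string>

int main() {
    // consume_header: read a BOM if present and use it to pick endianness.
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10FFFF, std::consume_header>,
        char32_t> conv;

    // 0xFF 0xFE is a little-endian BOM, then U+00F1 as the bytes F1 00.
    // (The explicit length matters: the sequence contains a zero byte.)
    std::string bytes("\xFF\xFE\xF1\x00", 4);
    std::u32string text = conv.from_bytes(bytes);
    std::cout << text.size() << '\n';  // 1 code point: U+00F1
}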

Some more interesting conversion possibilities are absent: there is no way to deserialize from an UTF-16 byte stream or string into a UTF-8 string, since UTF-8 is never supported as the deserialized form.

And here the narrow/wide world is completely separate from the UTF/UCS world. There are no conversions between the old-style narrow/wide encodings and any Unicode encodings.

Input/output library

The I/O library can be used to read and write text in Unicode encodings using the wstring_convert and wbuffer_convert facilities described above. I don't think there's much else that would need to be supported by this part of the standard library.

Regular expressions library

I have expounded upon problems with C++ regexes and Unicode on Stack Overflow before. I will not repeat all those points here, but merely state that C++ regexes don't have level 1 Unicode support, which is the bare minimum to make them usable without resorting to using UTF-32 everywhere.
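
One concrete symptom, sketched below: char-based regexes match code units, so even '.' fails on a single non-ASCII character.

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string nya = u8"\u00F1";  // "ñ": one code point, two UTF-8 code units

    // '.' matches one code *unit*, not one code point.
    std::cout << std::boolalpha
              << std::regex_match(nya, std::regex("^.$"))  << '\n'   // false
              << std::regex_match(nya, std::regex("^..$")) << '\n';  // true
}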

That's it?

Yes, that's it. That's the existing functionality. There's lots of Unicode functionality that is nowhere to be seen, like normalization or text segmentation algorithms. U+1F4A9.

Is there any way to get some better Unicode support in C++?

The usual suspects: ICU and Boost.Locale.


† A byte string is, unsurprisingly, a string of bytes, i.e., char objects. However, unlike a wide string literal, which is always an array of wchar_t objects, a "wide string" in this context is not necessarily a string of wchar_t objects. In fact, the standard never explicitly defines what a "wide string" means, so we're left to guess the meaning from usage. Since the standard terminology is sloppy and confusing, I use my own, in the name of clarity.

Encodings like UTF-16 can be stored as sequences of char16_t, which then have no endianness; or they can be stored as sequences of bytes, which have endianness (each consecutive pair of bytes can represent a different char16_t value depending on endianness). The standard supports both of these forms. A sequence of char16_t is more useful for internal manipulation in the program. A sequence of bytes is the way to exchange such strings with the external world. The terms I'll use instead of "byte" and "wide" are thus "serialized" and "deserialized".

‡ If you are about to say "but Windows!" hold your 🐎🐎. All versions of Windows since Windows 2000 use UTF-16.

☦ Yes, I know about the großes Eszett (ẞ), but even if you were to change all German locales overnight to have ß uppercase to ẞ, there's still plenty of other cases where this would fail. Try uppercasing U+FB00 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟɪɢᴀᴛᴜʀᴇ ғғ. There is no ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟɪɢᴀᴛᴜʀᴇ ғғ; it just uppercases to two Fs. Or U+01F0 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴊ ᴡɪᴛʜ ᴄᴀʀᴏɴ; there's no precomposed capital; it just uppercases to a capital J and a combining caron.

R. Martinho Fernandes
  • 228,013
  • 71
  • 433
  • 510
  • 29
    The more I read about it, the more I got the feeling to not understand a thing about all this. I read most of this stuff a couple months ago and still feel like I'm discovering the whole thing all over again... To keep it simple for my poor brain that now hurts a bit, all these advices on [utf8everywhere](http://utf8everywhere.org/#how) are still valid, right? If I "just" want my users to be able to open and write files no matter their system settings I can ask them the file name, store it in a std::string and everything should work properly, even on Windows? Sorry to ask that (again)... – Uflex Jun 18 '13 at 14:42
  • 6
    @Uflex All you can *really* do with std::string is to treat it as a binary blob. In a proper Unicode implementation neither the internal (because it's hidden deep in implementation details) nor external encoding matters (well, sorta, you still need to have encoder/decoder available). – Cat Plus Plus Jun 18 '13 at 15:17
  • 3
    @Uflex maybe. I don't know if following advice you don't understand is a good idea. – R. Martinho Fernandes Jun 18 '13 at 15:51
  • 1
    There is a proposal for Unicode support in C++ 2014/17. However that is 1, maybe 4 years away and of little use now. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3572.html – graham.reeds Jun 26 '13 at 10:19
  • 21
    @graham.reeds haha, thanks, but I was aware of that. Check the "Acknowledgments" section ;) – R. Martinho Fernandes Jun 26 '13 at 10:22
  • 1
    Yeah, I skimmed over it a few weeks ago. Didn't notice your name at the end (or I did but didn't twig). It lets people know that efforts are happening towards the direction. – graham.reeds Jun 26 '13 at 12:33
  • 1
    @R.MartinhoFernandes "Input/output library ... I don't think there's anything else to would need to be supported by this part of the standard library." Not true. The standard fstream(char*) functions that accept a char* filename seem to take implementation-defined encodings. On Windows, this is the system-default codepage and you can't change it; you get undesirable conversions. On Windows you *have* to use MSFT-proprietary wchar_t* filenames. On Linux, the char* filenames are typically assumed to be UTF-8. – James Johnston Oct 09 '14 at 22:54
  • Question about your answer, to make certain I'm understanding correctly. When you say **There is no way to read from an UTF-16 byte stream or string into a UTF-8 string, since UTF-8 is never supported at the wide end.**, do you mean that if you have a `wchar[]` and for some reason you want to store a string as UTF-8 inside it (i.e., you just want to treat the `wchar[]` as simply a buffer, not caring about the fact that it's defined as an array of `wchar`), you can't? – Dan Nissenbaum Dec 21 '14 at 21:36
  • No. `wstring_convert` always converts from what the standard calls a 'byte' string to what the standard calls a 'wide' string. This choice of words is quite confusing (I should reword the answer, I guess). A better choice would be 'serialized' and 'deserialized'. 'Serialized' is always as bytes meant for use external to your program (a file, the network, whatever), and 'deserialized' is always as whatever your program wants to use internally. In the list of conversions I gave the left is serialized and the right is deserialized. There is no provided way to deserialize anything into UTF-8... – R. Martinho Fernandes Dec 22 '14 at 04:18
  • 2
    ... no matter how you want to store it (but `wchar_t` would be stupid). Unless, of course, it's an identity conversion. This whole thing just feels yucky. Whoever designed this had no idea what they were doing, and the committee approved it :( – R. Martinho Fernandes Dec 22 '14 at 04:23
  • @Dan does that make it clearer? If so, I'll put it in the answer. – R. Martinho Fernandes Dec 22 '14 at 04:30
  • Thanks. Do I understand correctly that "wide" corresponds to "serialized to external byte string", and "narrow" corresponds to "deserialized to internal wide string"? If so, I think it would help to give one sentence explaining why "wide" means "external byte string" and "narrow" means "internal wide string" - is the internal buffer *always* required to be `wchar_t`? (If not, why is it always called "wide"?) One last point: I notice a two-way arrow in your list of codecvt class templates (i.e., UTF-8 <--> UCS-2); perhaps you could reiterate that -> means deserializing and <- serializing. – Dan Nissenbaum Dec 23 '14 at 05:28
  • 1
    Thanks for the feedback! I'll update the answer when I'm not browsing from a phone. But you misunderstood (could be my fault). The bytes end is always with `char` and is the external end. The "wide" end can be of any type and is the internal end. "Narrow/wide" is really not the same thing. The standard just uses really confusing terminology for orthogonal concepts. I'll be extra careful updating the answer to make sure I steer clear of that confusion. – R. Martinho Fernandes Dec 23 '14 at 07:11
  • 1
    @DanNissenbaum I edited the answer, tiptoeing around the terminology. – R. Martinho Fernandes Dec 23 '14 at 08:01
  • The Boost.Locale URL has a trailing `1` (and, apparently, I can't edit it because SO rejects edits that are less than 6 chars). – Alexandros Mar 18 '16 at 19:29
  • @Alexandros thanks for noticing! I fixed it. (You can do 1-char edits after a certain rep level, if I recall correctly) – R. Martinho Fernandes Mar 19 '16 at 10:57
  • A 🐴 is a 🐴, of course of course, and no one can talk to a 🐴 of course, that is of course, unless the 🐴, Is the famous Mister Ed! – Jason Hutchinson Jun 21 '16 at 18:45
  • @R.MartinhoFernandes are you sure there's no UTF-16 → UTF-8 conversion? [This says otherwise](http://coliru.stacked-crooked.com/a/261f0187ada65f80). – bit2shift Mar 15 '17 at 02:08
  • `u8""` is a bad example imho, because it can be anything as it relies on implementation defined behavior. is outside of character set defined in C++ ISO(2.3). Implementation can probably map `u8""` to `u8"\U0001F34C"` (or to anything it likes; and I'm not certain about one to many mapping, because I can't find "map" definition in the standard). The whole point about still holds though ^_^ – Sergey.quixoticaxis.Ivanov Apr 17 '18 at 17:49
  • 1
    @Sergey.quixoticaxis.Ivanov what do you mean by "C++ ISO(2.3)"? There's no leeway for implementation defined behaviour here. See http://eel.is/c++draft/lex.string#9. If you can type it, your implementation must make it `{ 0xF0, 0x9F, 0x8D, 0x8C, 0 }`. UTF-8 doesn't care what the character looks like: it encodes any thing from U+0000 to U+10FFFF (excluding U+D800 to U+DFFF), regardless of what meaning it has. So, unless you're using a source encoding that can't encode U+1F34C BANANA, there's no implementation-defined behaviour. – R. Martinho Fernandes Apr 20 '18 at 15:07
  • 1
    If you are using a source encoding that can't encode bananas, then you "can't type it", so there's no behaviour to discuss. If you mean that the current normative references refer to ISO/IEC 10646-1:1993, which is pre-Unicode 2.0, and thus only covers code points up to U+FFFF, then yes, technically the standard doesn't prescribe the behaviour. However, it's pretty clear that ISO/IEC 10646-1:1993 is intended to be included with all amendments, since the standard refers to things like "UTF-8", "UTF-16", "surrogate pairs", and the 0000-10FFFF range. This is a defect we (SG16) are working on atm. – R. Martinho Fernandes Apr 20 '18 at 15:11
  • 1
    @bit2shift There's some confusion. Note I said UTF-8 isn't supported as the "deserialized" form (implying that UTF-16 is the "serialized" end of the conversion, meaning it comes as bytes, not as 16-bit units). I edited the text for clarity now to say "from a UTF-16 **byte** stream". If you have a stream of bytes, you can't decode those to UTF-8 (you can't even decode those to a `u16string` because `codecvt_utf16` converts to UCS-2, against all odds) – R. Martinho Fernandes Apr 20 '18 at 15:16
  • (Well, you can decode to a u16string, but not to a UTF-16 u16string :D) – R. Martinho Fernandes Apr 20 '18 at 15:24
  • @R.MartinhoFernandes [lex.charset] defines all characters that are usable by the language, everything from the source file is mapped to this character set according to implementation defined rules ([lex.phases]). Anything outside this set is converted to universal character names. Maybe I miss something, but I can't see any rule that prevents my implementation from mapping the 🍌 character to, for example, letter 'a', thus exchanging u8"🍌" to u8"a". As far as I understand, [lex.string] speaks about language after the first phase of physical file mapping. – Sergey.quixoticaxis.Ivanov Apr 20 '18 at 15:29
  • @R.MartinhoFernandes sorry about 2.3. I was reading a copy of C++11 standard and [lex.charset] is lex 2.3. My bad. – Sergey.quixoticaxis.Ivanov Apr 20 '18 at 15:32
  • @R.MartinhoFernandes the same goes for non-unicode. Nothing prevents an implementation from mapping "abc" in the physical file to "that is my favorite string". Because my physical file can be in any form (for example, sound or pictures), it only needs to be mapped to [lex.charset]. – Sergey.quixoticaxis.Ivanov Apr 20 '18 at 15:41
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/169454/discussion-between-r-martinho-fernandes-and-sergey-quixoticaxis-ivanov). – R. Martinho Fernandes Apr 20 '18 at 15:52
41

Unicode is not supported by the Standard Library (for any reasonable meaning of supported).

std::string is no better than std::vector<char>: it is completely oblivious to Unicode (or any other representation/encoding) and simply treats its contents as a blob of bytes.

If you only need to store and catenate blobs, it works pretty well; but as soon as you wish for Unicode functionality (number of code points, number of graphemes etc) you are out of luck.

The only comprehensive library I know of for this is ICU. The C++ interface was derived from the Java one though, so it's far from being idiomatic.

informatik01
  • 16,038
  • 10
  • 74
  • 104
Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
  • 2
    How about [Boost.Locale](http://www.boost.org/doc/libs/1_53_0/libs/locale/doc/html/index.html)? – Uflex Jun 14 '13 at 10:01
  • 11
    @Uflex: from the page you linked *In order to achieve this goal Boost.Locale uses the-state-of-the-art Unicode and Localization library: ICU - International Components for Unicode.* – Matthieu M. Jun 14 '13 at 13:42
  • 1
    Boost.Locale supports other non-ICU backends, see here: http://www.boost.org/doc/libs/1_53_0/libs/locale/doc/html/using_localization_backends.html – Superfly Jon Jul 26 '16 at 15:05
  • @SuperflyJon: True, but according to that same page, the support for Unicode of the non-ICU backends is "severely limited". – Matthieu M. Jul 26 '16 at 15:20
28

You can safely store UTF-8 in a std::string (or in a char[] or char*, for that matter), due to the fact that a Unicode NUL (U+0000) is a null byte in UTF-8 and that this is the sole way a null byte can occur in UTF-8. Hence, your UTF-8 strings will be properly terminated according to all of the C and C++ string functions, and you can sling them around with C++ iostreams (including std::cout and std::cerr, so long as your locale is UTF-8).

What you cannot do with std::string for UTF-8 is get length in code points. std::string::size() will tell you the string length in bytes, which is only equal to the number of code points when you're within the ASCII subset of UTF-8.
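
For illustration, a minimal sketch of counting code points by hand; it assumes the string holds valid UTF-8, where continuation bytes have the form 10xxxxxx:

#include <cstddef>
#include <iostream>
#include <string>

// Counts code points in a valid UTF-8 string by skipping
// continuation bytes (those matching 10xxxxxx).
std::size_t codepoint_count(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}

int main() {
    std::string s = u8"a\u00F1\U0001F34C";    // 'a' + 'ñ' + U+1F34C
    std::cout << s.size() << '\n';            // 7 bytes
    std::cout << codepoint_count(s) << '\n';  // 3 code points
}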

If you need to operate on UTF-8 strings at the code point level (i.e. not just store and print them) or if you're dealing with UTF-16, which is likely to have many internal null bytes, you need to look into the wide character string types.

informatik01
  • 16,038
  • 10
  • 74
  • 104
uckelman
  • 25,298
  • 8
  • 64
  • 82
  • 4
    `std::string` can be thrown into iostreams with embedded nulls just fine. – R. Martinho Fernandes Jun 14 '13 at 10:12
  • Is that according to the standard, or just an accident? I thought it would be kind of dodgy to try putting embedded nulls into a `std::string`, as that completely breaks `std::string::c_str()`. – uckelman Jun 14 '13 at 11:13
  • 3
    It's totally intended. It doesn't break `c_str()` at all because `size()` still works. Only broken APIs (i.e. those that cannot handle embedded nulls like most of the C world) break. – R. Martinho Fernandes Jun 14 '13 at 11:14
  • 2
    Embedded nulls break `c_str()` because `c_str()` is supposed to return the data as a null-terminated C string---which is impossible, due to the fact that C strings cannot have embedded nulls. – uckelman Jun 14 '13 at 11:16
  • 4
    Not anymore. `c_str()` now simply returns the same as `data()`, i.e. all of it. APIs that take a size can consume it. APIs that don't, cannot. – R. Martinho Fernandes Jun 14 '13 at 11:25
  • 6
    With the slight difference that `c_str()` makes sure the result is followed by a NUL char-like object, and I don't think `data()` does. Nope, looks like `data()` now does that too. (Of course, this is not necessary for APIs that consume the size instead of inferring it from a terminator search) – Ben Voigt Jun 15 '13 at 20:52
  • 1
    Wide characters are UTF-16, not UTF-32, on Windows. Which makes them no better... and actually worse in some ways, since you'll probably test your program with BMP characters, conclude it works, then it will fail later when someone tries to use a non-BMP character. – user253751 Jul 04 '15 at 11:05
8

C++11 has a couple of new literal string types for Unicode.

Unfortunately the support in the standard library for non-uniform encodings (like UTF-8) is still bad. For example there is no nice way to get the length (in code-points) of an UTF-8 string.
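
A sketch of the C++11 literal forms and their element types (u8 literals are char-based until C++20, which changes them to char8_t):

#include <iostream>
#include <string>

int main() {
    const char*     a = u8"\u00F1";  // UTF-8 code units (char in C++11/14/17)
    const char16_t* b = u"\u00F1";   // UTF-16 code units
    const char32_t* c = U"\u00F1";   // UTF-32 code units
    (void)a; (void)b; (void)c;

    std::u16string s16 = u"\U0001F34C";  // stored as a surrogate pair
    std::u32string s32 = U"\U0001F34C";  // stored as a single code unit
    std::cout << s16.size() << '\n';     // 2
    std::cout << s32.size() << '\n';     // 1
}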

Some programmer dude
  • 400,186
  • 35
  • 402
  • 621
  • So do we still need to use std::wstring for file names if we want to support non-latin languages? Because the new string literals don't really help here as the string usually come from the user... – Uflex Jun 14 '13 at 08:38
  • 7
    @Uflex `std::string` can *hold* a UTF-8 string without problem, but e.g. the `length` method returns the number of bytes in the string and not the number of code-points. – Some programmer dude Jun 14 '13 at 09:07
  • 9
    To be honest, getting the length in code points of a string doesn't have many uses. The length in bytes can be used to correctly pre-allocate buffers, for example. – R. Martinho Fernandes Jun 26 '13 at 10:29
  • 2
    The number of code points in an UTF-8 string is not a very interesting number: One can write `ñ` as 'LATIN SMALL LETTER N WITH TILDE' (U+00F1) (which is one code point) or 'LATIN SMALL LETTER N' (U+006E) followed by 'COMBINING TILDE' (U+0303) which is two code points. – Martin Bonner supports Monica Sep 08 '16 at 07:25
  • All those comments about "you don't need this and you don't need that" like "number of code points unimportant" etc. sounds a bit fishy to me. Once you write a parser which is supposed to parse utf8 source code of sorts, it is up to the specification of the parser whether or not it considers ``LATIN SMALL LETTER N' `` == ``(U+006E) followed by 'COMBINING TILDE' (U+0303)``. – BitTickler Aug 13 '19 at 07:32
5

However, there is a pretty useful library called tiny-utf8, which is basically a drop-in replacement for std::string/std::wstring. It aims to fill the gap of the still-missing UTF-8 string container class.

This might be the most comfortable way of dealing with UTF-8 strings (that is, without Unicode normalization and similar stuff). You comfortably operate on code points, while your string stays encoded in variable-length UTF-8 chars.
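
A sketch of the kind of usage the library documents; the header path, the class name (older releases shipped a utf8_string class), and the exact member names are assumptions that may differ between versions, so check the release you use:

// Identifiers below are assumptions based on tiny-utf8's documentation.
#include <tinyutf8/tinyutf8.h>
#include <iostream>

int main() {
    tiny_utf8::string s = u8"a\u00F1\U0001F34C";
    std::cout << s.length() << '\n';  // code points: 3
    for (char32_t cp : s)             // iteration yields code points
        std::cout << "U+" << std::hex << (unsigned long)cp << '\n';
}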

Jakob Riedle
  • 1,969
  • 1
  • 18
  • 21