
I've noticed the length method of std::string returns the length in bytes and the same method in std::u16string returns the number of 2-byte sequences.

I've also noticed that when a character or code point is outside of the BMP, it takes up 4 bytes rather than 2: std::string's length() counts it as four UTF-8 bytes, and std::u16string's length() counts it as two code units (a surrogate pair).

Furthermore, the Unicode escape sequence is limited to \unnnn, so any code point above U+FFFF cannot be inserted by an escape sequence.

In other words, there doesn't appear to be support for surrogate pairs or code points outside of the BMP.

Given this, is the accepted or recommended practice to use a non-standard string manipulation library that understands UTF-8, UTF-16, surrogate pairs, and so on?

Does my compiler have a bug, or am I using the standard string manipulation methods incorrectly?

Example:

/*
* Example with the Unicode code points U+0041, U+4061, U+10196 and U+10197
*/

#include <iostream>
#include <string>

int main(int argc, char* argv[])
{
    std::string example1 = u8"A䁡𐆖𐆗";
    std::u16string example2 = u"A䁡𐆖𐆗";

    std::cout << "Escape Example: " << "\u0041\u4061\u10196\u10197" << "\n";
    std::cout << "Example: " << example1 << "\n";
    std::cout << "std::string Example length: " << example1.length() << "\n";
    std::cout << "std::u16string Example length: " << example2.length() << "\n";

    return 0;
}

Here is the result I get when compiled with GCC 4.7:

Escape Example: A䁡မ6မ7
Example: A䁡𐆖𐆗
std::string Example length: 12
std::u16string Example length: 6

3 Answers


std::basic_string is code unit oriented, not character oriented. If you need to deal with code points you can convert to char32_t, but there's nothing in the standard for more advanced Unicode functionality yet.
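Here is a minimal sketch of what such a conversion can look like: a hand-rolled decoder (the function name is mine, not something from the standard library), assuming well-formed UTF-16 input:

#include <cstddef>
#include <string>

// Decode UTF-16 code units into code points, merging surrogate pairs.
std::u32string to_code_points(const std::u16string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        char32_t c = s[i];
        // High surrogate followed by a low surrogate: combine into one code point.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
        {
            c = 0x10000 + ((c - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i; // the low surrogate has been consumed as well
        }
        out.push_back(c);
    }
    return out;
}

// For the question's example2, to_code_points(example2).size() is 4.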

Also you can use the \UNNNNNNNN escape sequence for non-BMP code points, in addition to typing them in directly (assuming you're using a source encoding that supports them).
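For instance, the escape example from the question can be written with eight-digit escapes (the variable names are just for illustration):

std::string narrow = u8"\u0041\u4061\U00010196\U00010197"; // length() == 12 UTF-8 code units
std::u16string wide = u"\u0041\u4061\U00010196\U00010197"; // length() == 6 UTF-16 code units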

Depending on your needs this may be all the Unicode support you need. A lot of software doesn't need to do more than basic manipulations of strings, such as those that can easily be done on code units directly. For slightly higher level needs you can convert code units to code points and work on those. For higher level needs, such as working on grapheme clusters, additional support will be needed.
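As a sketch of that kind of code-unit-level work, counting the code points in a UTF-8 std::string only requires skipping continuation bytes (the function below is just an illustration, not a standard facility, and assumes well-formed UTF-8):

#include <cstddef>
#include <string>

// Count code points by counting the bytes that start a UTF-8 sequence.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80) // skip 10xxxxxx continuation bytes
            ++n;
    return n;
}

// For the question's example1 this yields 4, while example1.length() is 12.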

I would say this means there's adequate support in the standard for representing Unicode data and performing basic manipulation. Whatever third party library is used for higher level functionality should build on the standard library. As time goes on the standard is likely to subsume more of that higher level functionality as well.

bames53

At the risk of judging prematurely, it seems to me that the language used in the standards is slightly ambiguous (although the final conclusion is clear; see the end):

In the description of char16_t literals (i.e. the u"..." ones like in your example), the size of a literal is defined as:

The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u’\0’.

And the footnote further clarifies:

[ Note: The size of a char16_t string literal is the number of code units, not the number of characters. —end note ]

This implies a definition of character and code unit. A surrogate pair is one character, but two code units.
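A small illustration of that wording, using the array bound of a literal (code units plus the terminating u'\0'):

// U+10196 is outside the BMP: one character, two char16_t code units, plus u'\0'.
static_assert(sizeof(u"\U00010196") / sizeof(char16_t) == 3,
              "surrogate pair = 2 code units + 1 terminator");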

However, the description of the length() method of std::basic_string (of which std::u16string is a specialization) claims:

Returns the number of characters in the string, i.e. std::distance(begin(), end()). It is the same as size().

As it appears, the description of length() uses the word character to mean what the definition of char16_t calls a code unit.

However, the conclusion of all of this is: The length is defined as code units, hence your compiler complies with the standard, and there will be continued demand for special libraries to provide proper counting of characters.

I used the references below:

  • For the definition of the size of char16_t literals: Here
  • For the description of std::basic_string::length(): Here
jogojapan
  • Thanks for the answer. I'm also interested in other string manipulation methods like substr and how they handle UTF-8, UTF-16, surrogate pairs, etc. I should have been more clear. I used length because it was the easiest example to post. –  Feb 28 '12 at 05:40
  • @Ragsdale 30 cal Right. I suppose we will have to accept that all these methods operate on code units, not characters, despite the somewhat misleading descriptions. Iterators are another good example. – jogojapan Feb 28 '12 at 05:50
  • So in other words, the only standard way to work with Unicode is to convert text to UTF-32 and use std::u32string? That seems rather wasteful. –  Feb 28 '12 at 06:04
  • @Ragsdale 30 cal It certainly depends. I personally work with Japanese, where surrogate characters are very rare. I often simply assume they never occur. The odd length count may be wrong because of that. For `substr` I actually think there is no problem as long as both strings involved are well-formed UTF16. Iterators can cause nasty issues, though: You have to test each element for whether it is in the surrogate range and possibly merge it with the following element. Or make a custom iterator like [here](http://members.shaw.ca/akochoi/articles/unicode-processing-c++0x/index.html). – jogojapan Feb 28 '12 at 06:30
  • "Strictly speaking, this means the description of length() uses the word character wrongly" - I think it means the standard defines the word "character" differently in different places. A `u16string` is not a `char16_t` string literal, so the context is different. In the context of `std::basic_string`, and hence `u16string` too, a "character" is one of whatever type the `charT` template parameter is. I agree that it's a bit confusing to do this, but the usage in `basic_string` was there first, and the description of Unicode string literals and their "characters" is more recent. – Steve Jessop Feb 28 '12 at 08:46
  • @Steve Jessop True. I shall replace "wrongly" with "to mean what the `char16_t` definition calls 'code unit'". – jogojapan Feb 28 '12 at 09:36
  • People are starting to think about unicode in the standard library for TR2: http://www.open-std.org/JTC1/sc22/WG21/docs/papers/2012/n3336.html#Problem-1. Somewhere there it mentions that UTF-32 is probably the best target for lossless conversion. I'm no expert here though. – emsr Feb 28 '12 at 20:46

Given this, is the accepted or recommended practice to use a non-standard string manipulation library that understands UTF-8, UTF-16, surrogate pairs, and so on?

It's hard to talk about recommended practice for a language standard that was created a few months ago and isn't fully implemented yet, but in general I would agree: the locale and Unicode features in C++11 are still hopelessly inadequate (although they obviously got a lot better), and for serious work, you should drop them and use ICU or Boost.Locale instead.
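As a rough sketch of what that looks like with ICU (assuming ICU is installed; icu::UnicodeString stores its text as UTF-16 internally):

#include <unicode/unistr.h> // ICU, assumed to be available

int main()
{
    icu::UnicodeString us =
        icu::UnicodeString::fromUTF8(u8"\u0041\u4061\U00010196\U00010197");
    int32_t units  = us.length();      // 6: UTF-16 code units, like std::u16string
    int32_t points = us.countChar32(); // 4: actual code points
    return 0;
}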

The addition of Unicode strings and conversion functions to C++11 is the first step towards real Unicode support; time will tell whether they turn out to be useful or whether they will be forgotten.
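For reference, the conversion facilities meant here are presumably std::wstring_convert and the facets in <codecvt>; a minimal sketch of a UTF-8/UTF-16 round trip could look like this (note that GCC 4.7 may not ship <codecvt> yet, so treat it as illustrative):

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // UTF-8 (bytes) <-> UTF-16 (char16_t) conversion as specified in C++11.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    std::u16string utf16 = conv.from_bytes(u8"\u0041\u4061\U00010196\U00010197");
    std::string    utf8  = conv.to_bytes(utf16);

    // utf16.length() == 6 code units, utf8.length() == 12 bytes
    return 0;
}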

Philipp