I've noticed the length method of std::string returns the length in bytes and the same method in std::u16string returns the number of 2-byte sequences.
I've also noticed that when a character or code point is outside of the BMP, length counts it as more than one unit: 4 in std::string and 2 in std::u16string.
Furthermore, the Unicode escape sequence is limited to \unnnn, so any code point above U+FFFF cannot be inserted by an escape sequence.
In other words, there doesn't appear to be support for surrogate pairs or code points outside of the BMP.
Given this, is the accepted or recommended practice to use a non-standard string manipulation library that understands UTF-8, UTF-16, surrogate pairs, and so on?
Does my compiler have a bug or am I using the standard string manipulation methods incorrectly?
Example:
/*
 * Example with the Unicode code points U+0041, U+4061, U+10196 and U+10197
 */
#include <iostream>
#include <string>

int main(int argc, char* argv[])
{
    // The same four code points, stored as UTF-8 and as UTF-16.
    std::string example1 = u8"A䁡𐆖𐆗";
    std::u16string example2 = u"A䁡𐆖𐆗";

    // Attempt to write the same text with \u escapes.
    std::cout << "Escape Example: " << "\u0041\u4061\u10196\u10197" << "\n";
    std::cout << "Example: " << example1 << "\n";

    // length() counts code units: bytes for std::string, 16-bit units for std::u16string.
    std::cout << "std::string Example length: " << example1.length() << "\n";
    std::cout << "std::u16string Example length: " << example2.length() << "\n";
    return 0;
}
Here is the output I get when compiling with GCC 4.7:
Escape Example: A䁡မ6မ7
Example: A䁡𐆖𐆗
std::string Example length: 12
std::u16string Example length: 6
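To make the gap concrete, here is a minimal sketch of the kind of counting I would otherwise have to write by hand to get the number of code points rather than code units. The names count_code_points and make_example are hypothetical helpers of my own, not anything from the standard library, and the sketch assumes that an unpaired surrogate should simply be counted as one unit. The example string is built from explicit UTF-16 code units (U+10196 and U+10197 encode as the pairs D800 DD96 and D800 DD97) so that it does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the standard library): counts Unicode
// code points in a UTF-16 string by treating each high/low surrogate pair
// as a single code point. Assumption: an unpaired surrogate counts as one.
std::size_t count_code_points(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < s.size()) {
            const char16_t next = s[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                ++i; // the low surrogate belongs to the same code point
            }
        }
        ++count;
    }
    return count;
}

// Hypothetical helper: U+0041, U+4061, U+10196, U+10197 as UTF-16 code units.
std::u16string make_example()
{
    const char16_t units[] = {0x0041, 0x4061, 0xD800, 0xDD96, 0xD800, 0xDD97};
    return std::u16string(units, sizeof(units) / sizeof(units[0]));
}
```

For the string returned by make_example(), length() reports 6 code units while count_code_points reports 4 code points, which is the number I expected length to give me in the first place.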