
I come from Python, where you can use 'string[10]' to access a character in a sequence, and if the string is Unicode it gives the expected result. However, when I use indexing on a string in C++, it works as long as the characters are ASCII; but when I put a Unicode character inside the string and use indexing, the output is an octal representation like \201. For example:

string ramp = "ÐðŁłŠšÝýÞþŽž";
cout << ramp << "\n";    
cout << ramp[5] << "\n";

Output:

ÐðŁłŠšÝýÞþŽž
\201

Why is this happening, and how can I access that character in the string, or convert the octal representation to the actual character?

Bahman Eslami
  • I would recommend using [`std::wstring`](http://en.cppreference.com/w/cpp/string/basic_string) and [`std::wcout`](http://en.cppreference.com/w/cpp/io/cout) – Cory Kramer Jul 17 '15 at 12:01
  • @CoryKramer I would not recommend that unconditionally; see e.g. [this](http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring). I would rather recommend using an appropriate library. – Baum mit Augen Jul 17 '15 at 12:05
  • @BaummitAugen Good point. Character encoding makes me doubt everything I think I know about programming :/ – Cory Kramer Jul 17 '15 at 12:06
  • C++ doesn't have any real native Unicode support. – Puppy Jul 17 '15 at 12:08
  • @Puppy: [ICU](http://site.icu-project.org/) has. And C++ doesn't have native support for GUIs, or audio processing either, but that doesn't make it any less suited for the job. ;-) – DevSolar Jul 17 '15 at 12:26
  • @Puppy But C++ has the L, u8, and u16 prefixes for Unicode strings. – phuclv Jul 17 '15 at 12:27
  • @LưuVĩnhPhúc: That would be `u8`, `u`, and `U`. There is no `u16`. ;-) – DevSolar Jul 17 '15 at 13:04

5 Answers


Standard C++ is not equipped for proper handling of Unicode, giving you problems like the one you observed.

The problem here is that C++ predates Unicode by a comfortable margin. This means that even that string literal of yours will be interpreted in an implementation-defined manner because those characters are not defined in the Basic Source Character set (which is, basically, the ASCII-7 characters minus @, $, and the backtick).

C++98 does not mention Unicode at all. It does mention wchar_t (and wstring, which is based on it), specifying wchar_t as capable of "representing any character in the current locale". But that did more damage than good...

Microsoft defined wchar_t as 16 bit, which was enough for the Unicode code points of that time. However, Unicode has since been extended beyond the 16-bit range... and Windows' 16-bit wchar_t is not "wide" anymore, because you need two of them to represent characters beyond the BMP -- and the Microsoft docs are notoriously ambiguous as to whether wchar_t means UTF-16 (multibyte encoding with surrogate pairs) or UCS-2 (wide encoding with no support for characters beyond the BMP).

All the while, a Linux wchar_t is 32 bit, which is wide enough for UTF-32...

C++11 made significant improvements in this area, adding char16_t and char32_t with their associated string types (u16string, u32string) to remove the ambiguity, but it is still not fully equipped for Unicode operations.
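The new types do remove the code-unit ambiguity, but note that UTF-16 still splits rarer characters across two units. A minimal sketch (C++11; the emoji example is mine, not from the question):

#include <iostream>
#include <string>

int main()
{
    // U+1F600 lies beyond the BMP: in UTF-16 it takes a surrogate pair,
    // so the u16string holds two 16-bit code units...
    std::u16string smiley16 = u"\U0001F600";
    std::cout << smiley16.size() << "\n"; // 2

    // ...while UTF-32 needs only a single 32-bit code unit.
    std::u32string smiley32 = U"\U0001F600";
    std::cout << smiley32.size() << "\n"; // 1
}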

Just as one example, try to convert e.g. German "Fuß" to uppercase and you will see what I mean. (The single letter 'ß' would need to expand to 'SS', which the standard functions -- handling one character in, one character out at a time -- cannot do.)
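You can watch that failure happen (a minimal sketch, assuming a de_DE.UTF-8 locale is installed and the usual simple case mapping, under which U+00DF has no single-character uppercase):

#include <clocale>
#include <cstdio>
#include <cwctype>

int main()
{
    std::setlocale( LC_ALL, "de_DE.UTF-8" ); // assumption: locale exists

    // One character in, one character out: towupper() cannot expand
    // U+00DF ('ß') to "SS", so it comes back unchanged.
    wint_t upper = std::towupper( L'\u00DF' );
    std::printf( "%lc\n", upper ); // still prints ß
}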

However, there is help. The International Components for Unicode (ICU) library is fully equipped to handle Unicode in C++. As for specifying special characters in source code, you will have to use u8"", u"", and U"" to enforce interpretation of the string literal as UTF-8, UTF-16, and UTF-32 respectively, using octal / hexadecimal escapes or relying on your compiler implementation to handle non-ASCII-7 encodings appropriately.
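For reference, a quick sketch of those prefixes (note that in C++11 through C++17, u8"" yields plain char):

const char*     utf8  = u8"ÐðŁłŠšÝýÞþŽž"; // UTF-8,  array of char
const char16_t* utf16 = u"ÐðŁłŠšÝýÞþŽž";  // UTF-16, array of char16_t
const char32_t* utf32 = U"ÐðŁłŠšÝýÞþŽž";  // UTF-32, array of char32_t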

And even then you will get an integer value for std::cout << ramp[5], because to C++, a character is just an integer with semantic meaning. ICU's ustream.h provides operator<< overloads for the icu::UnicodeString class, but ramp[5] is just a 16-bit unsigned integer (1), and people would look askance at you if their unsigned shorts were suddenly interpreted as characters. You need the C-API u_fputs() / u_printf() / u_fprintf() functions for that.

#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/ustdio.h>

#include <iostream>

int main()
{
    // make sure your source file is UTF-8 encoded...
    icu::UnicodeString ramp( icu::UnicodeString::fromUTF8( "ÐðŁłŠšÝýÞþŽž" ) );
    std::cout << ramp << "\n";
    std::cout << ramp[5] << "\n";
    u_printf( "%C\n", ramp[5] );
}

Compiled with g++ -std=c++11 testme.cpp -licuio -licuuc.

ÐðŁłŠšÝýÞþŽž
353
š

(1) ICU uses UTF-16 internally, and UnicodeString::operator[] returns a code unit, not a code point, so you might end up with one half of a surrogate pair. Look up the API docs for the various other ways to index a Unicode string.
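To make the code unit / code point distinction concrete, a sketch using the same ICU setup as above (the emoji string is my example, not from the question): length() and operator[] work in code units, while countChar32() and char32At() work in code points.

#include <unicode/unistr.h>
#include <unicode/ustdio.h>

int main()
{
    // U+1F600 needs a surrogate pair in ICU's internal UTF-16.
    icu::UnicodeString s( icu::UnicodeString::fromUTF8( u8"a\U0001F600b" ) );

    u_printf( "%d code units, %d code points\n",
              s.length(), s.countChar32() );               // 4 ... 3
    u_printf( "s[1] is a lone surrogate: %04X\n", s[1] );  // D83D
    u_printf( "char32At(1): %04X\n", s.char32At( 1 ) );    // 1F600
}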

DevSolar
  • Which distribution of ICU is best for Xcode in mac? – Bahman Eslami Jul 17 '15 at 16:02
  • Errr... which *distribution*? I don't get your meaning. Since there's no binary package for the Mac, I'd think you take the latest source release and compile / install that? – DevSolar Jul 17 '15 at 18:32
  • There was binary distribution to install ICU on different platforms, but I guess I'll download the source. Thanks for such a detailed answer. It's a lot of information to digest. – Bahman Eslami Jul 19 '15 at 15:59
  • Some Chinese and Emoji characters do not fit in a single UTF-16 character. – Rick James Apr 09 '17 at 03:55
  • @RickJames: That is what the part about non-BMP characters and UTF-16 surrogate pairs was about, yes. And even if you're using UTF-32 encoding, there's combining characters. – DevSolar Apr 09 '17 at 07:27
  • @DevSolar - "Surrogate pairs" and "Combining characters" are two different issues. The former is for encoding bigger-than-16 bit values using 2 16-bit values. The latter is for having, say, a "combining acute accent" and a letter -- two different codepoints -- to represent an accented letter. Some Chinese and Emoji need Surrogate pairs in UTF-16, but they are not combining characters. – Rick James Apr 09 '17 at 16:01
  • @RickJames: Yes, I know that, and (with regards to surrogate pairs) said as much in my answer. You are commenting on this, so apparently you think this answer needs to be improved in some way. What would that be? – DevSolar Apr 09 '17 at 16:02
  • A definition of "code unit" would help. And when a C++ programmer needs to care about such. When I followed the ICU code, I got stopped by not finding the definition of `UChar` -- is it 16 bits or 32? So, I could not determine whether that operator[], as you say, returns a code unit or point. And what does return a `codepoint`? – Rick James Apr 09 '17 at 16:27
  • (Otherwise, we are probably agreeing with each other.) – Rick James Apr 09 '17 at 16:27
  • @RickJames: Code unit vs. Code point is right there in the footnote. No, this is not a complete primer to Unicode for C++. I meant to write one such for the FAQ, but never got around to actually doing so.... – DevSolar Apr 09 '17 at 17:39

C++ has no useful native Unicode support. You almost certainly will need an external library like ICU.

Puppy

To access code points individually, use u32string, which represents a string as a sequence of UTF-32 code units of type char32_t.

u32string ramp = U"ÐðŁłŠšÝýÞþŽž";
cout << ramp << "\n";    
cout << ramp[5] << "\n";
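(As the comments below note, the cout lines do not compile as written, because std::cout has no operator<< for char32_t. A minimal workaround sketch, converting back to UTF-8 for output via std::wstring_convert -- available since C++11, though deprecated in C++17:)

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::u32string ramp = U"ÐðŁłŠšÝýÞþŽž";

    // Convert UTF-32 back to UTF-8 so std::cout can print it.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::cout << conv.to_bytes( ramp ) << "\n";
    std::cout << conv.to_bytes( ramp[5] ) << "\n"; // š, a whole code point
}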
ecatmur
  • Yes; but you should have mentioned C++11 – Basile Starynkevitch Jul 17 '15 at 12:05
  • 1
    interestingly `cout << ramp << "\n"; ` will not compile with [G++ or Clang++ on coliru](http://coliru.stacked-crooked.com/a/b7e2fdf35b9f259f) – NathanOliver Jul 17 '15 at 12:06
  • 1
    @NathanOliver Rightfully so, a `char32_t` is not a `char`, and that's what `std::cout` handles. – Baum mit Augen Jul 17 '15 at 12:19
  • And since `wcout` handles `wchar_t`, which isn't `char32_t` either on Windows, we kind of see where standard C++ still doesn't handle Unicode all that well. Better than C++98, but you still need ICU if you want to go the whole way. – DevSolar Jul 17 '15 at 12:59

In my opinion, the best solution is to do any string work through iterators. I can't imagine a scenario where one really has to index into a string: if you need indexing like ramp[5] in your example, the 5 is usually computed in another part of the code, and you usually scan all the preceding characters anyway. That's why the Standard Library uses iterators in its API.

A similar problem comes up if you want the size of a string. Should it be the character (or code point) count, or merely the number of bytes? Usually you need the size to allocate a buffer, so the byte count is what you want. You only very, very rarely need a Unicode character count.

If you want to process UTF-8 encoded strings using iterators then I would definitely recommend UTF8-CPP.
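A small sketch of that iterator-based approach, using UTF8-CPP's utf8::distance and utf8::next (header name per the library's documentation; assumes the string is valid UTF-8):

#include <cstdint>
#include <iostream>
#include <string>
#include "utf8.h"

int main()
{
    std::string ramp = "ÐðŁłŠšÝýÞþŽž"; // UTF-8 encoded bytes

    std::cout << "bytes: " << ramp.size() << "\n";                        // 24
    std::cout << "code points: "
              << utf8::distance( ramp.begin(), ramp.end() ) << "\n";     // 12

    // Scan code point by code point instead of indexing bytes.
    for ( auto it = ramp.begin(); it != ramp.end(); )
    {
        uint32_t cp = utf8::next( it, ramp.end() ); // advances 'it'
        std::cout << std::hex << cp << " ";
    }
    std::cout << "\n";
}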

David Makogon

As for what is going on, cplusplus.com makes it clear:

> Note that this class handles bytes independently of the encoding used: If used to handle sequences of multi-byte or variable-length characters (such as UTF-8), all members of this class (such as length or size), as well as its iterators, will still operate in terms of bytes (not actual encoded characters).

As for the solution, the others have it right: ICU if you are not using C++11; u32string if you are.
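A minimal sketch of what that byte-wise behavior means for the question's string (assuming a UTF-8 source and execution encoding):

#include <iostream>
#include <string>

int main()
{
    // Each of the 12 characters takes two bytes in UTF-8...
    std::string ramp = "ÐðŁłŠšÝýÞþŽž";
    std::cout << ramp.size() << "\n"; // ...so this prints 24, not 12.

    // ramp[5] is the sixth *byte*: the trailing byte of 'Ł' (C5 81),
    // i.e. 0x81 -- octal 201, exactly the \201 the question saw.
    unsigned char byte = ramp[5];
    std::cout << static_cast<int>( byte ) << "\n"; // 129
}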

styko
  • Even `u32string` isn't a complete answer, unfortunately -- and space-inefficient as well. I would suggest sticking with ICU even when C++11 is available. – DevSolar Jul 17 '15 at 12:21