
I want to iterate over each character of a Unicode string, treating each surrogate pair and combining character sequence as a single unit (one grapheme).

Example

The text "नमस्ते" is comprised of the code points U+0928, U+092E, U+0938, U+094D, U+0924, U+0947, of which U+094D and U+0947 are combining marks.

static void Main(string[] args)
{
    const string s = "नमस्ते";

    Console.WriteLine(s.Length); // Outputs "6"

    var l = 0;
    var e = System.Globalization.StringInfo.GetTextElementEnumerator(s);
    while(e.MoveNext()) l++;
    Console.WriteLine(l); // Outputs "4"
}

So there we have it in .NET. We also have Win32's CharNextW():

#include <Windows.h>
#include <iostream>
#include <string>

int main()
{
    const wchar_t * s = L"नमस्ते";

    std::cout << std::wstring(s).length() << std::endl; // Gives "6"

    int l = 0;
    while(CharNextW(s) != s)
    {
        s = CharNextW(s);
        ++l;
    }

    std::cout << l << std::endl; // Gives "4"

    return 0;
}

Question

Both ways I know of are specific to Microsoft. Are there portable ways to do it?

  • I have heard about ICU, but I couldn't quickly find anything relevant (UnicodeString(s).length() still gives 6). Pointing to the relevant function or module in ICU would be an acceptable answer.
  • C++ doesn't have a notion of Unicode, so a lightweight cross-platform library for dealing with these issues would also be an acceptable answer.

Edit: Correct answer using ICU

@McDowell gave the hint to use BreakIterator from ICU, which I think can be regarded as the de facto cross-platform standard for dealing with Unicode. Here is some example code demonstrating its use (since examples are surprisingly rare):

#include <unicode/schriter.h>
#include <unicode/brkiter.h>

#include <iostream>
#include <cassert>
#include <memory>

int main()
{
    const UnicodeString str(L"नमस्ते");

    {
        // StringCharacterIterator doesn't seem to recognize graphemes
        StringCharacterIterator iter(str);
        int count = 0;
        while(iter.hasNext())
        {
            ++count;
            iter.next();
        }
        std::cout << count << std::endl; // Gives "6"
    }

    {
        // BreakIterator works!!
        UErrorCode err = U_ZERO_ERROR;
        std::unique_ptr<BreakIterator> iter(
            BreakIterator::createCharacterInstance(Locale::getDefault(), err));
        assert(U_SUCCESS(err));
        iter->setText(str);

        int count = 0;
        while(iter->next() != BreakIterator::DONE) ++count;
        std::cout << count << std::endl; // Gives "4"
    }

    return 0;
}
Frank
kizzx2
  • Your title should read: "Cross Platform iteration of UTF-16 string" – Chris Becke Jan 02 '11 at 16:16
  • @Chris: the problem is not specific to UTF-16, though, so it's more like "unicode non-utf32" :) – Roman L Jan 02 '11 at 16:22
  • surrogate pairs are a utf-16 artifact. utf-8 just codes the final codepoint. – Chris Becke Jan 02 '11 at 16:36
  • This question involves combining marks. This feature allows text to contain composite glyphs in more than one natural way. Text that contains accented characters can represent those graphemes either as an unaccented character followed by a combining mark that adds the accent to the previous character, or as a single code point, if there is one, for the accented character. This is totally different from surrogates in UTF-16; combining marks apply equally to all Unicode encodings. – SingleNegationElimination Jan 02 '11 at 16:57
  • ICU's CharacterIterator, I think. Like StringCharacterIterator. – Hans Passant Jan 02 '11 at 17:00
  • (Ugh "comprised of". http://en.wiktionary.org/wiki/comprise#Verb) – aschepler Jan 02 '11 at 17:53
  • `UnicodeString(s).length()` returns 6 because the string consists of six 16-bit code units. – dalle Jan 02 '11 at 20:00
  • `const UnicodeString str(L"नमस्ते")` will probably only work when `UChar` is `wchar_t`. – abergmeier Apr 27 '14 at 09:46
  • Please note: starting in C++11, you should favor `u"..."` strings over `L"..."` strings: `const UnicodeString str(u"नमस्ते")` – sffc Jul 07 '22 at 16:58
  • ICU's `StringCharacterIterator` can operate on [either code points or code units](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1ForwardCharacterIterator.html#details). _Code points_ are the Unicode characters whereas _code units_ are the storage units, i.e. bytes for utf-8. You're using `StringCharacterIterator::next()` which returns the code unit, whereas [`StringCharacterIterator::next32()`](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1UCharCharacterIterator.html#ab4e41dae5d4ae832b473ebf09d98d48b) would return the Unicode code point. – decocijo Nov 27 '22 at 17:03

3 Answers


You should be able to use the ICU BreakIterator for this (the character instance, assuming it is feature-equivalent to the Java version).

McDowell

glibmm's Glib::ustring class gives you UTF-8 strings, if using UTF-8 is OK for you. It is designed to be similar to std::string. Since UTF-8 is native on Linux, your task is quite easy:

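#include <glibmm/ustring.h>
#include <iostream>

int main()
{
    Glib::ustring s = "नमस्ते"; // narrow literal; the source file must be UTF-8
    std::cout << s.size();     // number of characters (code points), not bytes
}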

You can also iterate over the string's characters as usual with Glib::ustring::iterator.

davka
  • This will eliminate surrogate pair issues, but doesn't deal with combining characters at all. – aschepler Jan 02 '11 at 16:58
  • @aschepler: could you explain what you mean by combining characters please? – davka Jan 02 '11 at 17:06
  • http://en.wikipedia.org/wiki/Combining_character . The example string here has 6 code points (no surrogate pairs are involved). 2 of them are combining characters, so the string has 4 graphemes. @kizzx2 wants to iterate over those graphemes. – aschepler Jan 02 '11 at 17:14

ICU has a very old interface; Boost.Locale, which uses ICU as its backend for boundary analysis, offers a more modern one:

#include <iostream>
#include <string_view>

#include <boost/locale.hpp>

using namespace std::string_view_literals;

int main()
{
    boost::locale::generator gen;
    auto string = "noël "sv;
    boost::locale::boundary::csegment_index map{
        boost::locale::boundary::character, string.data(),
        string.data() + string.size(), gen("")};
    for (const auto& i : map)
    {
        std::cout << i << '\n';
    }
}
