17

I have the task of counting the number of perceived characters in an input. The input is a group of ints (we can think of it as an int[]) which represents Unicode code points.

java.text.BreakIterator.getCharacterInstance() is not allowed. (I mean their formula is allowed and is what I wanted, but weaving through their source code and state tables got me nowhere >.<)

I was wondering what's the correct algorithm to count the number of grapheme-clusters given some code points?

Initially, I'd thought that all I have to do is to combine all occurences of:

  1. U+0300 – U+036F (combining diacritical marks)

  2. U+1DC0 – U+1DFF (combining diacritical marks supplement)

  3. U+20D0 – U+20FF (combining diacritical marks for symbols)

  4. U+FE20 - U+FE2F (combining half marks)

into the previous non-diacritic-mark.

However I've realised that prior to that operation, I have to first remove all non-characters as well.

This includes:

  1. U+FDD0 - U+FDEF

  2. The last two code points of every plane

But there seems to be more things to do. Unicode.org states we need to include U+200C (zero-width non joiner) and U+200D (zero width joiner) as part of the set of continuing characters (source).

Besides that, it talks about a couple more things but the entire topic is treated in an abstract way. For example, what are the code point ranges for spacing combining marks, hangul jamo characters that forms hangul syllables?

Does anyone know the correct algorithm to count the number of grapheme-clusters given an int[] of code points?

ROMANIA_engineer
  • 54,432
  • 29
  • 203
  • 199
Pacerier
  • 86,231
  • 106
  • 366
  • 634
  • In [this answer](http://stackoverflow.com/a/6806203/68063) I gave a reference to the Unicode algorithm for splitting text into grapheme clusters, and demonstrated how to implement some of it in Python. – Gareth Rees Feb 01 '12 at 14:36
  • @GarethRees I think that there's many important steps missing in that answer to do a correct *user-perceived-character* check (correct me if I'm wrong). – Pacerier Feb 01 '12 at 14:48
  • Can you use the ICU `BreakIterator`? http://icu-project.org/apiref/icu4j/com/ibm/icu/text/BreakIterator.html – Joni Feb 01 '12 at 16:31
  • @JoniSalonen No *higher-level* functions which masks the implementation details, that would defeat the purpose, unless of course they could show their algorithm-breakdown. – Pacerier Feb 01 '12 at 16:34
  • Don't forget the "Regional Indicator Symbols", U+1F1E6 to U+1F1FF, those are supposed to be combined in pairs to make flag emojis (e.g. "R" + "U" for a Russian flag), but java.text.BreakIterator.getCharacterInstance() at least returns the two code points separately. – Stefan L May 23 '14 at 12:57
  • Grapheme clusters are defined at http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries – Stefan L May 23 '14 at 13:08
  • ICU4J's com.ibm.icu.text.BreakIterator _does_ support regional indicator symbol pairs unlike the plain java.text.BreakIterator from Java 7. – Stefan L May 23 '14 at 13:45

1 Answers1

3

There's not a single canonical method appropriate to all uses, but a good starting point is the Unicode Grapheme Cluster Boundary algorithm on the Unicode.org page you link to. Basically, Unicode provides a database of each code point's grapheme break property, and then describes an algorithm to decide if a grapheme break is allowed between two code points based on their assigned grapheme break properties.

Here's part of an implementation (in C++) I played around with a while ago:

bool BoundaryAllowed(char32_t cp, char32_t cp2) {
  // lbp: left break property; rbp: right break property
  auto lbp = get_property_for_codepoint(cp),
       rbp = get_property_for_codepoint(cp2);

  // Do not break between a CR and LF. Otherwise, break before and after
  // controls.
  if ((CR == lbp && LF == rbp)) {
    // The Unicode grapheme boundary algorithm does not handle LFCR new lines
    return false;
  }

  if (Control == lbp || CR == lbp || LF == lbp || Control == rbp || CR == rbp ||
      LF == rbp) {
    return true;
  }

  // Do not break Hangul syllable sequences.
  if ((L == lbp && (L == rbp || V == rbp || LV == rbp || LVT == rbp)) ||
      ((LV == lbp || V == lbp) && (V == rbp || T == rbp)) ||
      ((LVT == lbp || T == lbp) && (T == rbp))) {
    return false;
  }

  // Do not break before extending characters.
  if (Extend == rbp) {
    return false;
  }

  // Do not break before SpacingMarks, or after Prepend characters.
  if (Prepend == lbp || SpacingMark == rbp) {
    return false;
  }

  return true; // Otherwise, break everywhere.
}

In order to obtain the ranges for the different types of codepoints you'll just have to look at the Unicode Character Database. The file with the grapheme break properties, which describes them in terms of the ranges, is about 1200 lines long: http://www.unicode.org/Public/6.1.0/ucd/auxiliary/

I'm not really sure how much value there is in ignoring non-character code points, but if your use requires it then you'll add that in to your implementation.

bames53
  • 86,085
  • 15
  • 179
  • 244
  • Do you understand why is it that `U+0599` is listed in Mark/NonSpace Combining category however isn't one of the combining characters? Basically when determining grapheme clusters, should we use the data from combining character or for break properties, or both? – Pacerier Feb 01 '12 at 17:01
  • I think the combining character ranges you listed don't include all combining characters for all languages. It looks like they're only ones that are pretty general and apply across languages while U+0599 is specific to Hebrew. I would use the grapheme break properties since it should be a comprehensive list, while if you only look at Unicode blocks whose names includes "COMBINING" it looks like you'll miss ~90% of them. – bames53 Feb 01 '12 at 17:35