I have the task of counting the number of perceived characters in an input. The input is a group of ints (we can think of it as an int[]) which represents Unicode code points.
Using java.text.BreakIterator.getCharacterInstance() is not allowed. (Its algorithm is allowed and is exactly what I want, but wading through its source code and state tables got me nowhere >.<)
What is the correct algorithm for counting the number of grapheme clusters given some code points?
Initially, I'd thought that all I had to do was combine all occurrences of:
U+0300 – U+036F (combining diacritical marks)
U+1DC0 – U+1DFF (combining diacritical marks supplement)
U+20D0 – U+20FF (combining diacritical marks for symbols)
U+FE20 – U+FE2F (combining half marks)
into the previous non-diacritic mark.
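To make that first attempt concrete, here is a minimal sketch of it. The class and method names (NaiveClusterCount, isCombiningMark, countClusters) are made up for illustration, and it only knows about the four blocks listed above, so it is not a full grapheme cluster implementation.

```java
class NaiveClusterCount {
    // True if cp lies in one of the four combining-mark blocks listed above.
    static boolean isCombiningMark(int cp) {
        return (cp >= 0x0300 && cp <= 0x036F)   // combining diacritical marks
            || (cp >= 0x1DC0 && cp <= 0x1DFF)   // ... supplement
            || (cp >= 0x20D0 && cp <= 0x20FF)   // ... for symbols
            || (cp >= 0xFE20 && cp <= 0xFE2F);  // combining half marks
    }

    // Counts one cluster per code point that is not a combining mark.
    // Assumption: a combining mark at the very start has nothing to attach to,
    // so it is counted as a cluster of its own.
    static int countClusters(int[] codePoints) {
        int count = 0;
        for (int i = 0; i < codePoints.length; i++) {
            if (i == 0 || !isCombiningMark(codePoints[i])) {
                count++;
            }
        }
        return count;
    }
}
```

With this, the sequence U+0065 U+0301 U+0062 (e, combining acute accent, b) is counted as two clusters rather than three.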
However, I've realised that before that operation, I first have to remove all noncharacters as well.
This includes:
U+FDD0 - U+FDEF
The last two code points of every plane
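As a sketch of that filtering step: the check below covers U+FDD0 to U+FDEF plus the code points ending in FFFE or FFFF in each of the 17 planes. The names Noncharacters, isNoncharacter and strip are hypothetical, and whether noncharacters should be stripped before counting at all is my own assumption.

```java
import java.util.Arrays;

class Noncharacters {
    // True for U+FDD0..U+FDEF and for the last two code points of every plane
    // (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            return false; // not a Unicode code point at all
        }
        return (cp >= 0xFDD0 && cp <= 0xFDEF)
            || (cp & 0xFFFE) == 0xFFFE; // low 16 bits are FFFE or FFFF
    }

    // Returns a copy of the input with all noncharacters removed.
    static int[] strip(int[] codePoints) {
        return Arrays.stream(codePoints)
                     .filter(cp -> !isNoncharacter(cp))
                     .toArray();
    }
}
```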
But there seems to be more to do. Unicode.org states that we need to include U+200C (zero-width non-joiner) and U+200D (zero-width joiner) as part of the set of continuing characters (source).
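One way I could fold these two code points in is to keep the continuing characters in a range table, so that further ranges can be added as I find them. This is only a sketch: the names ContinuingSet, CONTINUING, isContinuing and countClusters are made up, and the table is known to be incomplete.

```java
class ContinuingSet {
    // Inclusive {start, end} ranges treated as continuing characters.
    // Assumption: only the ranges mentioned in this question; incomplete.
    static final int[][] CONTINUING = {
        {0x0300, 0x036F}, // combining diacritical marks
        {0x1DC0, 0x1DFF}, // combining diacritical marks supplement
        {0x200C, 0x200D}, // zero-width non-joiner, zero-width joiner
        {0x20D0, 0x20FF}, // combining diacritical marks for symbols
        {0xFE20, 0xFE2F}, // combining half marks
    };

    static boolean isContinuing(int cp) {
        for (int[] range : CONTINUING) {
            if (cp >= range[0] && cp <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // A continuing character extends the current cluster; anything else
    // (or a continuing character at the very start) begins a new one.
    static int countClusters(int[] codePoints) {
        int count = 0;
        for (int i = 0; i < codePoints.length; i++) {
            if (i == 0 || !isContinuing(codePoints[i])) {
                count++;
            }
        }
        return count;
    }
}
```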
Besides that, it talks about a couple more things, but the entire topic is treated in an abstract way. For example, what are the code point ranges for spacing combining marks, or for the Hangul jamo characters that form Hangul syllables?
Does anyone know the correct algorithm to count the number of grapheme clusters given an int[] of code points?