First, a few general points:
- I suspect that hand coding your requirement for all possible cases is non-trivial. For example, how does Character.isUnicodeIdentifierStart() handle right-to-left Arabic text, and how should you handle data that is meaningless (for example, not valid Unicode)?
- Therefore, use an existing library that has (hopefully!) already catered for such issues. The JDK class java.text.BreakIterator should do exactly what you want, and there is helpful documentation on its use in Oracle's Java Tutorials, in the Detecting Text Boundaries section.
- Also, Unicode Standard Annex #29, Unicode Text Segmentation, goes into great detail on how to determine grapheme boundaries. See section 3, Grapheme Cluster Boundaries.
- Though not mentioned in your question, it is important to specify a language for the text being processed using a Locale, since some boundary rules are language dependent.
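To illustrate why hand coding is error-prone, here is a minimal sketch comparing three ways of "counting characters" for the Tamil sample க் from the question. The class name CountComparison and the helper graphemeCount are made up for this illustration. The exact boundaries BreakIterator reports can vary between JDK versions, so treat it as a sketch and not a definitive result for all text.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class CountComparison {

    // Counts the boundaries reported by BreakIterator's character instance.
    static int graphemeCount(Locale loc, String text) {
        BreakIterator bi = BreakIterator.getCharacterInstance(loc);
        bi.setText(text);
        bi.first(); // start scanning from the first boundary
        int count = 0;
        while (bi.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Locale tamil = new Locale.Builder().setLanguage("ta").setRegion("IN").build();
        String ka = "\u0b95\u0bcd"; // TAMIL LETTER KA followed by TAMIL SIGN VIRAMA

        System.out.println("UTF-16 code units: " + ka.length());
        System.out.println("Code points:       " + ka.codePointCount(0, ka.length()));
        System.out.println("Graphemes:         " + graphemeCount(tamil, ka));
    }
}
```

Neither String.length() nor the code point count matches what a reader would see as a single character here, which is why delegating to BreakIterator is attractive.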
Here's code that counts the graphemes of the sample data provided in the OP, plus some Arabic text, using the BreakIterator class:
package graphemecounter;

import java.text.BreakIterator;
import java.util.Locale;

public class GraphemeCounter {

    public static void main(String[] args) {
        // Declare the texts to be processed.
        String houseInArabic = "\u0628" + "\u064e" + "\u064a" + "\u0652" + "\u067a" + "\u064f";
        String[] input = new String[]{"人物", "Χαρακτήρες", "पात्र", "எழுத்துக்குறிகள்", "キャラクター", "க்", houseInArabic};

        // Associate a locale with each of the texts to be processed.
        Locale[] locales = new Locale[] {
            Locale.CHINESE,
            new Locale.Builder().setLanguage("el").setRegion("GR").build(), // "el" is the ISO 639 code for Greek
            new Locale.Builder().setLanguage("hi").setRegion("IN").build(),
            new Locale.Builder().setLanguage("ta").setRegion("IN").build(),
            Locale.JAPANESE,
            new Locale.Builder().setLanguage("ta").setRegion("IN").build(),
            new Locale.Builder().setLanguage("ar").build()
        };

        for (int i = 0; i < input.length; i++) {
            int count = GraphemeCounter.getGraphemesFromText(locales[i], input[i]);
            System.out.println("Grapheme count for [" + input[i] + "] is " + count);
            System.out.println("=======================================");
        }
    }

    // Walks the character boundaries of the text, printing and counting each grapheme.
    public static int getGraphemesFromText(Locale loc, String text) {
        System.out.println("Sample data: " + text);
        BreakIterator bi = BreakIterator.getCharacterInstance(loc);
        bi.setText(text);
        int graphemeCount = 0;
        int prev;
        int next = bi.first();
        while (next != BreakIterator.DONE) {
            prev = next;
            next = bi.next();
            if (next != BreakIterator.DONE) {
                graphemeCount++;
                // The grapheme is the text between two consecutive boundaries.
                String grapheme = text.substring(prev, next);
                System.out.println("Boundary detected: prev=" + prev + ", next=" + next + ", grapheme=[" + grapheme + "]");
            }
        }
        return graphemeCount; // Amend to return a list of graphemes instead, to get a total for each grapheme.
    }
}
Here's the output from running that code:
run:
Sample data: 人物
Boundary detected: prev=0, next=1, grapheme=[人]
Boundary detected: prev=1, next=2, grapheme=[物]
Grapheme count for [人物] is 2
=======================================
Sample data: Χαρακτήρες
Boundary detected: prev=0, next=1, grapheme=[Χ]
Boundary detected: prev=1, next=2, grapheme=[α]
Boundary detected: prev=2, next=3, grapheme=[ρ]
Boundary detected: prev=3, next=4, grapheme=[α]
Boundary detected: prev=4, next=5, grapheme=[κ]
Boundary detected: prev=5, next=6, grapheme=[τ]
Boundary detected: prev=6, next=7, grapheme=[ή]
Boundary detected: prev=7, next=8, grapheme=[ρ]
Boundary detected: prev=8, next=9, grapheme=[ε]
Boundary detected: prev=9, next=10, grapheme=[ς]
Grapheme count for [Χαρακτήρες] is 10
=======================================
Sample data: पात्र
Boundary detected: prev=0, next=2, grapheme=[पा]
Boundary detected: prev=2, next=5, grapheme=[त्र]
Grapheme count for [पात्र] is 2
=======================================
Sample data: எழுத்துக்குறிகள்
Boundary detected: prev=0, next=1, grapheme=[எ]
Boundary detected: prev=1, next=2, grapheme=[ழ]
Boundary detected: prev=2, next=3, grapheme=[ு]
Boundary detected: prev=3, next=5, grapheme=[த்]
Boundary detected: prev=5, next=6, grapheme=[த]
Boundary detected: prev=6, next=7, grapheme=[ு]
Boundary detected: prev=7, next=9, grapheme=[க்]
Boundary detected: prev=9, next=10, grapheme=[க]
Boundary detected: prev=10, next=11, grapheme=[ு]
Boundary detected: prev=11, next=12, grapheme=[ற]
Boundary detected: prev=12, next=13, grapheme=[ி]
Boundary detected: prev=13, next=14, grapheme=[க]
Boundary detected: prev=14, next=16, grapheme=[ள்]
Grapheme count for [எழுத்துக்குறிகள்] is 13
=======================================
Sample data: キャラクター
Boundary detected: prev=0, next=1, grapheme=[キ]
Boundary detected: prev=1, next=2, grapheme=[ャ]
Boundary detected: prev=2, next=3, grapheme=[ラ]
Boundary detected: prev=3, next=4, grapheme=[ク]
Boundary detected: prev=4, next=5, grapheme=[タ]
Boundary detected: prev=5, next=6, grapheme=[ー]
Grapheme count for [キャラクター] is 6
=======================================
Sample data: க்
Boundary detected: prev=0, next=2, grapheme=[க்]
Grapheme count for [க்] is 1
=======================================
Sample data: بَيْٺُ
Boundary detected: prev=0, next=2, grapheme=[بَ]
Boundary detected: prev=2, next=4, grapheme=[يْ]
Boundary detected: prev=4, next=6, grapheme=[ٺُ]
Grapheme count for [بَيْٺُ] is 3
=======================================
BUILD SUCCESSFUL (total time: 0 seconds)
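The comment on the return statement in getGraphemesFromText suggests returning a list of graphemes to get a total for each one. A minimal sketch of that amendment might look like the following. The class name GraphemeFrequency and the methods splitGraphemes and countGraphemes are hypothetical names chosen for this illustration. They are not part of the JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class GraphemeFrequency {

    // Returns the substrings found between consecutive BreakIterator boundaries.
    static List<String> splitGraphemes(Locale loc, String text) {
        BreakIterator bi = BreakIterator.getCharacterInstance(loc);
        bi.setText(text);
        List<String> graphemes = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            graphemes.add(text.substring(start, end));
        }
        return graphemes;
    }

    // Tallies how often each grapheme occurs, keeping first-seen order.
    static Map<String, Integer> countGraphemes(Locale loc, String text) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String grapheme : splitGraphemes(loc, text)) {
            totals.merge(grapheme, 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Locale greek = new Locale.Builder().setLanguage("el").setRegion("GR").build();
        // Prints a map from each grapheme to its number of occurrences.
        System.out.println(countGraphemes(greek, "Χαρακτήρες"));
    }
}
```

For the Greek sample, the resulting map should report α and ρ twice each, in line with the boundaries listed in the output above. A LinkedHashMap is used only so that the totals print in the order the graphemes first appear.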
Notes:
- I used font Arial Unicode MS for both the code and the output. It was the only one I could find that supported all of those alphabets.
- There are alternative ways to solve this issue, including using third-party libraries and regular expressions, but this approach is the simplest.
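As a rough sketch of the regular expression alternative, java.util.regex.Pattern supports \X, which matches one Unicode extended grapheme cluster. This assumes Java 9 or later. The class name RegexGraphemeCounter is made up for this illustration, and the counts it produces are not guaranteed to agree with BreakIterator on every JDK version, since the two may apply different rule sets.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only. Assumes Java 9+.
class RegexGraphemeCounter {

    // \X matches a single Unicode extended grapheme cluster.
    private static final Pattern GRAPHEME = Pattern.compile("\\X");

    // Counts how many grapheme clusters the pattern finds in the text.
    static int countGraphemes(String text) {
        Matcher matcher = GRAPHEME.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String ka = "\u0b95\u0bcd"; // TAMIL LETTER KA followed by TAMIL SIGN VIRAMA
        System.out.println("Grapheme count: " + countGraphemes(ka));
    }
}
```

Note that this variant takes no Locale, so any language-specific tailoring is not applied.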