JDK 15 added support for extended grapheme clusters to the java.util.regex
package. Here’s a solution based on that:
/** Returns the number of grapheme clusters within `text` between positions
* `start` and `end`. Omits any partial cluster at the end of the span.
*/
int columnarSpan( String text, int start, int end ) {
return columnarSpan( text, start, end, /*wholeOnly*/true ); }
/** @param wholeOnly Whether to omit any partial cluster at the end
* of the span. Iff `true` and `end` bisects the final cluster,
* then the final cluster is omitted from the count.
*/
int columnarSpan( final String text, final int start, final int end,
final boolean wholeOnly ) {
graphemeMatcher.reset( text ).region( start, end );
int count = 0;
while( graphemeMatcher.find() ) ++count;
if( wholeOnly && count > 0 && end < text.length() ) {
final int countNext = columnarSpan( text, start, end + 1, false );
if( countNext == count ) --count; } /* The character at `end` bisects
the final cluster, which therefore lies partly outside the span.
Therefore exclude it from the count. */
return count; }
final Matcher graphemeMatcher = graphemePattern.matcher( "" );
/** The pattern of a grapheme cluster.
*/
public static final Pattern graphemePattern = Pattern.compile( "\\X" ); } /*
An alternative means of cluster discovery is `java.text.BreakIterator`.
Long outdated in this regard, [https://bugs.openjdk.org/browse/JDK-8174266]
it was updated for JDK 20. [https://stackoverflow.com/a/76109241/2402790] */
Call it like this:
String emoji = "";
int count = columnarSpan( emoji, 0, /*end*/emoji.length() );
System.out.println( count );
⇒ 2
Note that it counts whole clusters only. If the given end bisects the final cluster (that is, the character at position end belongs to the same extended cluster as the preceding character), then the final cluster is omitted from the count. For example:
int count = columnarSpan( emoji, 0, /*end*/emoji.length() - 1 );
System.out.println( count );
⇒ 1
This is generally the behaviour you want in order to print a line of text with a character pointer positioned beneath it (e.g. ‘^’) pointing into the cluster of the character at the given index. To defeat this behaviour (so that the pointer lands after the cluster), call the four-argument base method with wholeOnly set to false, as follows.
int count = columnarSpan( emoji, 0, /*end*/emoji.length() - 1, false );
System.out.println( count );
⇒ 2
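To try the three calls above in one place, here is a minimal self-contained sketch. It is an adaptation rather than the exact code from the answer: the class name `ColumnarSpanDemo` is hypothetical, the method is made static, and a fresh matcher is created per call instead of reusing a field. The sample string is also an assumption, chosen so that its last cluster spans two chars ('e' followed by a combining acute accent).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColumnarSpanDemo {

    /** The pattern of a grapheme cluster. */
    public static final Pattern graphemePattern = Pattern.compile("\\X");

    /** Counts grapheme clusters in `text` between `start` and `end`.
      * If `wholeOnly` is true and `end` bisects the final cluster,
      * that cluster is left out of the count.
      */
    public static int columnarSpan(String text, int start, int end, boolean wholeOnly) {
        Matcher m = graphemePattern.matcher(text).region(start, end);
        int count = 0;
        while (m.find()) ++count;
        if (wholeOnly && count > 0 && end < text.length()) {
            // If widening the span by one char adds no cluster,
            // then `end` falls inside the final cluster.
            if (columnarSpan(text, start, end + 1, false) == count) --count;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed sample: 'a', then 'e' + U+0301 (one cluster, two chars).
        String sample = "ae\u0301";
        System.out.println(columnarSpan(sample, 0, sample.length(), true));      // 2
        System.out.println(columnarSpan(sample, 0, sample.length() - 1, true));  // 1
        System.out.println(columnarSpan(sample, 0, sample.length() - 1, false)); // 2
    }
}
```

Creating the matcher per call costs a small allocation but avoids sharing mutable matcher state between callers, which may matter if the method is used from more than one thread.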
(Updated as per Skomisa’s comment.)
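For comparison, the `java.text.BreakIterator` alternative mentioned in the code comment above might look roughly like the sketch below. The class and method names are hypothetical, and how well the iterator handles emoji sequences depends on the JDK version, as that comment notes. This version counts any partial cluster at the end of the span, so it corresponds to calling the base method with wholeOnly set to false.

```java
import java.text.BreakIterator;

public class BreakIteratorSpan {

    /** Counts the character boundaries that BreakIterator finds in
      * `text` between `start` and `end`. A partial cluster at the
      * end of the span is counted too.
      */
    public static int clusterCount(String text, int start, int end) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text.substring(start, end));
        int count = 0;
        // After setText the iterator sits at the first boundary;
        // each successful next() steps over one cluster.
        while (it.next() != BreakIterator.DONE) ++count;
        return count;
    }

    public static void main(String[] args) {
        // Assumed sample: 'a', then 'e' + U+0301 (one cluster, two chars).
        String sample = "ae\u0301";
        System.out.println(clusterCount(sample, 0, sample.length())); // 2
    }
}
```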