I'm looking to count the number of perceived emoji characters in a provided Java string. I'm currently using the emoji4j library, but it doesn't work for grapheme clusters like this one: 👩‍👩‍👦‍👦

Calling EmojiUtil.getLength("👩‍👩‍👦‍👦") returns 4 instead of 1, and similarly calling EmojiUtil.getLength("👻👩‍👩‍👦‍👦") returns 5 instead of 2.

Are there any APIs or methods on String in Java that make it easy to count grapheme clusters?

I've been hunting around, but the codePoints() method on a String understandably counts not only the visible emojis but also the zero-width joiners.
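
To make the mismatch concrete, here is a minimal sketch that breaks the family emoji down into UTF-16 units, code points and joiners. The class name `CodePointDemo` is only a placeholder, and the string is written with Unicode escapes so that it survives any source encoding:

```java
final class CodePointDemo {
    // WOMAN, ZWJ, WOMAN, ZWJ, BOY, ZWJ, BOY: rendered as a single family emoji.
    static final String FAMILY =
        "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC66\u200D\uD83D\uDC66";

    // Number of Unicode code points (surrogate pairs count once).
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of zero-width joiners (U+200D) in the string.
    static long joinerCount(String s) {
        return s.codePoints().filter(cp -> cp == 0x200D).count();
    }

    public static void main(String[] args) {
        System.out.println("UTF-16 units: " + FAMILY.length());
        System.out.println("code points:  " + codePointCount(FAMILY));
        System.out.println("joiners:      " + joinerCount(FAMILY));
    }
}
```

For this string the three values are 11, 7 and 3, none of which is the single character a reader perceives.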

I also attempted this using the BreakIterator:

public static int getLength(String emoji) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(emoji);
    int emojiCount = 0;
    while (it.next() != BreakIterator.DONE) {
        emojiCount++;
    }
    return emojiCount;
}

But it seems to behave identically to the codePoints() method, returning 8 for something like "👻👩‍👩‍👦‍👦".

Craig Otis
  • Interesting topic. I tried to find out what kind of character this is (your first example) and I'm wondering if the combinations that are considered these combined emojis are real Unicode standards or conventions adopted by vendors. Your first example is a combination of the Unicode characters for Woman, Woman, Boy, Boy combined with zero-width joiners. http://emojipedia.org/emoji/%F0%9F%91%A9%E2%80%8D%F0%9F%91%A9%E2%80%8D%F0%9F%91%A6%E2%80%8D%F0%9F%91%A6/ – Erwin Bolwidt Nov 30 '16 at 02:03
  • One way of combining characters into emojis is using the zero-width-joiner code point (ZWJ / U+200D). So one way to get the count of visible characters is to go over all Unicode code points and, whenever you encounter the ZWJ, subtract two (for the ZWJ and for the next character, which is merged into the previous character). However, there are more ways to compose emojis (and Unicode characters), so your best bet is to wait for emoji4j to update or to do it yourself. – Erwin Bolwidt Nov 30 '16 at 03:05
  • Possible duplicate of [What's the correct algorithm to determine number of user-perceived-characters?](http://stackoverflow.com/questions/9097572/whats-the-correct-algorithm-to-determine-number-of-user-perceived-characters) – Erwin Bolwidt Nov 30 '16 at 04:58
  • Doesn't look like Java supports counting Grapheme Clusters (perceived characters). So the above question/answer should still be valid. – Erwin Bolwidt Nov 30 '16 at 04:59
  • It's a bit different - that question (and some of the comments on it) are avoiding the use of any higher-level functions or third-party bits like the ICU library. (They want to go from `int[]` to emoji count.) I'm just working with Strings and happy to use whatever resources are available. In preliminary testing it looks like the ICU library might work - I'll make sure then add an answer. – Craig Otis Nov 30 '16 at 13:22
  • I tried - didn't know about the ICU library - and it works; you can just use your code above that uses BreakIterator, as long as you use ICU's version of BreakIterator rather than the Java API version. Surprising. I'm reading that even JDK 9 is planned to support only Unicode version 8 rather than 9, which is the newest version. So ICU is the way forward. – Erwin Bolwidt Nov 30 '16 at 14:23
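
The ZWJ-skipping idea from the comments above can be sketched as follows. This is only an illustration of that heuristic (the class and method names are made up here). As the comment itself warns, it ignores every other way of composing a cluster, such as skin-tone modifiers, variation selectors, flags and combining marks:

```java
final class ZwjHeuristic {
    private static final int ZWJ = 0x200D;

    // Counts code points, but treats "ZWJ + following code point" as part
    // of the preceding character. A heuristic, not a grapheme segmenter.
    static int countPerceived(String s) {
        int count = 0;
        boolean joinNext = false;
        for (int cp : s.codePoints().toArray()) {
            if (cp == ZWJ) {          // the joiner itself is invisible
                joinNext = true;
                continue;
            }
            if (joinNext) {           // glued onto the previous character
                joinNext = false;
                continue;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String family =
            "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC66\u200D\uD83D\uDC66";
        String ghostAndFamily = "\uD83D\uDC7B" + family;
        System.out.println(countPerceived(family));
        System.out.println(countPerceived(ghostAndFamily));
    }
}
```

For the two strings from the question this yields 1 and 2, but only because they are built purely from ZWJ sequences.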

3 Answers

I ended up using the ICU library (ICU4J), which worked much better. No changes (aside from import statements) were needed from my original code block, as it simply provides a different implementation of BreakIterator: com.ibm.icu.text.BreakIterator instead of java.text.BreakIterator.

Craig Otis

JDK 15 added support for extended grapheme clusters to the java.util.regex package. Here’s a solution based on that:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// `Graphemes` is a placeholder name; the original snippet omits its enclosing class.
public class Graphemes {

    /** The pattern of an extended grapheme cluster.
      */
    public static final Pattern graphemePattern = Pattern.compile( "\\X" );


    /** Returns the number of grapheme clusters within `text` between positions
      * `start` and `end`.  Omits any partial cluster at the end of the span.
      */
    static int columnarSpan( String text, int start, int end ) {
        return columnarSpan( text, start, end, /*wholeOnly*/true ); }


    /** @param wholeOnly Whether to omit any partial cluster at the end
      *   of the span.  Iff `true` and `end` bisects the final cluster,
      *   then the final cluster is omitted from the count.
      */
    static int columnarSpan( final String text, final int start, final int end,
          final boolean wholeOnly ) {
        // A fresh matcher per call keeps the method free of shared mutable state.
        final Matcher m = graphemePattern.matcher( text ).region( start, end );
        int count = 0;
        while( m.find() ) ++count;
        if( wholeOnly  &&  count > 0  &&  end < text.length() ) {
            /* If extending the span by one character adds no cluster, then the
               character at `end` bisects the final cluster, which therefore lies
               partly outside the span.  Therefore exclude it from the count. */
            final int countNext = columnarSpan( text, start, end + 1, false );
            if( countNext == count ) --count; }
        return count; }}

/* An alternative means of cluster discovery is `java.text.BreakIterator`.
   Long outdated in this regard,  [https://bugs.openjdk.org/browse/JDK-8174266]
   it was updated for JDK 20.  [https://stackoverflow.com/a/76109241/2402790] */

Call it like this:

String emoji = "👻👩‍👩‍👦‍👦";
int count = columnarSpan( emoji, 0, /*end*/emoji.length() );
System.out.println( count );

⇒ 2

Note that it counts whole clusters only. If the given end bisects the final cluster — the character at position end being part of the same extended cluster as the preceding character — then the final cluster is omitted from the count. For example:

int count = columnarSpan( emoji, 0, /*end*/emoji.length() - 1 );
System.out.println( count );

⇒ 1

This is generally the behaviour you want in order to print a line of text with a character pointer positioned beneath it (e.g. ‘^’) pointing into the cluster of the character at the given index. To defeat this behaviour (pointing after the cluster), call the base method as follows.

int count = columnarSpan( emoji, 0, /*end*/emoji.length() - 1, false );
System.out.println( count );

⇒ 2
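
When the clusters themselves are needed rather than a count, the same `\X` pattern can be streamed with `Matcher.results()` (available since Java 9). A minimal sketch, in which `GraphemeSplit` and `clusters` are illustrative names and not part of the code above:

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class GraphemeSplit {
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    // Splits text into the substrings matched by \X, in order.
    static List<String> clusters(String text) {
        return CLUSTER.matcher(text).results()
                      .map(MatchResult::group)
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "e" followed by COMBINING ACUTE ACCENT is one cluster, "a" is another.
        List<String> parts = clusters("e\u0301a");
        System.out.println(parts.size());
    }
}
```

The example uses a combining accent because how emoji ZWJ sequences are grouped depends on the JDK release, as noted at the top of this answer.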

(Updated as per Skomisa’s comment.)

Michael Allan
  • FYI, the missing support for grapheme clusters with `BreakIterator` that you mention has finally been implemented in Java 20. See [Grapheme support in BreakIterator](https://bugs.openjdk.org/browse/JDK-8291660) – skomisa Apr 26 '23 at 09:39

More than six years after this question was asked, an enhancement to properly process grapheme clusters within a String was finally implemented in Java 20, which was released a few weeks ago. See JDK-8291660 Grapheme support in BreakIterator.

There is no change to the API of the BreakIterator class, but its underlying code now correctly treats a grapheme cluster as a single unit rather than multiple characters.

Here is a sample application, using the method and data provided in the question without any changes:

import java.nio.charset.Charset;
import java.text.BreakIterator;

public class Main {

    public static void main(String[] args) {
        System.out.println("System.getProperty(\"java.version\"): " + System.getProperty("java.version"));
        System.out.println("Charset.defaultCharset():" + Charset.defaultCharset());
        Main.printStringInfo("👩‍👩‍👦‍👦");
        Main.printStringInfo("👻👩‍👩‍👦‍👦");
    }

    static void printStringInfo(String s) {
        System.out.print("\nCode points for the String " + s + ":");
        s.codePoints().mapToObj(Integer::toHexString).forEach(x -> System.out.print(x + " "));
        System.out.println("\nThe length of the String " + s + " using String.length() is " + s.length());
        System.out.println("The length of the String " + s + " using BreakIterator is " + Main.getLength(s));
    }

    // Returns the correct number of perceived characters in a String.
    // Requires JDK 20+ to work correctly.
    // Earlier Java releases will incorrectly just count the code points instead.
    // JDK-8291660 "Grapheme support in BreakIterator" (https://bugs.openjdk.org/browse/JDK-8291660) refers.
    public static int getLength(String emoji) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(emoji);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }
}

Here is the output, showing the correct grapheme counts (1 and 2) when using JDK 20:

C:\Java\jdk-20\bin\java.exe -javaagent:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\lib\idea_rt.jar=53642:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\bin -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.stderr.encoding=UTF-8 -classpath D:\II2023.1\Graphemes\out\production\Graphemes Main
System.getProperty("java.version"): 20-ea
Charset.defaultCharset():UTF-8

Code points for the String 👩‍👩‍👦‍👦:1f469 200d 1f469 200d 1f466 200d 1f466 
The length of the String 👩‍👩‍👦‍👦 using String.length() is 11
The length of the String 👩‍👩‍👦‍👦 using BreakIterator is 1

Code points for the String 👻👩‍👩‍👦‍👦:1f47b 1f469 200d 1f469 200d 1f466 200d 1f466 
The length of the String 👻👩‍👩‍👦‍👦 using String.length() is 13
The length of the String 👻👩‍👩‍👦‍👦 using BreakIterator is 2

Process finished with exit code 0

And here is the output for the identical code showing incorrect grapheme counts (7 and 8) when using JDK 17:

C:\Java\jdk-17.0.2\bin\java.exe -javaagent:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\lib\idea_rt.jar=53775:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\232.5150.116\bin -Dfile.encoding=UTF-8 -classpath D:\II2023.1\Graphemes\out\production\Graphemes Main
System.getProperty("java.version"): 17.0.2
Charset.defaultCharset():UTF-8

Code points for the String 👩‍👩‍👦‍👦:1f469 200d 1f469 200d 1f466 200d 1f466 
The length of the String 👩‍👩‍👦‍👦 using String.length() is 11
The length of the String 👩‍👩‍👦‍👦 using BreakIterator is 7

Code points for the String 👻👩‍👩‍👦‍👦:1f47b 1f469 200d 1f469 200d 1f466 200d 1f466 
The length of the String 👻👩‍👩‍👦‍👦 using String.length() is 13
The length of the String 👻👩‍👩‍👦‍👦 using BreakIterator is 8

Process finished with exit code 0

I tested this in IntelliJ IDEA 2023.1.1 Preview using Oracle OpenJDK version 20.0.1 and Oracle OpenJDK version 17.0.2.
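
If the same code has to run on JDKs both older and newer than 20, one possible approach is to pick the counting strategy from the runtime version. This is an assumption of this sketch rather than something tested above. `GraphemeCount` and its methods are hypothetical names, the threshold of 20 comes from JDK-8291660, and the regex fallback relies on the `\X` support described in the other answer, so results on JDKs older than 15 may still be wrong for emoji:

```java
import java.text.BreakIterator;
import java.util.regex.Pattern;

final class GraphemeCount {
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    // Same loop as getLength() above; grapheme-aware from JDK 20 onwards.
    static int viaBreakIterator(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Fallback for older runtimes: count matches of \X.
    static int viaRegex(String s) {
        return (int) CLUSTER.matcher(s).results().count();
    }

    // Runtime.version().feature() is the major release number (Java 10+).
    static int count(String s) {
        return Runtime.version().feature() >= 20
                ? viaBreakIterator(s)
                : viaRegex(s);
    }

    public static void main(String[] args) {
        System.out.println(count("abc"));
        System.out.println(count("e\u0301")); // "e" + combining acute accent
    }
}
```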

skomisa