
I need to find the frequency of graphemes in a Unicode-encoded string. Consider the input

String[] input = new String[]{"人物","Χαρακτήρες", "पात्र", "எழுத்துக்குறிகள்", "キャラクター"};

I'm using the Character.isUnicodeIdentifierStart(int codePoint) API to check whether a new letter has started. Will this work for all languages? Is it prone to error in some languages? Is there a better way to find the start and end of a letter in Unicode strings?

import java.util.*;
class Solution {
    public Map<String, Integer> findFrequency (String text) {
        
        Map<String, Integer> counts = new HashMap<>();
        
        int start = 0;
        for (int index = 1; index < text.length(); index++) {
            // If the current index is a valid start of a new Unicode character,
            // increase the frequency of the last seen character.
            if (Character.isUnicodeIdentifierStart(text.codePointAt(index))) {
                String unicodeChar = text.substring(start, index);
                counts.put(unicodeChar, counts.getOrDefault(unicodeChar, 0) + 1);
                start = index;
            }
        }
        
        String unicodeChar = text.substring(start, text.length());
        counts.put(unicodeChar, counts.getOrDefault(unicodeChar, 0) + 1);
        
        return counts;
    }
}

For example, take the fifth visible letter க் from "எழுத்துக்குறிகள்". It should be counted as one grapheme rather than counting க and ் separately; combined, they form the letter க்.
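To make the cluster concrete, here is a minimal illustrative sketch (not part of the original question) that prints the code points making up க்:

// Illustration only: the single visible letter க் consists of two code points.
"க்".codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
// Prints U+0B95 (TAMIL LETTER KA) and U+0BCD (TAMIL SIGN VIRAMA)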

  • So you are not looking at Unicode characters, but at grapheme clusters, right? The problem is that this is outside Unicode's scope (it varies by language, script type, country, historical period, etc.). You should just normalize the string (Unicode normalization) and then look at the characters (one, two, or more) at a time to fill the table (a normalization sketch follows these comments). If I remember correctly, you are lucky: Indic languages don't require looking in both directions. Maybe the Unicode standard or the OpenFont standard (by Microsoft) has more information about this. – Giacomo Catenazzi Aug 12 '21 at 12:12
  • @GiacomoCatenazzi yes you are correct. I need grapheme clusters. – Veera Kumar Aug 12 '21 at 12:32
  • [1] Your question is almost a duplicate of [What's the correct algorithm to determine number of user-perceived-characters?](https://stackoverflow.com/q/9097572/2985643), but since that question is over nine years old with no accepted answer, I'm loath to vote to close. [2] Also see [How to count grapheme clusters or “perceived” emoji characters in Java](https://stackoverflow.com/q/40878804/2985643). – skomisa Aug 13 '21 at 06:00
  • I think the wording of your question is potentially misleading and ambiguous. Rather than counting _unicode characters_ you actually want to count _perceived characters_ which may consist of multiple code points represented by _grapheme clusters_, right? If so, can you update your question and its title to be more precise and focused? – skomisa Aug 13 '21 at 06:17
  • @skomisa I have changed the title now. Thanks for those links. Will check those. – Veera Kumar Aug 13 '21 at 10:49
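As suggested in the first comment above, here is a minimal sketch of the normalization step, assuming NFC (the class name NormalizeDemo is purely illustrative). Note that normalization alone does not merge clusters such as க் that have no precomposed form, so grapheme cluster boundaries still have to be detected separately:

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String text = "எழுத்துக்குறிகள்";
        // NFC composes sequences into precomposed code points where such forms exist;
        // Tamil க் (KA + virama) has no precomposed form, so it remains two code points.
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        System.out.println("Code points after NFC: " + nfc.codePointCount(0, nfc.length()));
    }
}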

2 Answers


Use CharSequence.codePoints() to get a stream of Unicode code points, then group them:

Map<String, Long> frequencies =
    text.codePoints()
        .mapToObj(i -> new String(new int[]{i}, 0, 1))
        .collect(Collectors.groupingBy(a -> a, Collectors.counting()));

Alternatively, and more simply because you want String keys, you can just split the string into code points and then collect in the same way:

Map<String, Long> frequencies =
    Arrays.stream(text.split(""))
        .collect(Collectors.groupingBy(a -> a, Collectors.counting()));
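For completeness, here is a self-contained sketch wrapping the first variant (the class name CodePointFrequency is just for illustration). As the comments below point out, it counts each code point on its own, so க and ் end up as separate keys rather than the combined grapheme க்:

import java.util.Map;
import java.util.stream.Collectors;

public class CodePointFrequency {
    public static void main(String[] args) {
        String text = "எழுத்துக்குறிகள்";
        Map<String, Long> frequencies =
            text.codePoints()
                .mapToObj(cp -> new String(new int[]{cp}, 0, 1))
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
        // Each key is a single code point, not a grapheme cluster.
        frequencies.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}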
Andy Turner
  • That will work until they extend Unicode to 128 bits. – talex Aug 12 '21 at 11:37
  • Hi, your code gives the frequency of every code point. What I want is the frequency of combined meaningful letters. For example: க் consumes two code points; I want them to be considered as a single item. By considering each code point separately we count க and ், which will not give the expected result. – Veera Kumar Aug 12 '21 at 11:46
  • @talex Why do you think it works now? This answer ignores the issue of grapheme clustering. The OP explicitly gave an example of the problem in the question: _"For Example Take the fifth visible letter க் from "எழுத்துக்குறிகள்". **It should be counted as one** instead of க and ் counted separately which when combined forms the letter க்."_ – skomisa Aug 16 '21 at 16:34
  • @skomisa It was a joke. Currently it uses `long` to represent a code point, which has only 64 bits. That is enough for now, but in a highly hypothetical future Unicode could be extended beyond that. – talex Aug 16 '21 at 16:38
  • @skomisa this answer was written before OP edited to use the word "grapheme", or show the example at the end. – Andy Turner Aug 16 '21 at 16:53

First, a few general points:

  • I suspect that hand-coding your requirement for all possible cases is non-trivial. For example, how does Character.isUnicodeIdentifierStart() handle right-to-left Arabic text, and how should you handle data that is meaningless (i.e. not valid Unicode)?
  • Therefore, use existing libraries instead, which have (hopefully!) already catered for such issues. The JDK class java.text.BreakIterator should do exactly what you want, and there is helpful documentation on its use in Oracle's Java Tutorials, in the Detecting Text Boundaries section.
  • Also, the Unicode technical report Unicode Text Segmentation (UAX #29) goes into great detail on how to process graphemes; see section 3, Grapheme Cluster Boundaries.
  • Though not mentioned in your question, it is important to specify a language for the text being processed using a Locale, since some boundary rules are language dependent.

Here's code that counts the graphemes of the sample data provided in the OP, plus some Arabic text, using the BreakIterator class:

package graphemecounter;
import java.text.BreakIterator;
import java.util.Locale;

public class GraphemeCounter {

    public static void main(String[] args) {
        // Declare the texts to be processed.
        String houseInArabic = "\u0628" + "\u064e" + "\u064a" + "\u0652" + "\u067a" + "\u064f";
        String[] input = new String[]{"人物", "Χαρακτήρες", "पात्र", "எழுத்துக்குறிகள்", "キャラクター", "க்", houseInArabic};
        
        // Associate a locale with each of the texts to be processed.
        Locale[] locales = new Locale[] { 
            Locale.CHINESE,
            new Locale.Builder().setLanguage("gr").setRegion("GR").build(),
            new Locale.Builder().setLanguage("hi").setRegion("IN").build(),
            new Locale.Builder().setLanguage("ta").setRegion("IN").build(),
            Locale.JAPANESE,
            new Locale.Builder().setLanguage("ta").setRegion("IN").build(),
            new Locale.Builder().setLanguage("ar").build()
        };

        for (int i = 0; i < input.length; i++) {
            int count = GraphemeCounter.getGraphemesFromText(locales[i], input[i]);
            System.out.println("Grapheme count for [" + input[i] + "] is " + count);
            System.out.println("=======================================");
        }
    }

    public static int getGraphemesFromText(Locale loc, String text) {
        System.out.println("Sample data: " + text);
        BreakIterator bi = BreakIterator.getCharacterInstance(loc);
        bi.setText(text);
        int graphemeCount = 0;
        int prev;
        int next = bi.first();

        while (next != BreakIterator.DONE) {
            prev = next;
            next = bi.next();
            if (next != BreakIterator.DONE) { 
                graphemeCount++;
                String grapheme = text.substring(prev, next);
                System.out.println("Boundary detected: prev=" + prev + ", next=" + next + ", grapheme=[" + grapheme + "]");
            }
        }
        return graphemeCount; // Amend to return a list of graphemes instead, to get a total for each grapheme.
    }
}

Here's the output from running that code:

Sample data: 人物
Boundary detected: prev=0, next=1, grapheme=[人]
Boundary detected: prev=1, next=2, grapheme=[物]
Grapheme count for [人物] is 2
=======================================
Sample data: Χαρακτήρες
Boundary detected: prev=0, next=1, grapheme=[Χ]
Boundary detected: prev=1, next=2, grapheme=[α]
Boundary detected: prev=2, next=3, grapheme=[ρ]
Boundary detected: prev=3, next=4, grapheme=[α]
Boundary detected: prev=4, next=5, grapheme=[κ]
Boundary detected: prev=5, next=6, grapheme=[τ]
Boundary detected: prev=6, next=7, grapheme=[ή]
Boundary detected: prev=7, next=8, grapheme=[ρ]
Boundary detected: prev=8, next=9, grapheme=[ε]
Boundary detected: prev=9, next=10, grapheme=[ς]
Grapheme count for [Χαρακτήρες] is 10
=======================================
Sample data: पात्र
Boundary detected: prev=0, next=2, grapheme=[पा]
Boundary detected: prev=2, next=5, grapheme=[त्र]
Grapheme count for [पात्र] is 2
=======================================
Sample data: எழுத்துக்குறிகள்
Boundary detected: prev=0, next=1, grapheme=[எ]
Boundary detected: prev=1, next=2, grapheme=[ழ]
Boundary detected: prev=2, next=3, grapheme=[ு]
Boundary detected: prev=3, next=5, grapheme=[த்]
Boundary detected: prev=5, next=6, grapheme=[த]
Boundary detected: prev=6, next=7, grapheme=[ு]
Boundary detected: prev=7, next=9, grapheme=[க்]
Boundary detected: prev=9, next=10, grapheme=[க]
Boundary detected: prev=10, next=11, grapheme=[ு]
Boundary detected: prev=11, next=12, grapheme=[ற]
Boundary detected: prev=12, next=13, grapheme=[ி]
Boundary detected: prev=13, next=14, grapheme=[க]
Boundary detected: prev=14, next=16, grapheme=[ள்]
Grapheme count for [எழுத்துக்குறிகள்] is 13
=======================================
Sample data: キャラクター
Boundary detected: prev=0, next=1, grapheme=[キ]
Boundary detected: prev=1, next=2, grapheme=[ャ]
Boundary detected: prev=2, next=3, grapheme=[ラ]
Boundary detected: prev=3, next=4, grapheme=[ク]
Boundary detected: prev=4, next=5, grapheme=[タ]
Boundary detected: prev=5, next=6, grapheme=[ー]
Grapheme count for [キャラクター] is 6
=======================================
Sample data: க்
Boundary detected: prev=0, next=2, grapheme=[க்]
Grapheme count for [க்] is 1
=======================================
Sample data: بَيْٺُ
Boundary detected: prev=0, next=2, grapheme=[بَ]
Boundary detected: prev=2, next=4, grapheme=[يْ]
Boundary detected: prev=4, next=6, grapheme=[ٺُ]
Grapheme count for [بَيْٺُ] is 3
=======================================

Notes:

  • I used font Arial Unicode MS for both the code and the output. It was the only one I could find that supported all of those alphabets.
  • There are alternative ways to solve this issue, including using third party libraries and regular expressions, but this approach is the simplest.
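Since the question asks for per-grapheme frequencies rather than a total count, here is a hedged sketch of how the method above could be amended along the lines of the comment inside it (the class and method names are illustrative, not part of the original answer):

import java.text.BreakIterator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class GraphemeFrequency {

    // Illustrative variant of getGraphemesFromText(): returns a frequency table
    // of graphemes instead of a simple count.
    public static Map<String, Integer> countGraphemeFrequencies(Locale loc, String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BreakIterator bi = BreakIterator.getCharacterInstance(loc);
        bi.setText(text);
        int prev = bi.first();
        for (int next = bi.next(); next != BreakIterator.DONE; prev = next, next = bi.next()) {
            counts.merge(text.substring(prev, next), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Locale tamil = new Locale.Builder().setLanguage("ta").setRegion("IN").build();
        // க் should appear as a single key here, not as க and ் separately.
        System.out.println(countGraphemeFrequencies(tamil, "எழுத்துக்குறிகள்"));
    }
}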
skomisa