I have tried various ways to get the character count of a Unicode string. It looks like a small problem, but I am badly stuck.

Here I am trying to get the length of the string str1. I get 6, but the string actually has 3 characters; moving the cursor through the string "குமார்" also steps over 3 characters.

Basically, I want to measure the length and print each character, like "கு", "மா", "ர்".

public class one {
    public static void main(String[] args) {
        String str1 = new String("குமார்");
        System.out.print(str1.length());
    }
}

PS: The string is in Tamil.

Mifeet
user1611248
  • It doesn't make any difference for the problem, but there's no need to use `new String("...")`, just do: `String str1 = "குமார்";` – Jesper Apr 11 '13 at 11:52
  • See http://www.venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf for a paper concerning this problem. – halex Apr 11 '13 at 11:55
  • Blog is really very informative. But it doesn't give us an option in java to split the string into three meaningful chars. – user1611248 Apr 11 '13 at 12:11
  • twitter has a very good guide on how they count characters here: https://dev.twitter.com/docs/counting-characters – benathon Apr 11 '13 at 23:34
  • archive links for [the paper about Tamil encoding](https://web.archive.org/web/20061017080834/https://www.venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf) (@halex), the [twitter developer guide](https://web.archive.org/web/20130308034513/https://dev.twitter.com/docs/counting-characters#Java_Specific_Information) (@portforwardpodcast) and [a java code sample](https://web.archive.org/web/20071031003231/http://www.unicode.org/reports/tr15/Normalizer.html) linked from the twitter guide. – Joshua Goldberg Jun 29 '20 at 17:48

5 Answers


I found a solution to your problem.

Based on this SO answer, I wrote a program that uses regex character classes to match a letter followed by any optional modifiers. It splits your string into single (combined if necessary) characters and puts them into a list:

import java.util.*;
import java.lang.*;
import java.util.regex.*;

class Main
{
    public static void main (String[] args)
    {
        String s="குமார்";
        List<String> characters=new ArrayList<String>();
        Pattern pat = Pattern.compile("\\p{L}\\p{M}*");
        Matcher matcher = pat.matcher(s);
        while (matcher.find()) {
            characters.add(matcher.group());            
        }

        // Test if we have the right characters and length
        System.out.println(characters);
        System.out.println("String length: " + characters.size());

    }
}

where \\p{L} means a Unicode letter, and \\p{M} means a Unicode mark.

The output of the snippet is:

[கு, மா, ர்]
String length: 3

See https://ideone.com/Apkapn for a working Demo
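
For readers on a newer JDK, here is a minimal sketch of an alternative, assuming Java 9 or later, where `java.util.regex` supports `\X` for one extended grapheme cluster. The class and method names are made up for illustration, and it has not been checked against every Tamil compound, so treat it as a starting point rather than a complete answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GraphemeSplit {
    // \X matches one extended grapheme cluster (assumption: Java 9 or later)
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    // Returns the user-perceived characters of the text, one list entry per cluster
    static List<String> split(String text) {
        List<String> clusters = new ArrayList<>();
        Matcher matcher = CLUSTER.matcher(text);
        while (matcher.find()) {
            clusters.add(matcher.group());
        }
        return clusters;
    }

    public static void main(String[] args) {
        // "குமார்" written with escapes so the source file encoding does not matter
        String s = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";
        List<String> clusters = split(s);
        System.out.println(clusters);
        System.out.println("Cluster count: " + clusters.size());
    }
}
```

The trade-off is that the clustering rules come from the JDK rather than from a hand-written pattern, so special cases such as Grantha compounds may still need their own handling.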


EDIT

I have now checked my regex against all valid Tamil letters taken from the tables in http://en.wikipedia.org/wiki/Tamil_script. With the original regex, not all letters are captured correctly (every letter in the last row of the Grantha compound table is split into two letters), so I refined the regex as follows:

Pattern pat = Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");

With this pattern in place of the one above, you should be able to split your sentence into valid Tamil letters (as long as Wikipedia's table is complete).

The code I used for checking is the following one:

String s = "ஃஅஆஇஈஉஊஎஏஐஒஓஔக்ககாகிகீகுகூகெகேகைகொகோகௌங்ஙஙாஙிஙீஙுஙூஙெஙேஙைஙொஙோஙௌச்சசாசிசீசுசூசெசேசைசொசோசௌஞ்ஞஞாஞிஞீஞுஞூஞெஞேஞைஞொஞோஞௌட்டடாடிடீடுடூடெடேடைடொடோடௌண்ணணாணிணீணுணூணெணேணைணொணோணௌத்ததாதிதீதுதூதெதேதைதொதோதௌந்நநாநிநீநுநூநெநேநைநொநோநௌப்பபாபிபீபுபூபெபேபைபொபோபௌம்மமாமிமீமுமூமெமேமைமொமோமௌய்யயாயியீயுயூயெயேயையொயோயௌர்ரராரிரீருரூரெரேரைரொரோரௌல்லலாலிலீலுலூலெலேலைலொலோலௌவ்வவாவிவீவுவூவெவேவைவொவோவௌழ்ழழாழிழீழுழூழெழேழைழொழோழௌள்ளளாளிளீளுளூளெளேளைளொளோளௌற்றறாறிறீறுறூறெறேறைறொறோறௌன்னனானினீனுனூனெனேனைனொனோனௌஶ்ஶஶாஶிஶீஶுஶூஶெஶேஶைஶொஶோஶௌஜ்ஜஜாஜிஜீஜுஜூஜெஜேஜைஜொஜோஜௌஷ்ஷஷாஷிஷீஷுஷூஷெஷேஷைஷொஷோஷௌஸ்ஸஸாஸிஸீஸுஸூஸெஸேஸைஸொஸோஸௌஹ்ஹஹாஹிஹீஹுஹூஹெஹேஹைஹொஹோஹௌக்ஷ்க்ஷக்ஷாக்ஷிக்ஷீக்ஷுக்ஷூக்ஷெக்ஷேக்ஷைஷொக்ஷோஷௌ";
List<String> characters = new ArrayList<String>();
Pattern pat = Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");
Matcher matcher = pat.matcher(s);
while (matcher.find()) {
    characters.add(matcher.group());
}

System.out.println(characters);
System.out.println(characters.size() == 325);
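
As a rough sketch of how the refined pattern could be reused, the snippet below wraps it in a small helper. The class name `TamilLetterSplitter` and its methods are hypothetical; only the regex itself comes from this answer. The strings are written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TamilLetterSplitter {
    // Refined regex from this answer: the Grantha compound first,
    // then any letter followed by at most one mark
    private static final Pattern TAMIL_LETTER =
            Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");

    // Splits the text into the letters matched by the pattern
    static List<String> split(String text) {
        List<String> letters = new ArrayList<>();
        Matcher matcher = TAMIL_LETTER.matcher(text);
        while (matcher.find()) {
            letters.add(matcher.group());
        }
        return letters;
    }

    // Number of letters according to the pattern
    static int length(String text) {
        return split(text).size();
    }

    public static void main(String[] args) {
        String kumar = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD"; // "குமார்"
        System.out.println(split(kumar) + " -> " + length(kumar));
    }
}
```

Anything the pattern does not match (spaces, digits, punctuation) is skipped silently, so extra alternatives would be needed if those should be counted as well.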
Community
halex
  • Yes, I don't know if it handles all the cases which can happen in the Tamil language, but it's definitely elegant. – Mifeet Apr 11 '13 at 13:37
  • Thank you so much. Yes you are right. only the last row in Grantha table is made of two letters. ie 3 - 4 unicode symbols. The table you had referred in wikipedia is correct. It is the complete list. – user1611248 Apr 12 '13 at 02:42
  • What should be the regex if I have to include punctuation like "_". For example "குமார_கு" should return count 5. – user1611248 May 07 '13 at 13:16
  • @user1611248 Add `|\\p{P}` to the regex. `\\p{P}` is a punctuation character. See https://ideone.com/NvfDDq – halex May 07 '13 at 14:17
  • Might need more than punctuation. Whitespace/newlines, for instance? – Joshua Goldberg Jun 29 '20 at 18:59

Have a look at the Normalizer class. Its documentation explains what may be causing your problem. In Unicode, the same character can be encoded in several ways, e.g. Á:

  U+00C1    LATIN CAPITAL LETTER A WITH ACUTE

or

  U+0041    LATIN CAPITAL LETTER A
  U+0301    COMBINING ACUTE ACCENT

You can try to use Normalizer to convert your string to the composed form and then iterate over the characters.
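
A small sketch of what composition does and does not do, using `java.text.Normalizer`. The class and method names are only for illustration. It shows the decomposed Á shrinking to a single char under NFC, and lets you check the Tamil string from the question, which has no precomposed forms to fall back on.

```java
import java.text.Normalizer;

class NormalizerDemo {
    // Length in UTF-16 chars after converting to the composed form (NFC)
    static int nfcLength(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).length();
    }

    public static void main(String[] args) {
        // LATIN CAPITAL LETTER A followed by COMBINING ACUTE ACCENT
        String decomposed = "A\u0301";
        System.out.println(decomposed.length() + " -> " + nfcLength(decomposed));

        // "குமார்" written with escapes so the source file encoding does not matter
        String tamil = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";
        System.out.println(tamil.length() + " -> " + nfcLength(tamil));
    }
}
```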


Edit: Based on the article suggested by @halex above, try this in Java:

    import java.text.Normalizer;
    import java.util.ArrayList;

    String str = "குமார்";

    ArrayList<String> characters = new ArrayList<String>();
    str = Normalizer.normalize(str, Normalizer.Form.NFC);
    StringBuilder charBuffer = new StringBuilder();
    // Step by code point, not by char, so surrogate pairs are not read twice
    for (int i = 0; i < str.length(); ) {
        int codePoint = str.codePointAt(i);
        int category = Character.getType(codePoint);
        // Anything that is not a mark, control or symbol starts a new character
        if (charBuffer.length() > 0
                && category != Character.NON_SPACING_MARK
                && category != Character.COMBINING_SPACING_MARK
                && category != Character.CONTROL
                && category != Character.OTHER_SYMBOL) {
            characters.add(charBuffer.toString());
            charBuffer.delete(0, charBuffer.length());
        }
        charBuffer.appendCodePoint(codePoint);
        i += Character.charCount(codePoint);
    }
    // Flush the last buffered character
    if (charBuffer.length() > 0) {
        characters.add(charBuffer.toString());
    }
    System.out.println(characters);

The result I get is [கு, மா, ர்]. If it doesn't work for all your strings, try fiddling with other Unicode character categories in the if block.

Mifeet
  • Tried to normalize the string and measured the length. Still getting it as 6. If browser editor can identify it as 3 character with cursor navigation, dont we have a standard method in java to get it? – user1611248 Apr 11 '13 at 12:10
  • It is not correct in this case, but a good hint for other problems. +1 – Thorsten S. Apr 11 '13 at 13:05
  • The article also mentions "KSha", "Sri" and "Ayudham". I guess those will have to be handled as a special case. – Mifeet Apr 11 '13 at 13:21
  • Normalization is *only* a solution when there's a pre-composed letter for *every* letter in your string. pre-composed letters are *very rare* in Unicode and exist *almost exclusively* in latin alphabets (and mostly for round-trip compatibility with legacy, non-Unicode encodings). – Joachim Sauer Apr 11 '13 at 13:59
  • I thought there could be a problem in ordering of characters. I checked the ordering algorithm and you are right, the normalization was superfluous. – Mifeet Apr 11 '13 at 14:40
  • your `if ... else if` has 3 different ways. However, each one of the three contains the instruction `charBuffer.appendCodePoint(codePoint);`. You really should move that out, and you will find that you only need one condition and one way. – p91paul Apr 11 '13 at 16:28
  • @p91paul: Of course, you're right. That's the result of rewriting code in haste :). Thanks – Mifeet Apr 12 '13 at 20:02

This turns out to be really ugly... I have debugged your string and it contains the following characters (with their hex code points):

க 0x0b95
ு 0x0bc1
ம 0x0bae
ா 0x0bbe
ர 0x0bb0
் 0x0bcd

So the Tamil script obviously uses diacritic-like sequences to build its characters, and unfortunately each part counts as a separate entity.

This is not a problem with UTF-8 / UTF-16, as erroneously claimed by other answers; it is inherent in the Unicode encoding of Tamil.
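
A listing like the one above can be reproduced with a few lines of plain Java. This is only a sketch; the class name `CodePointDump` and its method are hypothetical, and `Character.getName` needs Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDump {
    // One line per code point: hex value plus the official Unicode name
    static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        text.codePoints().forEach(cp ->
                lines.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return lines;
    }

    public static void main(String[] args) {
        // "குமார்" written with escapes so the source file encoding does not matter
        String tamil = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";
        System.out.println("length() = " + tamil.length());
        for (String line : describe(tamil)) {
            System.out.println(line);
        }
    }
}
```

Since every code point here sits in the Basic Multilingual Plane, each one occupies exactly one UTF-16 char, which is why `length()` and the number of code points agree.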

The suggested Normalizer does not work; it seems that Tamil has been designed by Unicode "experts" to explicitly use combining sequences which cannot be normalized. Aargh.

My next idea is not to count characters, but glyphs, the visual representations of characters.

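import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;
import java.awt.geom.AffineTransform;
import java.text.Normalizer;

String str1 = Normalizer.normalize("குமார்", Normalizer.Form.NFC);

Font display = new Font("SansSerif", Font.PLAIN, 12);
GlyphVector vec = display.createGlyphVector(new FontRenderContext(new AffineTransform(), false, false), str1);

// Number of glyphs the font produces for the string
System.out.println(vec.getNumGlyphs());
// Print each char, its hex value and the bounds of the glyph at the same index
for (int i = 0; i < str1.length(); i++)
    System.out.printf("%s %s %s %n", str1.charAt(i), Integer.toHexString((int) str1.charAt(i)), vec.getGlyphVisualBounds(i).getBounds2D().toString());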

The result:

க b95 [x=0.0,y=-6.0,w=7.0,h=6.0]
ு bc1 [x=8.0,y=-6.0,w=7.0,h=4.0]
ம bae [x=17.0,y=-6.0,w=6.0,h=6.0]
ா bbe [x=23.0,y=-6.0,w=5.0,h=6.0]
ர bb0 [x=30.0,y=-6.0,w=4.0,h=8.0]
் bcd [x=31.0,y=-9.0,w=1.0,h=2.0]

As the glyphs are intersecting, you need to use Java character type functions like in the other solution.

SOLUTION:

I am using this link: http://www.venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf

public static int getTamilStringLength(String tamil) {
    // Count the dependent (combining) signs, then subtract them from the UTF-16 length
    int dependentCharacterLength = 0;
    for (int index = 0; index < tamil.length(); index++) {
        char code = tamil.charAt(index);
        if (code == 0x0B82)                          // anusvara sign
            dependentCharacterLength++;
        else if (code >= 0x0BBE && code <= 0x0BC8)   // vowel signs
            dependentCharacterLength++;
        else if (code >= 0x0BCA && code <= 0x0BD7)   // rest of the dependent-sign range, including the virama
            dependentCharacterLength++;
    }
    return tamil.length() - dependentCharacterLength;
}

You need to identify the combining characters and exclude them from the count.

Thorsten S.

As has been mentioned, your string contains 6 distinct code points. Half of them are letters; the other half are combining marks (two vowel signs and a virama).

You could use the transforms built into the ICU4J library to remove every code point that is not a letter, which strips the combining marks, using the rule:

[:^Letter:] Remove

and count the resulting string. Try it out on their demo site:

http://demo.icu-project.org/icu-bin/translit

I wouldn't display the resultant string to an end user, and I'm not an expert so the rules may need to be tweaked to get to the general case but it's a thought.
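
For trying the idea without adding a dependency, here is a plain-JDK approximation. It is not the ICU4J API: it swaps the transliterator for a regex that deletes everything outside the Unicode Letter category and then counts what is left. The class and method names are made up for this sketch.

```java
class LetterOnlyCount {
    // Removes every code point that is not a letter, then counts the remaining code points
    static int countLetters(String text) {
        String lettersOnly = text.replaceAll("\\P{L}", "");
        return lettersOnly.codePointCount(0, lettersOnly.length());
    }

    public static void main(String[] args) {
        // "குமார்" written with escapes so the source file encoding does not matter
        String tamil = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";
        System.out.println(countLetters(tamil));
    }
}
```

As with the ICU rule, this counts base letters only, so a compound built from two letters joined by a virama is counted as two.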

Charlie
  • Whether it contains 6 characters or 3 depends entirely on your definition of "character". Unfortunately, that word is not well-defined and used in a variety of incompatible ways. Your statement is only correct if you take "character" to mean "code point". –  Apr 11 '13 at 12:53

This is the newer way to calculate the length of a Java String, counting Unicode code points instead of UTF-16 chars.

int unicodeLength = str.codePointCount(0, str.length());
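
A short sketch of where `codePointCount` differs from `length()` and where it does not. The class and method names are illustrative only. Characters outside the Basic Multilingual Plane take two chars but one code point, while the Tamil string from the question consists of six separate code points either way.

```java
class CodePointCountDemo {
    // Number of Unicode code points in the string
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 encoded as a surrogate pair: two chars, one code point
        String emoji = "\uD83D\uDE00";
        System.out.println(emoji.length() + " chars, " + codePoints(emoji) + " code point(s)");

        // "குமார்" written with escapes so the source file encoding does not matter
        String tamil = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";
        System.out.println(tamil.length() + " chars, " + codePoints(tamil) + " code point(s)");
    }
}
```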
jordiburgos
  • The 3-character Tamil string gives 6 code points, the same result as str.length() if you look at it with `codePointCount()` or `codePoints()`. It may work in other languages, however. (I believe this is the intent of code points.) – Joshua Goldberg Jun 29 '20 at 17:48