This turns out to be really ugly.
I debugged your string; it contains the following characters (with their hexadecimal code points):
க 0x0b95
ு 0x0bc1
ம 0x0bae
ா 0x0bbe
ர 0x0bb0
் 0x0bcd
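A listing like the one above can be produced with a few lines of Java. This is only a minimal sketch; the class name `DumpCodePoints` and the helper `hexCodes` are made up for illustration, and it assumes all characters are in the Basic Multilingual Plane (true for Tamil letters), so one `char` equals one code point.

```java
public class DumpCodePoints {
    // Formats each UTF-16 char of the string as "0x" followed by four hex digits.
    static String[] hexCodes(String s) {
        String[] out = new String[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = String.format("0x%04x", (int) s.charAt(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // The string from the question, written with escapes to avoid source-encoding issues.
        String s = "\u0b95\u0bc1\u0bae\u0bbe\u0bb0\u0bcd";
        String[] codes = hexCodes(s);
        for (int i = 0; i < s.length(); i++) {
            System.out.println(s.charAt(i) + " " + codes[i]);
        }
    }
}
```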
So the Tamil script evidently uses diacritic-like sequences to build
its letters, and each part unfortunately counts as a separate character.
This is not a problem with UTF-8 / UTF-16, as erroneously claimed by
other answers; it is inherent in how Unicode encodes the Tamil
script.
The suggested Normalizer does not work: it seems Unicode deliberately
encodes Tamil with combining sequences that have no precomposed form,
so there is nothing to normalize them to. Aargh.
My next idea was to count not characters but glyphs, the visual
representations of characters.
import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;
import java.awt.geom.AffineTransform;
import java.text.Normalizer;

String str1 = Normalizer.normalize("குமார்", Normalizer.Form.NFC);
Font display = new Font("SansSerif", Font.PLAIN, 12);
GlyphVector vec = display.createGlyphVector(new FontRenderContext(new AffineTransform(), false, false), str1);
System.out.println(vec.getNumGlyphs());
// createGlyphVector maps chars to glyphs one-to-one, so glyph i belongs to char i
for (int i = 0; i < str1.length(); i++)
    System.out.printf("%s %s %s %n", str1.charAt(i), Integer.toHexString(str1.charAt(i)), vec.getGlyphVisualBounds(i).getBounds2D());
The result:
க b95 [x=0.0,y=-6.0,w=7.0,h=6.0]
ு bc1 [x=8.0,y=-6.0,w=7.0,h=4.0]
ம bae [x=17.0,y=-6.0,w=6.0,h=6.0]
ா bbe [x=23.0,y=-6.0,w=5.0,h=6.0]
ர bb0 [x=30.0,y=-6.0,w=4.0,h=8.0]
் bcd [x=31.0,y=-9.0,w=1.0,h=2.0]
Because the glyph bounds overlap (the ் sits on top of the ர), counting
glyphs does not help either; you need to use Java's character type
functions, as in the other solution.
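The character-type approach could look roughly like this. It is a sketch under my own assumptions, not the code from the other answer: `MarkAwareLength` and `countBaseCharacters` are illustrative names, and it simply skips every code point whose Unicode general category is a mark (Mn, Mc or Me).

```java
public class MarkAwareLength {
    // Counts code points that are not combining marks.
    static int countBaseCharacters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int type = Character.getType(cp);
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (!isMark) {
                count++;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return count;
    }

    public static void main(String[] args) {
        // The string from the question: three base letters plus three marks.
        System.out.println(countBaseCharacters("\u0b95\u0bc1\u0bae\u0bbe\u0bb0\u0bcd"));
    }
}
```

Note that this counts base characters, which is not always the same as what a reader perceives as one letter: a conjunct built from two consonants joined by a pulli would still count as two.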
SOLUTION:
I based this on the following document: http://www.venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf
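public static int getTamilStringLength(String tamil) {
    int dependentCharacterLength = 0;
    for (int index = 0; index < tamil.length(); index++) {
        char code = tamil.charAt(index);
        if (code == 0x0B82)                          // anusvara
            dependentCharacterLength++;
        else if (code >= 0x0BBE && code <= 0x0BC8)   // vowel signs AA..AI
            dependentCharacterLength++;
        else if (code >= 0x0BCA && code <= 0x0BCD)   // vowel signs O..AU, pulli
            dependentCharacterLength++;
        else if (code == 0x0BD7)                     // AU length mark
            dependentCharacterLength++;
        // U+0BD0 (TAMIL OM) is a standalone letter, so it is not treated as dependent
    }
    return tamil.length() - dependentCharacterLength;
}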
The idea is to identify the combining (dependent) characters and leave them out of the count.