
Is there an efficient way to count characters in non-English text? For example, the English word "Mother" has 6 letters. The same word typed in Tamil (மதர்) has three letters (ம + த + ர்), but the system treats the last letter (ர்) as two characters (ர + ் = ர்). Is there any way to count the number of real characters?

One clue: moving the keyboard cursor through the word (மதர்) steps through only 3 letters, not the 4 characters the system counts. Is there a way to build a solution on that? Any help would be greatly appreciated.

Stranger
  • Some sort of static map lookup? Just out of curiosity, where are you needing this? – Vaibhav Desai Dec 11 '12 at 07:44
  • I guess [this is a related question](http://stackoverflow.com/questions/2315488/using-javascript-how-can-i-count-a-mix-of-asian-characters-and-english-words). Maybe not.. I'm just helping – Ron van der Heijden Dec 11 '12 at 07:47
  • This is a difficult problem. You might want to first apply Normalization Form D (Canonical Decomposition), so that seemingly equal strings actually *are* equal, and then count the extended grapheme clusters. Presumably JavaScript has library tools for this (and if not, it should ;) – DavidO Dec 11 '12 at 07:48

2 Answers


Update

Back from lunch =) I'm afraid my earlier approach (below) won't work well for every language, so I added another fiddle with a possible way:

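// Must hold all 1280 escaped Unicode nonspacing marks (category Mn),
// in the form escape() returns them; the entries are elided here.
var UnicodeNsm = [/* ...1280 entries... */];

// Counts the UTF-16 code units of str that are not listed in UnicodeNsm.
function countNSMString(str) {
    var chars = str.split("");
    var count = 0;
    for (var i = 0, ilen = chars.length; i < ilen; i++) {
        // escape() is deprecated, but it matches the escaped entries in UnicodeNsm
        if (UnicodeNsm.indexOf(escape(chars[i])) === -1) {
            count++;
        }
    }
    return count;
}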

var English = "Mother";  
var Tamil = "மதர்";
var Vietnamese = "mẹ"
var Hindi = "मां"

function logL(str) {
    // e.g. "மதர் has 3 visible Characters and 4 normal Characters"
    console.log(str + " has " + countNSMString(str) + " visible Characters and " + str.length + " normal Characters");
}

logL(English) //"Mother has 6 visible Characters and 6 normal Characters"
logL(Tamil) //"மதர் has 3 visible Characters and 4 normal Characters"
logL(Vietnamese) //"mẹ has 2 visible Characters and 3 normal Characters"
logL(Hindi) //"मां has 1 visible Characters and 3 normal Characters"

This simply checks each character of the string against the list of Unicode nonspacing marks (NSM) and leaves those out of the count. It should work for most languages, not only Tamil, and an array of 1280 elements shouldn't be much of a performance issue.
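As a possible alternative to maintaining the 1280-element array by hand, here is a minimal sketch that uses Unicode property escapes instead. It assumes an ES2018+ engine, and the function names `countWithoutNonspacingMarks` and `countWithoutMarks` are made up for this illustration. Note that category Mn does not include spacing marks (category Mc), which some Tamil vowel signs belong to, so the second variant strips every mark category.

```javascript
// Sketch only: requires an engine with Unicode property escapes (ES2018+).

// Mirrors the NSM idea above: drop nonspacing marks (category Mn), count the rest.
function countWithoutNonspacingMarks(str) {
  // Array.from counts code points, so characters outside the BMP count once.
  return Array.from(str.replace(/\p{Mn}/gu, "")).length;
}

// Broader variant: drop all marks (Mn, Mc and Me).
function countWithoutMarks(str) {
  return Array.from(str.replace(/\p{M}/gu, "")).length;
}

console.log(countWithoutNonspacingMarks("Mother")); // 6
console.log(countWithoutNonspacingMarks("\u0BAE\u0BA4\u0BB0\u0BCD")); // 3 (மதர்)
console.log(countWithoutMarks("\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD")); // 3 (தமிழ்)
```

Stripping marks is still only an approximation of what a reader perceives as one character, so it is worth checking the result against real text in each target language.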

Here is a list of the Unicode NSMs: http://www.fileformat.info/info/unicode/category/Mn/list.htm

Here is the corresponding JSBin


After experimenting a bit with string operations, it turns out that "ர்" is stored as the two code units "ர" + "்", so String.indexOf returns the same index for both:

"ர்ரர".indexOf("ர்") == "ர்ரர".indexOf("ர" + "்") //true but
"ர்ரர".indexOf("ர") == "ர்ரர".indexOf("ர" + "ர") //false

I used this to try something like the following:

//ர்

var char = "ரர்ர்ரர்்";
var char2 = "ரரர்ர்ரர்்";    
var char3 = "ர்ரர்ர்ரர்்";

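function countStr(str) {
    var chars = str.split("");
    var count = 0;
    for (var i = 0, ilen = chars.length; i < ilen; i++) {
        // Pair the current code unit with the one after it
        var chars2 = chars[i] + chars[i + 1];
        // If both first occur at the same index, treat the pair as one character
        if (str.indexOf(chars[i]) == str.indexOf(chars2)) {
            i += 1;
        }
        count++;
    }
    return count;
}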


console.log("--");

console.log(countStr(char)); //6

console.log(countStr(char2)); //7

console.log(countStr(char3)); //7

This seems to work for the strings above. It may need some adjustments, as I don't know much about encodings, but maybe it's a point you can start from.

Here's the JSBin

Stranger
Moritz Roessler

You can ignore combining marks in the count calculation with this function:

function charCount( str ) {
    var re = /[\u0300-\u036f\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f\u0b82\u0b83\u0bbe\u0bbf\u0bc0-\u0bc2\u0bc6-\u0bc8\u0bca-\u0bcd\u0bd7]/g
    return str.replace( re, "").length;
}

console.log(charCount('மதர்'))// 3

//More tests on random Tamil text:
//Paint the text character by character to verify, for instance 'யெ' is a single character, not 2

console.log(charCount("மெய்யெழுத்துக்கள்")); //9
console.log(charCount("ஒவ்வொன்றுடனும்")); //8
console.log(charCount("தமிழ்")); //3
console.log(charCount("வருகின்றனர்.")); //8
console.log(charCount("எழுதப்படும்")); //7

The Tamil signs and marks are not composed into single characters with their target character in Unicode, so normalization wouldn't help. I added all the Tamil combining marks and signs to the regex manually, and it also includes the ranges for general combining marks, so charCount("ä") is 1 regardless of normalization form.
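For comparison, engines that ship `Intl.Segmenter` can count extended grapheme clusters directly, without a hand-written list of marks. The following is a minimal sketch under that assumption (for example a recent Node.js or browser); the helper name `countGraphemes` is hypothetical and not part of the answer above.

```javascript
// Sketch only: assumes the runtime provides Intl.Segmenter.
function countGraphemes(str, locale) {
  // "grapheme" granularity splits the text into user-perceived characters.
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

console.log(countGraphemes("Mother")); // 6
console.log(countGraphemes("\u0BAE\u0BA4\u0BB0\u0BCD", "ta")); // 3 (மதர்)
console.log(countGraphemes("\u00E4")); // 1 (precomposed ä)
console.log(countGraphemes("a\u0308")); // 1 (a + combining diaeresis)
```

The last two calls show that the count does not depend on the normalization form. Cluster boundaries follow the Unicode data built into the runtime, so results for less common scripts can differ between engine versions.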

Esailija
  • Hey Esailija, it works nicely for Tamil. But is there a good solution for all languages? – Stranger Dec 12 '12 at 07:09
  • @Udhay Yes, I would just need to add them to the regex – Esailija Dec 12 '12 at 07:24
  • I'm not good at regex. Can you please explain the regex you used here, so that I can write one for other languages? – Stranger Dec 12 '12 at 07:48
  • @Udhay it strips out the code points mentioned in regex. For example, `\u0300-\u036f` strips out all code points in the range `U+0300-U+036f` and `\u0bd7` strips out the code point `U+0bd7`. It's simply a list of code point ranges and individual code points that are not considered for character calculation. – Esailija Dec 12 '12 at 07:52
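To make the "list of code point ranges" idea concrete, here is a small sketch that builds such a regex from an array of `[start, end]` pairs, which may be easier to extend for other scripts than editing the pattern by hand. The helpers `buildMarkRegex` and `charCountWith` are hypothetical names introduced for this illustration, and the ranges shown are only a subset of the ones in the answer's regex.

```javascript
// Sketch only: turns [start, end] code point pairs into a character-class regex.
function buildMarkRegex(ranges) {
  // \u{...} escapes require the `u` flag on the resulting RegExp.
  const toEscape = (cp) => "\\u{" + cp.toString(16) + "}";
  const body = ranges
    .map(([start, end]) =>
      start === end ? toEscape(start) : toEscape(start) + "-" + toEscape(end))
    .join("");
  return new RegExp("[" + body + "]", "gu");
}

// Subset of the answer's list; a single code point is written as [x, x].
const marks = buildMarkRegex([
  [0x0300, 0x036f], // general combining diacritical marks
  [0x0bbe, 0x0bc2], // Tamil vowel signs (same as \u0bbe\u0bbf\u0bc0-\u0bc2)
  [0x0bcd, 0x0bcd], // Tamil pulli
]);

// Strips the listed code points, then counts what is left.
function charCountWith(regex, str) {
  return str.replace(regex, "").length;
}

console.log(charCountWith(marks, "\u0BAE\u0BA4\u0BB0\u0BCD")); // 3 (மதர்)
```

Supporting another script would then mean appending that script's combining-mark ranges to the array.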