8

Is there a way to split a Kannada word into its syllabic clusters using JavaScript?

For example, I want to split the word ಕನ್ನಡ into the syllabic clusters ["ಕ", "ನ್ನ", "ಡ"]. But when I split it with split(''), the array I actually get is ["ಕ", "ನ", "್", "ನ", "ಡ"].


mpsbhat
  • Probably there is no built-in way to do this with pure JS; you may need a library that is aware of the language and how to do correct word segmentation for it. – Karl Reid Jun 01 '17 at 12:31
  • This might be of some interest: http://unicode.org/charts/PDF/U0C80.pdf (that word consists of 5 Unicode characters...) – Suraj Rao Jun 01 '17 at 12:33
  • Is there a simpler way to do the same? Working with the Unicode values may require a lot of coding and looping. – mpsbhat Jun 01 '17 at 12:36
  • Could you use a regex to split by characters in the alphabet and not vowels? – haakym Jun 01 '17 at 12:48
  • A regex can be used for predefined words, but if the words are generated randomly on the server side, we suspect the regex would get very long. – mpsbhat Jun 01 '17 at 12:52

2 Answers

3

I can't claim this is a complete solution, but it works to an extent, given a basic understanding of how words are formed:

var k = 'ಕನ್ನಡ';
var arr = [];
for (var i = 0; i < k.length; i++) {
  var s = k.charAt(i);

  // Keep appending while the next char is not a swara/vyanjana (U+0C85..U+0CB9),
  // or while the current char is a virama (U+0CCD).
  // The parentheses matter: the length check must guard all three conditions.
  while ((i + 1) < k.length &&
         (k.charCodeAt(i + 1) < 0xC85 || k.charCodeAt(i + 1) > 0xCB9 || k.charCodeAt(i) === 0xCCD)) {
    s += k.charAt(i + 1);
    i++;
  }
  arr.push(s);
}
console.log(arr);

As the comment in the code says, we keep appending characters to the current cluster as long as the next character is not a swara or vyanjana, or the current character is a virama. You will need to test with different words to make sure the various cases are covered. In particular, this version does not handle Kannada digits.
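The same idea can be wrapped in a reusable function. The following is only a sketch: the names isClusterStart and splitClusters are made up for illustration, and the digit range (U+0CE6..U+0CEF) is an addition taken from the Kannada code chart so that digits start their own cluster. Characters outside the Kannada block (spaces, Latin letters) would still be glued to the preceding cluster.

```javascript
// Hypothetical helper: true for code units that should begin a new cluster.
// Ranges are taken from the Kannada code chart (U+0C80..U+0CFF).
function isClusterStart(code) {
  const isLetter = code >= 0x0C85 && code <= 0x0CB9; // swaras and vyanjanas
  const isDigit = code >= 0x0CE6 && code <= 0x0CEF;  // Kannada digits (assumed extension)
  return isLetter || isDigit;
}

// Hypothetical helper: splits a Kannada word into syllabic clusters.
function splitClusters(word) {
  const clusters = [];
  for (let i = 0; i < word.length; i++) {
    let s = word.charAt(i);
    // Absorb the next char if it cannot start a cluster,
    // or if the current char is a virama (U+0CCD) joining two consonants.
    while (i + 1 < word.length &&
           (!isClusterStart(word.charCodeAt(i + 1)) ||
            word.charCodeAt(i) === 0x0CCD)) {
      s += word.charAt(i + 1);
      i++;
    }
    clusters.push(s);
  }
  return clusters;
}

console.log(splitClusters('ಕನ್ನಡ')); // [ 'ಕ', 'ನ್ನ', 'ಡ' ]
```
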

For the character codes, refer to this chart: http://www.unicode.org/charts/PDF/U0C80.pdf

bugs_cena
  • not really sure why this answer got downvoted -:) here's a plunk to show the working - open the console and check the logs - http://plnkr.co/edit/gMZZ7ZlttkmsDDqllrFd?p=preview – bugs_cena Jun 01 '17 at 13:41
2

Consider using the "InSC" (Indic_Syllabic_Category) property associated with Unicode characters, which you can look up in the Unicode Character Database. (You might also want to consult the general "category", to see if it is "non-spacing mark".) For instance, "್" has the type "Virama" (see http://graphemica.com/0CCD). To take another example, "ಿ" (KANNADA VOWEL SIGN I) has an InSC of "Vowel_Dependent" (and is also in the "non-spacing mark" category). You could potentially then detect which individual code points need to be combined with others, and put complete characters back together, as follows:

const graphemes = [..."ಕನ್ನಡ"];

console.log("graphemes are", graphemes);

const rebuild = [graphemes[0], graphemes.slice(1, 4).join(''), graphemes[4]];

console.log(rebuild);

Even if you can make this work, you'll have more work to do. It's unclear to me how you would detect that the three characters "ನ", "್", and "ನ" are to be combined, rather than treated as the two characters "ನ್" and "ನ". The problem is that in this case the virama is used to indicate a consonant cluster, so you would need to identify the X-V-X pattern (where V is virama) and treat that as one combined character. There are probably many, many other such special cases.
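One way to approximate the X-V-X rule without a separate database is a regular expression with Unicode property escapes (\p{L} for letters, \p{M} for combining marks; these need the u flag and an ES2018 engine). This is a rough sketch, not a full implementation of syllable clustering: it uses general categories rather than InSC, the names CLUSTER and syllabicClusters are made up here, and it ignores special cases such as ZWJ/ZWNJ.

```javascript
// One cluster = a letter, then any number of (virama + letter) pairs
// or combining marks. The virama + letter alternative is listed first so
// that a virama followed by a consonant joins the two consonants.
// Anything else (digits, spaces, stray marks) falls through as a single code point.
const CLUSTER = /\p{L}(?:\u0CCD\p{L}|\p{M})*|[\s\S]/gu;

// Hypothetical helper: returns the clusters found in the word.
function syllabicClusters(word) {
  return word.match(CLUSTER) || [];
}

console.log(syllabicClusters('ಕನ್ನಡ')); // [ 'ಕ', 'ನ್ನ', 'ಡ' ]
```

A virama at the end of a word has no following letter, so it is picked up by the \p{M} branch and stays attached to its consonant.
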

This might be of interest: https://www.microsoft.com/typography/OpenTypeDev/kannada/intro.htm. It talks about finding "syllable clusters", in this particular case as a prelude to rendering the characters graphically. You may also want to take a look at http://www.unicode.org/L2/L2003/03068-kannada.pdf.