Tamil character segmentation using Java

Question

In the code below I try to read the lines with a specific length and write them to another Notepad file. This code works well for English characters, but when I try to count Tamil letters, it counts like this:

(e.g.) தமிழ்

It counts this as 5, i.e. "த", "ம", "ி", "ழ" and "்", but I want to count it as 3, i.e. "த", "மி" and "ழ்".

I want to apply this logic to multiple words from a text file.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;

public class ii {

    public static void main(String[] args) {
        FileReader fr = null;
        BufferedReader br = null;
        FileWriter fw = null;
        BufferedWriter bw = null;

        String[] stringArray;
        int counLine = 0;
        int arrayLength;
        String s = "";
        String stringLine = "";

        try {
            fr = new FileReader("F:\\New folder (2)\\N.txt");
            fw = new FileWriter("F:\\New folder (2)\\o.txt");
            br = new BufferedReader(fr);
            bw = new BufferedWriter(fw);
            while ((s = br.readLine()) != null) {
                stringLine = stringLine + s;
                stringLine = stringLine + " ";
                counLine++;
            }
            stringArray = stringLine.split(" ");
            arrayLength = stringArray.length;
            for (int i = 0; i < arrayLength; i++) {
                int c = 1;
                for (int j = i + 1; j < arrayLength; j++) {
                    if (stringArray[i].equalsIgnoreCase(stringArray[j])) {
                        c++;
                        for (int j2 = j; j2 < arrayLength; j2++) {
                        }
                    }
                    int k;
                    for (k = 2; k == stringArray[i].length(); i++) {
                        bw.write(stringArray[i]);
                        bw.newLine();
                    }
                }
            }
            fr.close();
            br.close();
            bw.flush();
            bw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Possible duplicate. http://stackoverflow.com/questions/15947992/java-unicode-string-length — hamnix, Sep 16 '16 at 07:54

score 1 · Answer 1 · answered Sep 16 '16 at 07:56

One way is to iterate through the characters with a BreakIterator, and count them yourself. (untested code)

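import java.text.BreakIterator;

int characterCount = 0;
BreakIterator iterator = BreakIterator.getCharacterInstance();
iterator.setText("தமிழ்");
iterator.first();

// Each successful next() steps over one character; counting the
// starting boundary as well would make the result one too high.
while (iterator.next() != BreakIterator.DONE) {
    characterCount++;
}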

See also http://docs.oracle.com/javase/tutorial/i18n/text/char.html
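To apply the same idea to many words rather than one hard-coded string, the counting can be wrapped in a method and called for each whitespace-separated word. This is only a sketch: the class name `BreakIteratorCount` and the method `countCharacters` are made up for illustration, and what `BreakIterator` treats as one character in Tamil text may differ between JDK versions, so check the counts on your own runtime.

```java
import java.text.BreakIterator;

final class BreakIteratorCount {

    // Counts the characters that BreakIterator steps over in the text.
    static int countCharacters(String text) {
        BreakIterator iterator = BreakIterator.getCharacterInstance();
        iterator.setText(text);
        int count = 0;
        iterator.first();
        while (iterator.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "தமிழ் குமார் abc";
        // Split on whitespace and report the count for every word.
        for (String word : line.trim().split("\\s+")) {
            System.out.println(word + " -> " + countCharacters(word));
        }
    }
}
```

The same loop can be run over each line returned by `readLine()` when the words come from a file.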

answered Sep 16 '16 at 07:56

Joni

But instead of a specific character, I want to count multiple characters from a file – Surya Sep 16 '16 at 08:09
That is what BreakIterator does – Joni Sep 16 '16 at 08:10
This gives a count of 36 for the 3-character Tamil string "குமார்" (from [this question](https://stackoverflow.com/questions/15947992/java-unicode-string-length)). It gave the same result when I tried `getCharacterInstance(new Locale("ta", "IN"))` ([mentioned here](https://stackoverflow.com/questions/17292575/how-to-get-all-java-supported-locales)) – Joshua Goldberg Jun 29 '20 at 18:34

score 0 · Answer 2 · answered Sep 16 '16 at 07:03

This is basically caused by an encoding problem, so first change the text file encoding of your Java project with the following steps:

Right-click your project name => select Properties => select Resource => Text file encoding => choose Other and select UTF-8 as the encoding.

This will resolve your issue.

score 0 · Answer 3 · edited May 23 '17 at 11:53

Notepad doesn't support UTF characters by default; it uses ANSI instead. However, your problem is not due to this.

Your program should know which encoding it is going to use while reading or writing. There is no magic. You need to set the encoding (e.g. UTF-8). The constructor of FileReader uses the platform's default encoding, which clearly won't work for you.

I guess you need:

Reader reader = new InputStreamReader(new FileInputStream("c:/foo.txt"), "UTF-8");

Read file and write file which has characters in UTF - 8 (different language)
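As a rough sketch of the same point for both reading and writing, the snippet below copies a file line by line with UTF-8 named explicitly on each side. The class name `Utf8Copy`, the method `copyLines`, and the temporary files are placeholders for illustration, not part of the original code. Decoding the file correctly is a prerequisite, but it does not by itself merge a consonant and its vowel sign into one counted unit.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8Copy {

    // Copies a text file line by line, decoding and encoding as UTF-8
    // instead of relying on the platform default charset.
    static void copyLines(Path input, Path output) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical temporary files; substitute your own paths.
        Path input = Files.createTempFile("tamil-in", ".txt");
        Path output = Files.createTempFile("tamil-out", ".txt");
        Files.write(input, "தமிழ்\n".getBytes(StandardCharsets.UTF_8));
        copyLines(input, output);
        System.out.println(new String(Files.readAllBytes(output), StandardCharsets.UTF_8));
    }
}
```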

edited May 23 '17 at 11:53

Community

answered Sep 16 '16 at 07:05

Sanjeev Dhiman

Whatever we do, the character count will be 5. How do we get the real count, i.e. 3? – Ajay Sreeram Sep 16 '16 at 07:10

score 0 · Answer 4 · answered Nov 20 '16 at 19:40

This is because the string length counts both Unicode marks and Unicode letters. To ignore the Unicode marks, you can use a regular expression as below:

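import java.util.regex.*;
......
String word = "தமிழ்";
// Match each character that is neither in the Tamil sign range
// U+0BBE to U+0BCD (vowel signs and pulli) nor a literal '.'
String regex = "[^\u0bbe-\u0bcd.]";
Pattern r = Pattern.compile(regex);
Matcher m = r.matcher(word);
int count = 0;
while (m.find()) count++;
System.out.print(count);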
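A variation on the same idea, sketched here under my own assumptions rather than taken from the answer, is to skip every code point whose Unicode general category is a combining mark instead of hard-coding the Tamil range, and then to keep only the words with the wanted count. The names `TamilLetterCount`, `countLetters` and `wordsWithLetterCount` are hypothetical. Under this rule a conjunct such as க்ஷ still counts as two letters.

```java
import java.util.ArrayList;
import java.util.List;

final class TamilLetterCount {

    // Counts code points that are not Unicode combining marks (Mn, Mc, Me),
    // so a consonant plus its vowel sign or pulli is counted once.
    static int countLetters(String word) {
        int count = 0;
        for (int i = 0; i < word.length(); ) {
            int codePoint = word.codePointAt(i);
            int type = Character.getType(codePoint);
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (!isMark) {
                count++;
            }
            i += Character.charCount(codePoint);
        }
        return count;
    }

    // Returns the whitespace-separated words whose letter count equals wanted.
    static List<String> wordsWithLetterCount(String text, int wanted) {
        List<String> result = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && countLetters(word) == wanted) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countLetters("தமிழ்"));
        System.out.println(wordsWithLetterCount("தமிழ் குமார் அம்மா நான்", 3));
    }
}
```

For a file, `wordsWithLetterCount` could be called on each line read with a UTF-8 reader, writing the returned words to the output file.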

answered Nov 20 '16 at 19:40

Neechalkaran

Hi, thanks for your reply. It is good for a single word, but how do I use a file here with multiple Unicode words, where we want to get the three-character words? – Surya Nov 21 '16 at 14:37
The above code works the same way for all Tamil letters in a file. Use the same code to count all the Tamil letters in a string – Neechalkaran Jan 03 '17 at 15:23
