2

In the code below I try to read words of a specific length from a text file and write them to another file. This works well for English characters, but for Tamil letters the count comes out wrong.

For example, தமிழ் is counted as 5: "த", "ம", "ி", "ழ" and "்". I want it to be counted as 3: "த", "மி" and "ழ்".

I want to apply this logic to multiple words read from a text file.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class ii {

    public static void main(String[] args) {
        // Length of the words that should be copied to the output file.
        final int targetLength = 2;

        // try-with-resources closes both files even if an exception is thrown.
        try (BufferedReader br = new BufferedReader(new FileReader("F:\\New folder (2)\\N.txt"));
             BufferedWriter bw = new BufferedWriter(new FileWriter("F:\\New folder (2)\\o.txt"))) {
            String s;
            while ((s = br.readLine()) != null) {
                for (String word : s.split(" ")) {
                    // length() counts UTF-16 code units, so "தமிழ்" gives 5 here, not 3.
                    if (word.length() == targetLength) {
                        bw.write(word);
                        bw.newLine();
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Surya
    Possible duplicate. http://stackoverflow.com/questions/15947992/java-unicode-string-length – hamnix Sep 16 '16 at 07:54

4 Answers

1

One way is to iterate through the characters with a BreakIterator, and count them yourself. (untested code)

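import java.text.BreakIterator;

int characterCount = 0;
BreakIterator iterator = BreakIterator.getCharacterInstance();
iterator.setText("தமிழ்");
iterator.first();

// Each successful next() steps over one character, so count the steps,
// not the boundaries (there is always one more boundary than characters).
while (iterator.next() != BreakIterator.DONE) {
    characterCount++;
}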

See also http://docs.oracle.com/javase/tutorial/i18n/text/char.html
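To go from one hard-coded string to many words, the loop can be wrapped in a helper and called once per word. The following is only a minimal sketch: the class name `CharacterCounter` and the methods `countCharacters` and `countPerWord` are made up for illustration, and the sample text uses a Latin letter followed by a combining accent (`e\u0301`) rather than Tamil. How a particular Tamil word is segmented depends on the break rules bundled with the Java runtime, so the result should be checked on the JDK in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class CharacterCounter {

    // Hypothetical helper: counts the characters BreakIterator reports,
    // i.e. the number of steps between consecutive boundaries.
    static int countCharacters(String text) {
        BreakIterator iterator = BreakIterator.getCharacterInstance();
        iterator.setText(text);
        iterator.first();
        int count = 0;
        while (iterator.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Hypothetical helper: splits a line on whitespace and counts each word.
    static List<Integer> countPerWord(String line) {
        List<Integer> counts = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.add(countCharacters(word));
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "e\u0301" is the letter e followed by a combining acute accent.
        System.out.println(countPerWord("re\u0301sume\u0301 abc"));
    }
}
```

Each line returned by `readLine()` can be passed to `countPerWord`, and a word is kept when its count equals the wanted length.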

Joni
  • But instead of one specific string, I want to count the characters of multiple words from a file – Surya Sep 16 '16 at 08:09
  • That is what BreakIterator does – Joni Sep 16 '16 at 08:10
  • This gives a count of 36 for the 3-character Tamil string "குமார்" (from [this question](https://stackoverflow.com/questions/15947992/java-unicode-string-length)). It gave the same result when I tried `getCharacterInstance(new Locale("ta", "IN"))` ([mentioned here](https://stackoverflow.com/questions/17292575/how-to-get-all-java-supported-locales)) – Joshua Goldberg Jun 29 '20 at 18:34
0

This happens because of an encoding problem. First, change the text file encoding of your Java project with the following steps:

Right-click the project name => Properties => Resource => Text file encoding => choose Other and select UTF-8 as the encoding.

This will resolve your issue.

KAmit
0

Notepad doesn't support UTF characters by default; it uses ANSI instead. However, your problem is not due to this.

Your program must know which encoding to use when reading or writing. There is no magic: you need to set the encoding explicitly (e.g. UTF-8). The FileReader constructor uses the platform default encoding, which clearly won't work for you.

I guess you need:

Reader reader = new InputStreamReader(new FileInputStream("c:/foo.txt"), "UTF-8");

Read file and write file which has characters in UTF - 8 (different language)
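As a rough sketch of the same idea applied to both reading and writing, the snippet below passes `StandardCharsets.UTF_8` to an `InputStreamReader` and an `OutputStreamWriter`. The class name `Utf8Files`, the methods `readLines` and `writeLines`, and the temporary file are assumptions made for illustration; the Tamil word is written with `\u` escapes so the example does not depend on the encoding of the source file itself.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Utf8Files {

    // Hypothetical helper: reads all lines of a file, decoding it as UTF-8.
    static List<String> readLines(File file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Hypothetical helper: writes the lines to a file, encoding them as UTF-8.
    static void writeLines(File file, List<String> lines) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real input and output paths.
        File temp = File.createTempFile("utf8-demo", ".txt");
        temp.deleteOnExit();
        // The first entry is the Tamil word from the question, as \u escapes.
        List<String> words = Arrays.asList("\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD", "abc");
        writeLines(temp, words);
        System.out.println(readLines(temp).equals(words));
    }
}
```

Fixing the encoding makes the text round-trip intact, but `length()` on the Tamil word still returns 5; the counting itself needs one of the approaches from the other answers.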

Sanjeev Dhiman
0

This happens because String.length() counts Unicode marks as well as Unicode letters. To ignore the marks, you can use a regular expression as below:

import java.util.regex.*;
 ......
String word = "தமிழ்";
// Match every character except the Tamil vowel signs and pulli (U+0BBE to U+0BCD) and '.'.
String regex = "[^\u0bbe-\u0bcd.]";
Pattern r = Pattern.compile(regex);
Matcher m = r.matcher(word);
int count = 0;
while (m.find()) count++;
System.out.print(count);
Neechalkaran
  • Hi, thanks for your reply. It works for a single word, but how do I use it with a file containing multiple Unicode words, where we want to get the three-character words? – Surya Nov 21 '16 at 14:37
  • The above code works the same way for all Tamil letters in a file. Use the same approach and count the Tamil letters in each string – Neechalkaran Jan 03 '17 at 15:23
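One way to extend the mark-ignoring idea to many words is sketched below. It is an illustration only, not the answerer's code: it swaps the Tamil-specific range for the general Unicode category `\p{M}` (all combining marks), and the class name `TamilWordFilter` and the methods `letterCount` and `wordsWithLetterCount` are hypothetical. The Tamil sample words are written as `\u` escapes so the result does not depend on the source-file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TamilWordFilter {

    // Assumption: \p{M} (any combining mark) replaces the Tamil-only range;
    // it covers the Tamil vowel signs and the pulli, plus marks of other scripts.
    private static final Pattern MARKS = Pattern.compile("\\p{M}");

    // Hypothetical helper: counts the code points left after removing marks.
    static int letterCount(String word) {
        String base = MARKS.matcher(word).replaceAll("");
        return base.codePointCount(0, base.length());
    }

    // Hypothetical helper: returns the whitespace-separated words of the text
    // whose letter count equals n.
    static List<String> wordsWithLetterCount(String text, int n) {
        List<String> result = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && letterCount(word) == n) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Three Tamil words (the two from this thread plus a two-letter one)
        // and one English word, separated by spaces.
        String text = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD "
                + "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD "
                + "\u0BA8\u0BBE\u0BA9\u0BCD cat";
        for (String word : wordsWithLetterCount(text, 3)) {
            System.out.println(word);
        }
    }
}
```

Calling `wordsWithLetterCount` on each line read from the file (opened with an explicit UTF-8 reader) and writing the returned words to the output file gives the three-letter filter asked about in the comment.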