5

I am trying to read Unicode characters from a text file saved in UTF-8 using Java. My text file is as follows:

अ, अदेबानि ,अन, अनसुला, अनसुलि, अनफावरि, अनजालु, अनद्ला, अमा, अर, अरगा, अरगे, अरन, अराय, अलखद, असे, अहा, अहिंसा, अग्रं, अन्थाइ, अफ्रि, बियन, खियन, फियन, बन, गन, थन, हर, हम, जम, गल, गथ, दरसे, दरनै, थनै, थथाम, सथाम, खफ, गल, गथ, मिख, जथ, जाथ, थाथ, दद, देख, न, नेथ, बर, बुंथ, बिथ, बिख, बेल, मम, आ, आइ, आउ, आगदा, आगसिर

I have tried the following code:

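import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class UcharRead
{
    public static void main(String[] args)
    {
        // Decode the file explicitly as UTF-8; try-with-resources closes the reader.
        try (BufferedReader bufReader = new BufferedReader(
                new InputStreamReader(new FileInputStream("research_words.txt"), "UTF-8")))
        {
            String str;
            while ((str = bufReader.readLine()) != null)
            {
                System.out.println(str);
            }
        }
        catch (IOException e)
        {
            // Report failures instead of silently swallowing them.
            e.printStackTrace();
        }
    }
}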

I am getting output as ????????????????????????. Can anyone help me?

Naveen Kumar Alone
purnendu

3 Answers

9

You are (most likely) reading the text correctly, but when you write it out, you also need to enable UTF-8. Otherwise every character that cannot be printed in your default encoding will be turned into question marks.

Try writing it to a File instead of System.out (and specify the proper encoding):

Writer w = new OutputStreamWriter(
   new FileOutputStream("x.txt"), "UTF-8");
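
A fuller version of that idea might look like the sketch below. It assumes Java 7 or later and reuses the file names from the question and the snippet above; the class name `Utf8Copy` and the helper `copyUtf8` are made up for illustration. It reads the input as UTF-8 and writes every line to a second file, also as UTF-8, so the result can be opened in a Unicode-aware editor.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8Copy {
    // Hypothetical helper: copies a text file line by line,
    // decoding the input and encoding the output as UTF-8.
    static void copyUtf8(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine(); // platform line separator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        copyUtf8(Paths.get("research_words.txt"), Paths.get("x.txt"));
    }
}
```

If x.txt shows the Devanagari text correctly in an editor, the reading side is fine and only the console output needs attention.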
Thilo
6

If you are reading the text properly using UTF-8 encoding, then make sure that your console also supports UTF-8. If you are using Eclipse, you can enable UTF-8 encoding for your console via:

Run Configurations -> Common -> Encoding -> select UTF-8

Juned Ahsan
5

You're reading it correctly - the problem is almost certainly just that your console can't handle the text. The simplest way to verify this is to print out each char within the string. For example:

public static void dumpString(String text) {
    for (int i = 0; i < text.length(); i++) {
        char c = text.charAt(i);
        System.out.printf("%c - %04x\n", c, (int) c);
    }
}

You can then verify that each character is correct using the Unicode code charts.
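
As a rough automated stand-in for checking the charts by hand, the sketch below tests whether every character belongs to the Devanagari block, which is what the sample file in the question appears to contain. The class and method names are made up for illustration, and treating only commas and spaces as allowed separators is an assumption based on that sample.

```java
public class BlockCheck {
    // Hypothetical helper: returns true if every char is in the Devanagari
    // block, or is a comma or space (the separators in the sample file).
    static boolean looksLikeDevanagari(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean devanagari =
                Character.UnicodeBlock.of(c) == Character.UnicodeBlock.DEVANAGARI;
            boolean separator = c == ',' || c == ' ';
            if (!devanagari && !separator) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // U+0905 U+092E U+093E, written as escapes so the source file's
        // own encoding does not matter.
        System.out.println(looksLikeDevanagari("\u0905\u092e\u093e"));
    }
}
```

A line decoded with the wrong charset would typically contain Latin-1 or replacement characters and fail this check, as would a leading byte order mark (U+FEFF) if the editor that saved the file added one.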

Once you've verified that you're reading the file correctly, you can then work on the output side of things - but it's important to try to focus on one side of it at a time. Trying to diagnose potential failures in both input and output encodings at the same time is very hard.
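
For the output side, one possible approach (a sketch, not the only fix) is to wrap System.out in a PrintStream that encodes as UTF-8. The names `Utf8Console` and `utf8Stream` are made up for illustration, and this only helps if the terminal or IDE console itself interprets the bytes as UTF-8 and has a font with the required glyphs.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    // Hypothetical helper: wraps any OutputStream in an auto-flushing
    // PrintStream that encodes text as UTF-8.
    static PrintStream utf8Stream(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8Stream(System.out);
        // U+0905 U+092E U+093E, written as escapes so the source file's
        // own encoding does not matter.
        out.println("\u0905\u092e\u093e");
    }
}
```

If question marks still appear with this stream, the remaining problem is likely in the console's encoding or font rather than in the Java code.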

Jon Skeet
  • I think instead of dumping a unicode sheet, it'd be better if he switched to a unicode-compatible terminal – Khaled.K Sep 11 '13 at 06:09
  • @KhaledAKhunaifer: You've missed my point - which is separating out potential input issues from potential output issues. Yes, *after validating that the input is being read correctly* he should look at fixing the output... but when diagnosing encoding issues, it's important to work out exactly where the first problem occurs. Validating that the data is being read correctly is the first step, IMO. – Jon Skeet Sep 11 '13 at 06:12