Urdu file reading in Java

Question

I am trying to read a file that contains Urdu text. When I view the file in Notepad++, the Urdu text displays correctly. But when I view it in Eclipse, it shows garbled characters (maybe Eclipse is applying some default encoding).

Original Urdu Data (Notepad++):

"10","کراچی میں ٹماٹر کی قیمت میں کمی،25روپے فی کلو ہوگیا","Entertainment"

In eclipse:

"10","Ú©Ø±Ø§Ú†ÛŒ Ù…ÛŒÚº Ù¹Ù…Ø§Ù¹Ø± Ú©ÛŒ Ù‚ÛŒÙ…Øª Ù…ÛŒÚº Ú©Ù…ÛŒØŒ25Ø±ÙˆÙ¾Û’ Ù�ÛŒ Ú©Ù„Ùˆ Û�ÙˆÚ¯ÛŒØ§","Entertainment"

This is strange: some default encoding seems to be applied. Is there any way to get the data in its original form, so that after I process it and write it to a file, the output is in the original Urdu rather than in this garbled form?

Here is the code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DataProcessing {

    /**
     * @param args
     */
    public static void main(String[] args) {
        DataProcessing dataProcessingObj = new DataProcessing();
        dataProcessingObj.readDataFromFile("small_dataset.txt");
    }

    private void readDataFromFile(String fileName)
    {
        // FileReader decodes the file with the platform default encoding.
        try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
            String line;
            while( (line = br.readLine()) != null )
            {
                System.out.println(line);
            }
        }
        catch(IOException ex){
            ex.printStackTrace();
        }
    }
}

If you can help me I will be thankful to you.

In what character encoding is the file saved? How are you reading the content of the file and displaying it in your application? — Jesper, Oct 27 '16 at 18:50
@Jesper How can I know in which character encoding my file is saved? — Hammad Hassan, Oct 27 '16 at 19:03
Your code is reading the text file using the default character encoding of your system. If the file is actually encoded in a different character encoding, you'll get wrong results. Instead of `FileReader` use `new InputStreamReader(new FileInputStream(fileName), "UTF-8")` (if the file is encoded in UTF-8; if it's something else, use the appropriate character set name). — Jesper, Oct 27 '16 at 19:05
When you open the file in Notepad++, look at the lower right corner of the window, it shows what Notepad++ thinks the encoding of the file is. — Jesper, Oct 27 '16 at 19:06
@Jesper I have now tried a BufferedReader with an explicit encoding, but now I get this: "10","????? ??? ????? ?? ???? ??? ????25???? ?? ??? ?????","Entertainment" I am still unable to get the Urdu text. — Hammad Hassan, Oct 27 '16 at 19:09
You are printing the output to the console using `System.out.println()`. Are you sure that the console is using a font that can display Urdu characters? If the font doesn't have these characters, you get question marks instead. — Jesper, Oct 27 '16 at 20:00
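
A note on the question marks: Java's output encoders typically substitute `?` for any character the target charset cannot represent, so a console charset without Urdu letters gives the same symptom as a missing font. The following is a minimal sketch, assuming the console itself can render UTF-8; the class name `Utf8Console` and the helper `utf8` are made up for illustration and are not from this thread.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {

    // Wraps a stream in a PrintStream that encodes text as UTF-8 (autoflush on).
    public static PrintStream utf8(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: the terminal or IDE console is configured to display UTF-8.
        PrintStream out = utf8(System.out);
        // The Urdu word for "Karachi", written with escapes so the source file
        // encoding does not matter.
        out.println("\u06A9\u0631\u0627\u0686\u06CC");
    }
}
```

In Eclipse the console encoding can usually be changed in the run configuration (Common tab). Whether the console charset or the font is the actual cause here is not confirmed in this thread.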

score 1 · Answer 1 · answered Oct 27 '16 at 19:10

Do not use FileReader/FileWriter as they are old utility classes using the default platform encoding. You want to specify the encoding, either UTF-8 or Windows-1256. (Notepad++ will show the right encoding.)

private void readDataFromFile(String fileName)
{
    Path path = Paths.get(fileName);
    Charset charset = StandardCharsets.UTF_8;
    try (BufferedReader br = Files.newBufferedReader(path, charset)) {
        String line;
        while( (line = br.readLine()) != null )
        {
            System.out.println(line);
        }
    }
    catch(Exception ex) {
        ex.printStackTrace();
    }
}

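Or in Java 8:

private void readDataFromFile(String fileName) throws IOException
{
    Path path = Paths.get(fileName);
    // Arabic-script Windows code page; use StandardCharsets.UTF_8 if the file is UTF-8.
    Charset charset = Charset.forName("windows-1256");
    // Files.lines keeps the file open, so close the stream with try-with-resources.
    try (Stream<String> lines = Files.lines(path, charset)) {
        lines.forEach(System.out::println);
    }
}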
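
If it is unclear which of the two charsets applies, one rough check is to decode the raw bytes strictly as UTF-8 and see whether that fails. This is only a sketch: the class name `Utf8Check` and the method `isValidUtf8` are hypothetical, and a successful decode makes UTF-8 very likely but does not prove it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Strict decoding failed, so the file is in some other encoding.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // File name taken from the question; pass another path as the first argument.
        String fileName = args.length > 0 ? args[0] : "small_dataset.txt";
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        System.out.println(isValidUtf8(bytes) ? "well-formed UTF-8" : "not UTF-8");
    }
}
```

Text in a single-byte code page such as Windows-1256 rarely happens to be well-formed UTF-8 once it contains non-ASCII letters, which is why this check is a useful first guess alongside what Notepad++ reports.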

I am unable to get my original Urdu text. It shows the following when I print it: "10","????? ??? ????? ?? ???? ??? ????25???? ?? ??? ?????","Entertainment" — Hammad Hassan, Oct 27 '16 at 19:17
System.out then cannot display the encoding. Write to a file instead, using UTF-8. — Joop Eggen, Oct 27 '16 at 21:58
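
The suggestion to write to a file could look roughly like the sketch below, which reads each line as UTF-8 and writes it back out as UTF-8. It assumes the input really is UTF-8; the class name `Utf8Copy`, the method `copyLines`, and the output name `processed_dataset.txt` are invented for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8Copy {

    // Reads every line of 'in' as UTF-8 and writes it to 'out' as UTF-8.
    public static void copyLines(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Any per-line processing would go here.
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output file name; the input name comes from the question.
        copyLines(Paths.get("small_dataset.txt"), Paths.get("processed_dataset.txt"));
    }
}
```

Opening the output file in Notepad++ then shows whether the Urdu text survived the round trip, independently of what the console can display.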
