Scenario: I want to read an Arabic dataset encoded in UTF-8. The words in each line are separated by spaces.


Problem: When I read each line, the output is:

??????? ?? ???? ?? ???


Question: How can I read the file and print each line correctly? For more information, here is my Arabic dataset, and the part of my source code that reads the data looks like this:

private ContextCountsImpl extractContextCounts(Map<Integer, String> phraseMap) throws IOException {
        Reader reader;
        reader = new InputStreamReader(new FileInputStream(inputFile), "utf-8");
        BufferedReader rdr = new BufferedReader(reader);
        while (rdr.ready()) {
            String line = rdr.readLine();
            System.out.println(line);
            List<String> phrases = splitLineInPhrases(line);
            //any process on this file
        }
}
MeirDayan
  • Have a look at [Arabic text](https://stackoverflow.com/questions/2996475/what-character-encoding-should-i-use-for-a-web-page-containing-mostly-arabic-tex). You should identify the file encoding; it may be `UTF-16`. – Butiri Dan Jun 20 '19 at 10:42
  • Actually, it works for me on the provided dataset. – mslowiak Jun 20 '19 at 10:45
  • @mq007 Do you mean you can read this dataset with this code? – MeirDayan Jun 20 '19 at 10:54
  • @CommunityAns yes – mslowiak Jun 20 '19 at 10:58
  • Your code also works for me (with minor necessary corrections to allow compilation). Your problem might be that the font used for rendering the output does not support Arabic. I tested with the output font set to Monospaced, Arial, Times New Roman and Courier New, and they all worked. Consolas did not work. What is the font you are using to display the Arabic text? – skomisa Jun 30 '19 at 14:25

1 Answer

I can read the file using UTF-8. Can you try it like this?

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadArabic {
    public static void main(String[] args) {
        // try-with-resources closes the reader and the underlying stream automatically
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream("arabic.txt"), StandardCharsets.UTF_8))) {
            String line;
            // readLine() returns null once the end of the file is reached
            while ((line = bufferedReader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            System.err.println(e.getMessage()); // report read errors
        }
    }
}
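
If the file really is UTF-8 and the lines are still shown as `?`, the characters may be getting lost when they are printed rather than when they are read. `System.out` encodes text with the platform or console charset, and a charset that cannot represent Arabic substitutes `?` for each character. The sketch below is only an illustration under that assumption: the class name `PrintArabicUtf8` and the helpers `roundTrip` and `printLines` are made up for this example, and `arabic.txt` is the same placeholder file name as above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PrintArabicUtf8 {

    // Hypothetical helper: encodes the text with the given charset and decodes it again.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Hypothetical helper: reads the file as UTF-8 and prints each line to the given stream.
    static void printLines(Path file, PrintStream out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Wrap System.out so that output is encoded as UTF-8 instead of the platform default.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        printLines(Paths.get("arabic.txt"), utf8Out);
    }
}
```

For example, `roundTrip` applied to an Arabic word with `StandardCharsets.US_ASCII` returns only question marks, while `StandardCharsets.UTF_8` returns the word unchanged. The console or IDE output window must also be set to UTF-8 and use a font that contains Arabic glyphs, otherwise the text can still be displayed incorrectly.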

Muhammad Usman