Can't read the first word of an Arabic text file

Question

I'm reading an Arabic text file with Scanner and storing its words in an ArrayList.

I also have a dictionary file that contains positive and negative words with their ratings.

For example: سعيد +5, سيء -4

Then I check each word of the text file against the dictionary: if the word is negative I increase the negative counter, and if it is positive I increase the positive counter. Finally I compare the two to determine whether the file is positive or negative overall.

This works perfectly for English but not for Arabic. For some reason it skips the first word in the list, even when that word is in the dictionary as an exact match. If I press Enter at the beginning of the text file (so that it starts with a new line), it works perfectly. I tried adding a new line to the ArrayList and to the files programmatically as an alternative, but that doesn't work; the new line has to be added by pressing the Enter key.

 for (String word : wordsList) { // loop through the words of the user file

  try { // compare each word with the dictionary

   String line;
   // read from the Dictionary file
   File fileDir = new File("C:\\Users\\Ameera\\Desktop\\Dictionary.txt");
   BufferedReader inDict = new BufferedReader(new InputStreamReader(
     new FileInputStream(fileDir), "utf-8"));

   while ((line = inDict.readLine()) != null) {

    // Split the dictionary line at each tab to get the word without its rate
    String[] strSplit = line.split("\t");
    // example: will get (سعيد, سيء) only
    /* سعيد    +5
       سيء         -4
    */

    if (strSplit[0].equals(word)) {

     int rate2 = Integer.parseInt(strSplit[1]); // get the word's rate

     sent += rate2; // add the word's rate to the file's total rate

    }

   }
  } catch (Exception e) {
   e.printStackTrace();
  }
 }

Use `BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("filePath"), "UTF-8"))`. See https://stackoverflow.com/a/11377816/6743203 — Jay Smith, Jul 03 '17 at 21:00
@JaySmith is right, I guess. Your code will read the file in platform encoding (most probably not UTF-8, if you're on Windows, e.g.) and your text file might be in UTF-8 and contain a byte order mark, which might interfere with the parsing of the first line. — xmjx, Jul 03 '17 at 21:17
Thanks for your response. Actually, I had tried that before, and my files are encoded as UTF-8, but I still don't know why it skips the first word and only works when I press Enter on the first line. Here is what I have: BufferedReader inDict = new BufferedReader( new InputStreamReader(new FileInputStream(fileDir2), "UTF-8")); — Qubayl, Jul 04 '17 at 02:18
So have you ever tried to run your code using a debugger and inspect what's retrieved from the first line? — Adrian Shum, Jul 04 '17 at 04:11
No. I tried after I read your comment, but I really didn't know how to find the first line in debugging mode. — Qubayl, Jul 04 '17 at 15:48
Find a tutorial on debugging in your IDE, and follow it. It's time very well spent. — slim, Jul 05 '17 at 16:14
@Qubayl Check if your file contains a byte order mark. If so, remove it. Furthermore, can we see the piece of code where you read from the scanner? — MC Emperor, Jul 05 '17 at 16:15

score 0 · Accepted Answer · answered Jul 05 '17 at 19:12

Thanks, guys, I really appreciate your responses. I found the answer here: "Removing BOM characters using Java". @MC Emperor, thanks a lot; the problem was caused by a byte order mark.
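
For anyone landing here with the same symptom, this is a minimal sketch of one way to drop the byte order mark, not necessarily the exact code from the linked answer. Java's UTF-8 decoder leaves the BOM in place as the character U+FEFF, so the first word is read as "\uFEFF" plus the word and never equals the dictionary entry. That would also explain why pressing Enter helped: the BOM then sits alone on an otherwise empty first line. The class and method names (`BomStripper`, `stripBom`, `readLines`) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original code.
class BomStripper {
    // U+FEFF is the character that the UTF-8 BOM bytes EF BB BF decode to.
    static final char BOM = '\uFEFF';

    /** Returns the text without a leading BOM, or unchanged if there is none. */
    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    /** Reads a UTF-8 file into lines, removing a BOM from the first line only. */
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                lines.add(first ? stripBom(line) : line);
                first = false;
            }
        }
        return lines;
    }
}
```

The same `stripBom` call can be applied to the first token returned by a Scanner, or to the first line of the dictionary file, before any `equals` comparison. Saving the files as "UTF-8 without BOM" in the text editor avoids the problem at the source.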

Qubayl
