
I am having a little problem with the UTF-8 charset. I have a UTF-8 encoded file which I want to load and analyze. I am using BufferedReader to read the file line by line.

BufferedReader buffReader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "UTF-8"));

My problem is that the normal String methods (trim() and equals(), for example) do not behave as expected on the lines read from the BufferedReader in each iteration of the loop I created to read the whole file. For example, the encoded file contains < menu >, which I want my program to treat as-is; instead it is seen as ?? < m e n u > mixed with some other strange characters. I want to know if there is a way to remove all the charset artifacts and keep just the plain text, so I can use all the methods of the String class without complications. Thank you.
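(For anyone hitting the same symptom: a minimal sketch, assuming the file starts with a UTF-8 byte order mark. `BomDemo` is a made-up name; it just shows why a leading U+FEFF would make `equals()` and `trim()` misbehave exactly as described.)

```java
// Hypothetical demo: a leading U+FEFF (BOM) survives trim() and breaks equals(),
// even though the line often looks identical when printed.
public class BomDemo {
    public static void main(String[] args) {
        String fromFile = "\uFEFF<menu>"; // what a BOM-prefixed line decodes to
        System.out.println(fromFile.equals("<menu>"));           // false: BOM is still there
        System.out.println(fromFile.trim().equals("<menu>"));    // false: trim() only strips chars <= U+0020
        System.out.println(fromFile.substring(1).equals("<menu>")); // true: BOM removed
    }
}
```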

Vinay Prajapati
Youssef
    I doubt the input file is just `< menu >` — I'm guessing there are other characters at the front. If the file is truly UTF-8 then your code should be fine. – MeBigFatGuy Apr 30 '11 at 17:21
  • 1
    `trim()` and `equals()` work the same regardless of where the String came from. I suggest you look at what your program is doing in a debugger to see what is really going on. – Peter Lawrey Apr 30 '11 at 17:33
  • 8
    Your UTF-8 file may contain a BOM, and Java can't handle that. See this related question: http://stackoverflow.com/questions/4897876/reading-utf-8-bom-marker – RealHowTo Apr 30 '11 at 18:26
  • +1 for the [byte order mark (BOM)](https://secure.wikimedia.org/wikipedia/en/wiki/Byte_order_mark#UTF-8). Is `` the root tag of your file and the special characters appear there? – Axel Knauf Apr 30 '11 at 18:59
  • Never use BOMs with UTF-8. They are neither required nor recommended. They are a very bad idea, actually a form of Microsoft bug. – tchrist Apr 30 '11 at 22:44
  • Also, unless you call the other form of the constructor, the one that reads `InputStreamReader(InputStream in, CharsetDecoder dec)`, you will not be notified of encoding errors. – tchrist Apr 30 '11 at 22:45
    Thank you for the replies, I just solved the problem by using a non-BOM UTF-8 encoded file. – Youssef May 01 '11 at 01:36
  • btw, there is no such thing as "plain text" without character encoding. All characters are always encoded - even ASCII. They are stored in the end as binary bits. Reading it with the wrong encoding will give undefined results, such as reading a character array out of an integer in C++. Thankfully in Java, you can't do such extreme nonsense. – Alan Escreet May 18 '11 at 14:19
  • Try changing to UTF-16. – Abrar Malekji Jan 27 '21 at 08:12

1 Answer


If your JDK is not too old (1.5 or later), you can do it like this:

Locale frLocale = new Locale("fr", "FR");
Scanner scanner = new Scanner(new FileInputStream(file), "UTF-8");
scanner.useLocale(frLocale);

int numLine = 0;
String line;
for (; scanner.hasNextLine(); numLine++) {
    line = scanner.nextLine();
    // process line here
}
scanner.close();

The scanner can also use delimiters other than whitespace. This example reads several items in from a string:

String input = "1 fish 2 fish red fish blue fish";
Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");
System.out.println(s.nextInt());
System.out.println(s.nextInt());
System.out.println(s.next());
System.out.println(s.next());
s.close();

prints the following output:

1
2
red
blue

See the java.util.Scanner Javadoc for details.
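Since the root cause here turned out to be a byte order mark, here is a sketch of stripping a leading BOM yourself instead of re-saving the file. This is not part of any standard API — `BomStrip` and `open` are made-up names, and it assumes the file is otherwise valid UTF-8 (Java 7 try-with-resources syntax):

```java
import java.io.*;

public class BomStrip {
    // Opens a UTF-8 file and skips a leading byte order mark (U+FEFF),
    // which java.io readers do NOT remove automatically.
    static BufferedReader open(File file) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"));
        reader.mark(1);
        int first = reader.read();
        if (first != 0xFEFF) { // no BOM: rewind to the start
            reader.reset();
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // Write a small file that starts with a BOM, then read it back.
        File tmp = File.createTempFile("bom", ".txt");
        tmp.deleteOnExit();
        Writer w = new OutputStreamWriter(new FileOutputStream(tmp), "UTF-8");
        try {
            w.write('\uFEFF');
            w.write("<menu>");
        } finally {
            w.close();
        }
        BufferedReader r = open(tmp);
        try {
            String line = r.readLine();
            System.out.println(line.equals("<menu>")); // true once the BOM is gone
        } finally {
            r.close();
        }
    }
}
```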

EricParis16