I am trying to read a UTF-8 text file and then compare its content with equals(), which should return true. But it does not, because getBytes() returns different values.

This is a minimal example:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class Main {
  public static void main(String[] args) throws Exception {
    System.out.println(Charset.defaultCharset()); // UTF-8
    InputStream is = new FileInputStream("./myUTF8File.txt");
    BufferedReader in = new BufferedReader(new InputStreamReader(is, "UTF8"));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.print(line); // mouseover
      byte[] bytes = line.getBytes(); // [-17, -69, -65, 109, 111, 117, 115, 101, 111, 118, 101, 114]
      String str = "mouseover";
      byte[] bytesStr = str.getBytes(); // [109, 111, 117, 115, 101, 111, 118, 101, 114]
      if (line.equals(str)) { // false
        System.out.println("equal");
      }
    }
  }
}

I would expect the String to be converted to UTF-16 at in.readLine() and equals() to return true. I cannot figure out why it does not.

Christian
  • Also: don't use `getBytes()` like this, it uses the platform default encoding and that's just a plain bad idea (most of the time). – Joachim Sauer Oct 07 '13 at 14:56

1 Answer

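The first three bytes of the file:

-17, -69, -65

are the UTF-8 encoding of the BOM (byte order mark, EF BB BF in hex). Java's UTF-8 decoder does not discard it, so it ends up at the start of your first line as the invisible character U+FEFF, and equals() returns false. Lined up, your two arrays differ only by that prefix: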

[-17, -69, -65, 109, 111, 117, 115, 101, 111, 118, 101, 114]
               [109, 111, 117, 115, 101, 111, 118, 101, 114]

Also, the canonical name of the charset is "UTF-8" (note the dash); Java accepts "UTF8" as an alias:

BufferedReader in = new BufferedReader(new InputStreamReader(is, "UTF-8"));
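
Changing the charset name alone does not remove the BOM character from the decoded text. Below is a minimal sketch of one way to drop it at the String level, assuming the BOM can only appear at the very start of the file. `BomStrip` and `stripBom` are hypothetical names introduced for illustration, not part of the JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class BomStrip {
    // U+FEFF is the character the UTF-8 decoder produces for the bytes EF BB BF.
    private static final char BOM = '\uFEFF';

    // Hypothetical helper: drops a single leading BOM character, if present.
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) throws IOException {
        // File name taken from the question; pass another path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "./myUTF8File.txt");
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = in.readLine()) != null) {
                if (first) {
                    // Only the first line of the file can carry the BOM.
                    line = stripBom(line);
                    first = false;
                }
                System.out.println(line.equals("mouseover") ? "equal" : "not equal");
            }
        }
    }
}
```

Passing `StandardCharsets.UTF_8` instead of a charset name also avoids the checked `UnsupportedEncodingException` and any doubt about spelling.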
ppeterka
  • With that in mind, I discovered a similar thread: http://stackoverflow.com/questions/9736999/how-to-remove-bom-from-an-xml-file-in-java – Christian Oct 07 '13 at 15:39
  • @Chris How does that help here? OP does not want to deal with the byte[]'s, just the Strings. And the proper charset declaration takes care of that... – ppeterka Oct 07 '13 at 15:42
  • No, the proper charset declaration does not help. I used a similar version of the "checkForUtf8BOMAndDiscardIfAny"-Method to make it work. – Christian Oct 08 '13 at 15:43
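
The byte-level approach mentioned in the last comment can be sketched roughly as follows. This is not the code from the linked thread, only an illustration of the same idea under the assumption that the input is UTF-8. `BomSkippingStreams` and `skipUtf8Bom` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

class BomSkippingStreams {
    // Hypothetical helper: returns a stream positioned after a leading
    // UTF-8 BOM (EF BB BF), or at the very start if there is none.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 or EOF.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!isBom && n > 0) {
            // Not a BOM: push the bytes back so the caller sees them again.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // Same bytes as in the question: BOM followed by "mouseover".
        byte[] raw = {-17, -69, -65, 109, 111, 117, 115, 101, 111, 118, 101, 114};
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(raw)), StandardCharsets.UTF_8))) {
            System.out.println("mouseover".equals(in.readLine())); // prints "true"
        }
    }
}
```

Wrapping the `FileInputStream` from the question in `skipUtf8Bom(...)` before handing it to the `InputStreamReader` would leave the rest of the reading loop unchanged.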