I have this example. It reads a line "hello" from a file saved as UTF-8. Here is my question:

Strings are stored in Java in UTF-16 format. So when it reads the line "hello", it converts it to UTF-16. So the string s is in UTF-16 with a UTF-16 BOM... Am I right?

FileReader filereader = new FileReader(file);
BufferedReader read = new BufferedReader(filereader);
String s = null;
while ((s = read.readLine()) != null)
{
    System.out.println(s);
}

So when I do this:

s = s.replace("\uFEFF", "A");

nothing happens. Should the above find and replace the UTF-16 BOM? Or is the string ultimately in UTF-8 format? I am a little bit confused about this.

Thank you

Tommaso Belluzzo
Nick
  • The BOM (if present) is really metadata, not payload, so you shouldn't expect it to appear in the resulting file content. – seand Nov 11 '17 at 23:08

1 Answer

Try using the Apache Commons IO library and its class org.apache.commons.io.input.BOMInputStream to get rid of this kind of problem.

Example:

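import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(file);

try
{
    // With this constructor, BOMInputStream detects a UTF-8 BOM and excludes it from the data.
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    // Fall back to the default encoding when the file has no BOM.
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();

    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // your code...
}
finally
{
    inputStream.close();
}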
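
If adding a dependency is not an option, something similar can be sketched with the JDK alone. This assumes the file is UTF-8 and relies on Java's UTF-8 decoder passing a leading BOM through as the character U+FEFF. The class name SkipUtf8Bom and the helper openSkippingBom are made up for this illustration, and the byte array only stands in for the file:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class SkipUtf8Bom {
    // Hypothetical helper: opens a UTF-8 reader and drops a leading U+FEFF if present.
    static BufferedReader openSkippingBom(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);                  // remember the start position
        if (reader.read() != '\uFEFF') {
            reader.reset();              // first char was real content (or EOF): rewind
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // Simulated file content: UTF-8 BOM (EF BB BF) followed by "hello\n"
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'e', 'l', 'l', 'o', '\n'};
        try (BufferedReader reader = openSkippingBom(new ByteArrayInputStream(data))) {
            String line = reader.readLine();
            System.out.println(line + " (" + line.length() + " chars)"); // hello (5 chars)
        }
    }
}
```

Unlike BOMInputStream, this sketch does not detect the encoding; it only drops the marker for a file already known to be UTF-8.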

As for the BOM itself, as @seand said, it's just metadata used when reading and writing encoded text. It belongs to the encoded bytes, not to the decoded string, so there is normally nothing to replace in the string; you deal with it at the binary level or when re-encoding.

Let's look at a few examples:

String str = "Hadoop";

byte[] bt1 = str.getBytes(); // platform default charset
System.out.println(bt1.length); // 6 (with an ASCII-compatible default charset)

byte[] bt2a = str.getBytes("UTF-16");
System.out.println(bt2a.length); // 14 (2-byte BOM + 12)

byte[] bt2b = str.getBytes("UTF-16BE");
System.out.println(bt2b.length); // 12

byte[] bt3 = str.getBytes("UTF-16LE");
System.out.println(bt3.length); // 12

In the UTF-16 version you get 14 bytes because a BOM is inserted to mark the byte order (Java's UTF-16 encoder writes big-endian data preceded by a BOM). If you specify UTF-16BE or UTF-16LE explicitly, you get 12 bytes because no BOM is added.
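
To see where the two extra bytes come from, one can print the encoded bytes in hex. A minimal sketch: the class name Utf16BomBytes and the hex helper are invented for this example, and a shorter string is used to keep the output readable:

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomBytes {
    // Formats bytes as space-separated hex, e.g. "FE FF 00 48".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "Hi";
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 48 00 69
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_16BE))); // 00 48 00 69
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_16LE))); // 48 00 69 00
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_8)));    // 48 69
    }
}
```

Only the plain "UTF-16" charset emits the FE FF marker; the explicit BE and LE variants, and Java's UTF-8 encoder, write no BOM.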

A simple replace, as you tried, only has an effect if the reader did not strip the BOM and U+FEFF ended up as a character of the string. Otherwise the BOM, if present, is only part of the underlying byte stream; once that stream has been decoded into a Java string in memory, there is no BOM left to manipulate like the characters of the string itself.
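
A sketch of both situations, using in-memory bytes instead of a file. The class name BomInString and the helper stripLeadingBom are hypothetical. It relies on Java's UTF-8 decoder passing a BOM through as U+FEFF, while the "UTF-16" decoder consumes it:

```java
import java.nio.charset.StandardCharsets;

public class BomInString {
    // Hypothetical helper: removes one leading U+FEFF, if any.
    static String stripLeadingBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by "hello"
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'e', 'l', 'l', 'o'};
        String s = new String(utf8WithBom, StandardCharsets.UTF_8);
        System.out.println(s.length());                  // 6: U+FEFF plus "hello"
        System.out.println(stripLeadingBom(s).length()); // 5

        // The UTF-16 encoder writes a BOM, and the UTF-16 decoder consumes it again.
        byte[] utf16WithBom = "hello".getBytes(StandardCharsets.UTF_16);
        String t = new String(utf16WithBom, StandardCharsets.UTF_16);
        System.out.println(t.length());                  // 5: no U+FEFF in the string
    }
}
```

In the first case a replace on "\uFEFF" would find something; in the second there is nothing in the string to find.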

Tommaso Belluzzo
  • This is not what I asked. I asked how Java treats strings in the above example. – Nick Nov 11 '17 at 23:13
  • I expanded my previous post to answer your question. – Tommaso Belluzzo Nov 11 '17 at 23:31
  • I am asking this because of https://stackoverflow.com/questions/4897876/reading-utf-8-bom-marker. See finnw's answer with tmp = tmp.replace("\uFEFF", ""); That is wrong, right? – Nick Nov 11 '17 at 23:53
  • I was pretty sure this was the problem, that's why my first reply included that code to handle the BOM when reading encoded files. To answer your comment... well, it's valid only if the BOM is not handled/stripped by the reader and is instead included in the output string by mistake (because it is considered part of the string itself). – Tommaso Belluzzo Nov 11 '17 at 23:54
  • I remember having the same problem a long time ago while attempting to read an XML file in .NET, which contained a BOM at the beginning of the binary data. The reader wasn't properly handling this and, as a result, the output was a malformed XML file with two bytes at the beginning, right before the XML declaration. – Tommaso Belluzzo Nov 12 '17 at 00:13
  • https://stackoverflow.com/questions/1317700/strip-byte-order-mark-from-string-in-c-sharp Read this, and you will also find many other similar posts. – Tommaso Belluzzo Nov 12 '17 at 00:18