Try using the Apache Commons IO library and its class org.apache.commons.io.input.BOMInputStream to get rid of this kind of problem.
Example:
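// Requires Apache Commons IO: org.apache.commons.io.ByteOrderMark and org.apache.commons.io.input.BOMInputStream
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(file);
try
{
    // This constructor detects only a UTF-8 BOM and skips it when reading.
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    // Fall back to the default encoding when the file has no BOM.
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // your code...
}
finally
{
    inputStream.close();
}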
As for the BOM itself, as @seand said, it's just metadata used when reading, writing, and storing encoded text. It's present in the encoded bytes, but you cannot replace or modify it unless you work at the binary level or re-encode the strings.
Let's look at a few examples:
String str = "Hadoop";
byte[] bt1 = str.getBytes();
System.out.println(bt1.length); // 6 (with an ASCII-compatible default charset)
byte[] bt2a = str.getBytes("UTF-16");
System.out.println(bt2a.length); // 14: 2-byte BOM + 12 bytes of text
byte[] bt2b = str.getBytes("UTF-16BE");
System.out.println(bt2b.length); // 12: no BOM
byte[] bt3 = str.getBytes("UTF-16LE");
System.out.println(bt3.length); // 12: no BOM
With plain UTF-16, Java encodes as Big Endian and prepends a 2-byte BOM (FE FF) so that a reader can tell BE from LE; that's why you get 14 bytes. If you specify the byte order explicitly (UTF-16BE or UTF-16LE), no BOM is added and you get 12 bytes.
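To see where the extra bytes come from, it may help to dump the encoded bytes in hex. This is only a small sketch: the class name BomBytesDemo and the hex helper are made up for illustration, and it uses a shorter string ("Hi") to keep the output readable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomBytesDemo {

    // Hypothetical helper: encodes the text and returns the bytes as space-separated hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Plain UTF-16: BOM (FE FF) followed by big-endian code units.
        System.out.println(hex("Hi", StandardCharsets.UTF_16));   // FE FF 00 48 00 69
        // Explicit byte order: no BOM is written.
        System.out.println(hex("Hi", StandardCharsets.UTF_16BE)); // 00 48 00 69
        System.out.println(hex("Hi", StandardCharsets.UTF_16LE)); // 48 00 69 00
    }
}
```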
A simple replace, as you tried, is not the right tool for stripping the BOM. The BOM, if present, belongs to the underlying byte stream, not to the text that Java handles in memory as a string. So it's best dealt with while the bytes are being read or decoded, rather than by manipulating the string's characters afterwards.
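As a rough illustration of working at the binary level with the JDK only, the sketch below checks a byte array for the UTF-8 BOM (EF BB BF) and decodes from just after it. The class name BomDecodeDemo and the helper decodeUtf8SkippingBom are hypothetical, and it handles only the UTF-8 case; BOMInputStream remains the more complete option.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {

    // Hypothetical helper: decodes UTF-8 bytes, skipping a leading EF BB BF if present.
    static String decodeUtf8SkippingBom(byte[] bytes) {
        boolean hasBom = bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
        int offset = hasBom ? 3 : 0;
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'a', 'd', 'o', 'o', 'p'};

        // Java's UTF-8 decoder does not drop the BOM, so the decoded text is one char longer.
        String naive = new String(utf8WithBom, StandardCharsets.UTF_8);
        System.out.println(naive.length()); // 7

        // Skipping the three BOM bytes before decoding gives the expected text.
        System.out.println(decodeUtf8SkippingBom(utf8WithBom)); // Hadoop

        // The UTF-16 decoder, by contrast, consumes the BOM itself.
        byte[] utf16 = "Hadoop".getBytes(StandardCharsets.UTF_16);
        System.out.println(new String(utf16, StandardCharsets.UTF_16).length()); // 6
    }
}
```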