90

I'm reading a file through a FileReader. The file is UTF-8 encoded (with a BOM), and now my problem is: I read the file and output a string, but sadly the BOM marker is output too. Why does this occur?

fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}

Output after the first line:

?<style>
onigunn
  • 7
    UTF-8 is not supposed to have a BOM! It is neither necessary **nor recommended** by The Unicode Standard. – tchrist Feb 04 '11 at 12:22
  • 36
    @tchrist: At Microsoft, they do not care about standards. – Matti Virkkunen Feb 04 '11 at 12:24
  • 3
    To expand on Matti's point, all MS text editors prefix UTF-8 documents with a BOM. – Ant Feb 04 '11 at 12:34
  • 12
    @Matti "not recommended" != non-standard – bacar Jan 31 '12 at 16:42
  • 8
    @tchrist tell that to the people who put the BOM in the UTF-8 files (=Microsoft) when saving them. – dstibbe Jun 08 '12 at 14:01
  • 1
    @dstibbe I am not responsible for Microsoft’s stupidity. I will have no part in it. My hands are clean. – tchrist Jun 08 '12 at 19:45
  • 7
    @tchrist I wish things were that simple. You create an application for the users, not for yourself. And the users use (partially) Microsoft software to create their files. – dstibbe Jun 11 '12 at 10:32
  • 3
    BOM is necessary for UTF-16, optional for UTF-8. Java can handle neither (by standard library). C# can handle both. Now talk who follows standard and who does not. – peenut Jul 21 '12 at 14:40
  • @peenut, Java *can* handle BOMs in UTF-16, if you tell it to – finnw Dec 06 '13 at 15:21
  • possible duplicate of [Byte order mark screws up file reading in Java](http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java) – 200_success Mar 25 '15 at 20:48
  • 1
    @tchrist, the BOM is a standard, just not one from MS or Unicode. peenut is right that for XML files a UTF-16 BOM is a MUST and a UTF-8 BOM is a MAY; see the XML standard at W3.org: https://www.w3.org/TR/xml/#charencoding. The method for autodetecting the BOM is non-normative; see Section F, "Autodetection of Character Encodings (Non-Normative)". – bernie3280109 Feb 05 '20 at 11:28

9 Answers

97

In Java, you have to consume the UTF-8 BOM manually if it is present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it would break existing tools like JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.
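For illustration only, here is a minimal sketch of that manual consumption (my own helper, not code from the bug reports; it assumes Java 9+ for `readNBytes`): peek at the first three bytes through a `PushbackInputStream` and push them back if they are not the `EF BB BF` sequence.

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public final class Utf8BomSkipper {

    // Returns a stream positioned just after a leading UTF-8 BOM (EF BB BF), if one is present.
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = pushback.readNBytes(head, 0, 3);
        boolean isBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && read > 0) {
            // Not a BOM: push the bytes back so the caller still sees them.
            pushback.unread(head, 0, read);
        }
        return pushback;
    }
}

Wrap the returned stream in an InputStreamReader with the UTF-8 charset and the BOM never reaches the decoded text.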

sideshowbarker
RealHowTo
  • Very late to the game, but this seems to be very slow for large files. I tried using a buffer. If you use a buffer, it seems to leave some sort of trailing data, as well. – rocksNwaves Feb 11 '20 at 16:32
49

The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.

tmp = tmp.replace("\uFEFF", "");

Also see this Guava bug report
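Putting this together with the comments below, here is a sketch of a complete read loop (file name and variable names are mine) that assumes the file is decoded as UTF-8, so a BOM shows up as U+FEFF at the start of the first line only:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StripLeadingBom {
    public static void main(String[] args) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader br = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                if (first) {
                    // Only the very first line can start with a decoded BOM (U+FEFF).
                    line = line.replaceFirst("\\A\uFEFF", "");
                    first = false;
                }
                content.append(line).append(System.lineSeparator());
            }
        }
        System.out.print(content);
    }
}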

finnw
  • 4
    The bad thing about "extremely unlikely" is that it turns up extremely rarely, so that locating the bug is extremely difficult... :) So be extremely wary when using this code if you believe your software will be successful and long-lived, because sooner or later any existing situation will occur. – Franz D. Jul 15 '15 at 02:14
  • 8
    `FEFF` is a UTF-16 BOM. The UTF-8 BOM is `EFBBBF`. – Steve Pitchers May 27 '16 at 10:08
  • 6
    @StevePitchers but we must match it *after* decoding, when it is part of a `String` (which is always represented as UTF-16) – finnw May 27 '16 at 14:09
  • What about `\uFFFE` (UTF-16, little-endian)? – Suzana Mar 22 '18 at 12:08
  • @live-love And if the file doesn't have a BOM, you just truncated the first line. – Eric Duminil May 08 '19 at 12:46
  • To make sure you only replace the BOM if it's right at the beginning of the string, you could use `tmp = tmp.replaceAll("\\A\uFEFF", "");` – Eric Duminil May 08 '19 at 12:46
  • What's wrong with replacing it only if it's at the start of the file instead of anywhere in it ? – Joseph Budin Aug 16 '22 at 14:13
40

Use the Apache Commons IO library.

Class: org.apache.commons.io.input.BOMInputStream

Example usage:

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    //use reader
} finally {
    inputStream.close();
}
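As a usage note (my own continuation, not part of the original answer), the `//use reader` placeholder could be filled in like this, with try-with-resources taking care of closing the streams; the file name is a placeholder:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

public class ReadWithoutBom {
    public static void main(String[] args) throws IOException {
        String defaultEncoding = "UTF-8";
        try (BOMInputStream bomIn = new BOMInputStream(new FileInputStream("someFileWithPossibleUtf8Bom.txt"))) {
            // getBOM() returns null when the file has no BOM;
            // by default only the UTF-8 BOM is detected (see the comment below).
            ByteOrderMark bom = bomIn.getBOM();
            String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(bomIn, charsetName))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // the BOM itself is never returned
                }
            }
        }
    }
}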
200_success
peenut
  • http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html – bmoc Dec 10 '13 at 19:55
  • 1
    This code will only detect and exclude the UTF-8 BOM. Check the implementation of `BOMInputStream`: the two-argument constructor `BOMInputStream(InputStream delegate, boolean include)` delegates to `this(delegate, include, ByteOrderMark.UTF_8)`, so only the UTF-8 BOM is handled unless you pass other `ByteOrderMark`s explicitly. – czupe Aug 30 '17 at 14:04
9

Here's how I use the Apache Commons BOMInputStream; it uses a try-with-resources block. The `false` argument tells the object to exclude (rather than include) the listed BOMs (we use "BOM-less" text files for safety reasons, haha):

try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new BOMInputStream(new FileInputStream(file),
                false, ByteOrderMark.UTF_8,
                ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
                ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)))) {
    // use br here

} catch (Exception e) {
    // handle the exception
}
Steve Pitchers
snakedoctor
  • 2
    can never figure out how to post stuff on this site - always ends up AFU. – snakedoctor May 25 '16 at 19:27
  • If you want a String then you can skip the `BufferedReader` and `InputStreamReader` and use `commons.io.IOUtils` instead: `String xml = IOUtils.toString(bomInputStream, StandardCharsets.UTF_8)` – mihca Jun 01 '22 at 08:01
8

Consider UnicodeReader from Google, which does all this work for you.

Charset utf8 = StandardCharsets.UTF_8;  // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8.name())) {
    ....
}

Maven Dependency:

<dependency>
    <groupId>com.google.gdata</groupId>
    <artifactId>core</artifactId>
    <version>1.47.1</version>
</dependency>
Adrian Smith
7

Use Apache Commons IO.

For example, let's take a look at my code (used for reading a text file with both Latin and Cyrillic characters) below:

String defaultEncoding = "UTF-16";
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));

BOMInputStream bomInputStream = new BOMInputStream(inputStream);

ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
int data = reader.read();
while (data != -1) {

 char theChar = (char) data;
 data = reader.read();
 ari.add(Character.toString(theChar));
}
reader.close();

As a result, we have an ArrayList named "ari" with all the characters from the file "1.txt", except the BOM.

pawman
3

If somebody wants to do it with the standard library only, this would be a way:

public static String cutBOM(String value) {
    if (value.length() < 3)
        return value;
    // UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
    String bom = String.format("%x", new BigInteger(1, value.substring(0, 3).getBytes()));
    if (bom.equals("efbbbf"))
        // UTF-8
        return value.substring(3);
    else if (bom.startsWith("feff") || bom.startsWith("fffe"))
        // UTF-16BE or UTF-16LE
        return value.substring(2);
    else
        return value;
}
Matt
Markus
2

It's mentioned here that this is usually a problem with files on Windows.

One possible solution would be running the file through a tool like dos2unix first.

Community
Drake Sobania
  • yes, `dos2unix` (which is part of cygwin) has options for adding (`--add-bom`) and removing (`--remove-bom`) bom. – Roman Oct 17 '17 at 11:45
1

The easiest way I found to bypass the BOM:

BufferedReader br = new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8));
String currentLine;
while ((currentLine = br.readLine()) != null) {
    // remove the UTF-8 BOM, in case the line contains it
    currentLine = currentLine.replace("\uFEFF", "");
}
David