90

I'm reading a file through a FileReader. The file is UTF-8 encoded (with a BOM), and now my problem is: I read the file and output a string, but sadly the BOM marker is output too. Why does this occur?

fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}

Output after the first line:

?<style>
onigunn
  • 7
    UTF-8 is not supposed to have a BOM! It is neither necessary **nor recommended** by The Unicode Standard. – tchrist Feb 04 '11 at 12:22
  • 36
    @tchrist: At Microsoft, they do not care about standards. – Matti Virkkunen Feb 04 '11 at 12:24
  • 3
    To expand on Matti's point, all MS text editors prefix UTF-8 documents with a BOM. – Ant Feb 04 '11 at 12:34
  • 12
    @Matti "not recommended" != non-standard – bacar Jan 31 '12 at 16:42
  • 8
    @tchrist tell that to the people who put the BOM in the UTF-8 files (=Microsoft) when saving them. – dstibbe Jun 08 '12 at 14:01
  • 1
    @dstibbe I am not responsible for Microsoft’s stupidity. I will have no part in it. My hands are clean. – tchrist Jun 08 '12 at 19:45
  • 7
    @tchrist I wish things were that simple. You create an application for the users, not for yourself. And the users use (partially) Microsoft software to create their files. – dstibbe Jun 11 '12 at 10:32
  • 3
    BOM is necessary for UTF-16, optional for UTF-8. Java can handle neither (by standard library). C# can handle both. Now talk who follows standard and who does not. – peenut Jul 21 '12 at 14:40
  • @peenut, Java *can* handle BOMs in UTF-16, if you tell it to – finnw Dec 06 '13 at 15:21
  • possible duplicate of [Byte order mark screws up file reading in Java](http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java) – 200_success Mar 25 '15 at 20:48
  • 1
    @tchrist, the BOM is a standard, just not one from MS or Unicode. peenut is right that for XML files a UTF-16 BOM is a MUST and a UTF-8 BOM is a MAY; see the XML standard at W3.org: https://www.w3.org/TR/xml/#charencoding. The method for autodetecting the BOM is non-normative; see Section F, "Autodetection of Character Encodings (Non-Normative)". – bernie3280109 Feb 05 '20 at 11:28

9 Answers

97

In Java, you have to consume the UTF-8 BOM manually if it is present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it would break existing tools like JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.
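For illustration only, here is a minimal sketch of that manual consumption (my own helper, not code from the bug reports; it assumes Java 9+ for `readNBytes`): peek at the first three bytes through a `PushbackInputStream` and push them back if they are not the `EF BB BF` sequence.

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public final class Utf8BomSkipper {

    // Returns a stream positioned just after a leading UTF-8 BOM (EF BB BF), if one is present.
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = pushback.readNBytes(head, 0, 3);
        boolean isBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && read > 0) {
            // Not a BOM: push the bytes back so the caller still sees them.
            pushback.unread(head, 0, read);
        }
        return pushback;
    }
}

Wrap the returned stream in an InputStreamReader with the UTF-8 charset and the BOM never reaches the decoded text.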

sideshowbarker
RealHowTo
  • Very late to the game, but this seems to be very slow for large files. I tried using a buffer. If you use a buffer, it seems to leave some sort of trailing data, as well. – rocksNwaves Feb 11 '20 at 16:32
49

The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.

tmp = tmp.replace("\uFEFF", "");

Also see this Guava bug report
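Putting this together with the comments below, here is a sketch of a complete read loop (file name and variable names are mine) that assumes the file is decoded as UTF-8, so a BOM shows up as U+FEFF at the start of the first line only:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StripLeadingBom {
    public static void main(String[] args) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader br = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                if (first) {
                    // Only the very first line can start with a decoded BOM (U+FEFF).
                    line = line.replaceFirst("\\A\uFEFF", "");
                    first = false;
                }
                content.append(line).append(System.lineSeparator());
            }
        }
        System.out.print(content);
    }
}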

finnw
  • 4
    The bad thing about "extremely unlikely" is that it turns up extremely rarely, so that locating the bug is extremely difficult... :) So be extremely wary when using this code if you believe your software will be successful and long-lived, because sooner or later any existing situation will occur. – Franz D. Jul 15 '15 at 02:14
  • 8
    `FEFF` is a UTF-16 BOM. The UTF-8 BOM is `EFBBBF`. – Steve Pitchers May 27 '16 at 10:08
  • 6
    @StevePitchers but we must match it *after* decoding, when it is part of a `String` (which is always represented as UTF-16) – finnw May 27 '16 at 14:09
  • What about `\uFFFE` (UTF-16, little-endian)? – Suzana Mar 22 '18 at 12:08
  • @live-love And if the file doesn't have a BOM, you just truncated the first line. – Eric Duminil May 08 '19 at 12:46
  • To make sure you only replace the BOM if it's right at the beginning of the string, you could use `tmp = tmp.replaceAll("\\A\uFEFF", "");` – Eric Duminil May 08 '19 at 12:46
  • What's wrong with replacing it only if it's at the start of the file instead of anywhere in it ? – Joseph Budin Aug 16 '22 at 14:13
40

Use the Apache Commons IO library.

Class: org.apache.commons.io.input.BOMInputStream

Example usage:

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    //use reader
} finally {
    inputStream.close();
}
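As a usage note (my own continuation, not part of the original answer), the `//use reader` placeholder could be filled in like this, with try-with-resources taking care of closing the streams; the file name is a placeholder:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

public class ReadWithoutBom {
    public static void main(String[] args) throws IOException {
        String defaultEncoding = "UTF-8";
        try (BOMInputStream bomIn = new BOMInputStream(new FileInputStream("someFileWithPossibleUtf8Bom.txt"))) {
            // getBOM() returns null when the file has no BOM;
            // by default only the UTF-8 BOM is detected (see the comment below).
            ByteOrderMark bom = bomIn.getBOM();
            String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(bomIn, charsetName))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // the BOM itself is never returned
                }
            }
        }
    }
}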
200_success
peenut
  • http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html – bmoc Dec 10 '13 at 19:55
  • 1
    This code will only detect and exclude the UTF-8 BOM. Check the implementation of `BOMInputStream`: the two-argument constructor `BOMInputStream(InputStream delegate, boolean include)` delegates to `this(delegate, include, ByteOrderMark.UTF_8)`, so only the UTF-8 BOM is handled unless you pass other `ByteOrderMark`s explicitly. – czupe Aug 30 '17 at 14:04
9

Here's how I use the Apache Commons BOMInputStream; it uses a try-with-resources block. The `false` argument tells the object to exclude (rather than include) the listed BOMs (we use "BOM-less" text files for safety reasons, haha):

try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new BOMInputStream(new FileInputStream(file),
                false, ByteOrderMark.UTF_8,
                ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
                ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)))) {
    // use br here

} catch (Exception e) {
    // handle the exception
}
Steve Pitchers
snakedoctor
  • 2
    can never figure out how to post stuff on this site - always ends up AFU. – snakedoctor May 25 '16 at 19:27
  • If you want a String then you can skip the `BufferedReader` and `InputStreamReader` and use `commons.io.IOUtils` instead: `String xml = IOUtils.toString(bomInputStream, StandardCharsets.UTF_8)` – mihca Jun 01 '22 at 08:01
8

Consider UnicodeReader from Google, which does all this work for you.

Charset utf8 = StandardCharsets.UTF_8;  // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8.name())) {
    ....
}

Maven Dependency:

<dependency>
    <groupId>com.google.gdata</groupId>
    <artifactId>core</artifactId>
    <version>1.47.1</version>
</dependency>
Adrian Smith
7

Use Apache Commons IO.

For example, let's take a look at my code (used for reading a text file with both Latin and Cyrillic characters) below:

String defaultEncoding = "UTF-16";
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));

BOMInputStream bomInputStream = new BOMInputStream(inputStream);

ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
int data = reader.read();
while (data != -1) {

 char theChar = (char) data;
 data = reader.read();
 ari.add(Character.toString(theChar));
}
reader.close();

As a result, we have an ArrayList named "ari" with all the characters from the file "1.txt", except the BOM.

pawman
3

If somebody wants to do it with the standard library only, this would be a way:

public static String cutBOM(String value) {
    if (value.length() < 3)
        return value;
    // UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
    String bom = String.format("%x", new BigInteger(1, value.substring(0, 3).getBytes()));
    if (bom.equals("efbbbf"))
        // UTF-8
        return value.substring(3);
    else if (bom.startsWith("feff") || bom.startsWith("fffe"))
        // UTF-16BE or UTF-16LE
        return value.substring(2);
    else
        return value;
}
Matt
Markus
2

It's mentioned here that this is usually a problem with files on Windows.

One possible solution would be running the file through a tool like dos2unix first.

Community
Drake Sobania
  • yes, `dos2unix` (which is part of cygwin) has options for adding (`--add-bom`) and removing (`--remove-bom`) bom. – Roman Oct 17 '17 at 11:45
1

The easiest way I found to bypass the BOM:

BufferedReader br = new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8));
String currentLine;
while ((currentLine = br.readLine()) != null) {
    // remove the UTF-8 BOM, in case the line contains it
    currentLine = currentLine.replace("\uFEFF", "");
}
David