What exactly is the BOM in an ANSI XML document, and should it be removed? Should an XML document be in UTF-8 instead? Can anyone tell me a Java method that will detect the BOM? The BOM consists of the bytes EF BB BF.
4 Answers
For an ANSI XML file it should actually be removed. If you want to use UTF-8 you don't really need it. It is only needed for UTF-16 and UTF-32.
The Byte-Order-Mark (or BOM) is a special marker added at the very beginning of a Unicode file encoded in UTF-8, UTF-16 or UTF-32. It is used to indicate whether the file uses the big-endian or little-endian byte order. The BOM is mandatory for UTF-16 and UTF-32, but it is optional for UTF-8.
(Source: https://www.opentag.com/xfaq_enc.htm#enc_bom)
Regarding the question of how to detect this in Java:
See the answers to the question "Java : How to determine the correct charset encoding of a stream". If you want to detect the BOM yourself (at your own risk), see for example the code in "Java Tip: How to read a file and automatically specify the correct encoding".
Basically just read in the first few bytes yourself and then determine if you may have found a BOM.
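A minimal sketch of that idea, using only the JDK. The class and method names (BomSniffer, detectBom) are made up for illustration, and the byte patterns are the five common BOMs. Note that the stream variant consumes up to four bytes, so a real caller would need to reopen or rewind the stream afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class BomSniffer {

    // True if the first `len` bytes of `data` begin with `prefix`.
    private static boolean startsWith(byte[] data, int len, int... prefix) {
        if (len < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Returns the encoding implied by a leading BOM, or null if none is found. */
    static String detectBom(byte[] data) {
        int n = data.length;
        // The 4-byte patterns are tested first because FF FE 00 00 also starts
        // with the UTF-16LE BOM. (It could in theory be UTF-16LE followed by a
        // NUL character; this sketch assumes UTF-32LE.)
        if (startsWith(data, n, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, n, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, n, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, n, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, n, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** Reads at most four bytes from the stream and checks them for a BOM. */
    static String detectBom(InputStream in) throws IOException {
        byte[] head = new byte[4];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // stream ended early
            n += r;
        }
        return detectBom(Arrays.copyOf(head, n));
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        System.out.println(detectBom(new ByteArrayInputStream(sample))); // prints UTF-8
    }
}
```

A null result only means that no BOM was present; it says nothing about which encoding the file actually uses.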

-
Thanks for the great answer. Since I am expecting the file to be UTF-8, I am just ignoring the first 3 chars using something like: String file1sub = getXMLContents(file1).substring(3); – djangofan Nov 20 '09 at 18:41
-
@jitter - I'm not sure where your quote on BOMs comes from. XML doesn't require a BOM in UTF-16 or UTF-32 documents - a parser should manage without. XML encoding detection: http://www.w3.org/TR/REC-xml/#sec-guessing Otherwise, the requirement for a BOM is domain-dependent. Unicode.org BOM FAQ: http://unicode.org/faq/utf_bom.html#BOM – McDowell Nov 20 '09 at 19:09
-
That explains why Notepad++ allows you to set the default for new files to be "UTF-8 without BOM" – djangofan Jun 24 '11 at 22:14
The byte order mark is likely to be one of these byte sequences:
UTF-8 BOM: ef bb bf
UTF-16BE BOM: fe ff
UTF-16LE BOM: ff fe
UTF-32BE BOM: 00 00 fe ff
UTF-32LE BOM: ff fe 00 00
These are the variously encoded forms of the Unicode codepoint U+FEFF. This can be expressed as a Java char literal using '\uFEFF' (Java char values are implicitly UTF-16). Since U+FEFF isn't in the character repertoire of most non-Unicode encodings, those encodings cannot encode the BOM codepoint. (More on encoding the BOM using Java here.)
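To see where those byte sequences come from, here is a small sketch that encodes U+FEFF with a few of the JDK's standard charsets. The class and method names (BomBytes, encodeBom, canEncodeBom) are made up for illustration. UTF-32 is left out because it is not part of java.nio.charset.StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomBytes {

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the single codepoint U+FEFF in the given charset.
    static String encodeBom(Charset cs) {
        return hex("\uFEFF".getBytes(cs));
    }

    // Reports whether the charset has any representation for U+FEFF.
    static boolean canEncodeBom(Charset cs) {
        return cs.newEncoder().canEncode('\uFEFF');
    }

    public static void main(String[] args) {
        System.out.println(encodeBom(StandardCharsets.UTF_8));       // ef bb bf
        System.out.println(encodeBom(StandardCharsets.UTF_16BE));    // fe ff
        System.out.println(encodeBom(StandardCharsets.UTF_16LE));    // ff fe
        System.out.println(canEncodeBom(StandardCharsets.ISO_8859_1)); // false
    }
}
```

Going the other way, the three UTF-8 bytes decode to a single Java char: once a file has been read into a String, a leading BOM occupies one position, not three.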
When it comes to BOMs and XML, they are optional (see also the Unicode BOM FAQ). Detection of encoding in XML is relatively straightforward if the encoding is specified in the declaration. Always make sure that the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) matches the encoding used to write the document. If you are strict about this, parsers should be able to interpret your documents correctly. (XML spec on encoding detection.)
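One way to keep the declaration and the bytes in step is to derive both from the same Charset value. The sketch below does that and reads the result back with the JDK's DOM parser. The names (DeclaredEncoding, write, readBack) and the <note> element are made up for illustration, and the text is assumed to contain no markup characters that would need escaping.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class DeclaredEncoding {

    // Builds a tiny document whose declaration names the charset actually
    // used to turn it into bytes. `text` must not contain '<' or '&'.
    static byte[] write(String text, Charset cs) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + cs.name() + "\"?>"
                + "<note>" + text + "</note>";
        return xml.getBytes(cs);
    }

    // Parses the bytes and returns the root element's text. The parser
    // picks the decoder from the declaration (or a BOM) by itself.
    static String readBack(byte[] doc) {
        try {
            Document d = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(doc));
            return d.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // contains a non-ASCII character
        byte[] latin1 = write(text, StandardCharsets.ISO_8859_1);
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        System.out.println(readBack(latin1).equals(text)); // true
        System.out.println(readBack(utf8).equals(text));   // true
    }
}
```

Handing the parser an InputStream rather than a String or Reader matters here: with raw bytes the parser does the encoding detection, whereas with already-decoded characters it has to trust whoever decoded them.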
I advocate encoding as Unicode wherever possible (see also the 10 Commandments of Unicode). That said, XML allows the representation of any Unicode character via escape entities (e.g. 'A' could be represented by &#65;), so it isn't necessarily a requirement to avoid data loss.

-
»XML allows the representation of any Unicode character via escape entities« – well, except you need CDATA sections ;-) – Joey Nov 07 '16 at 14:01
Do not insert a BOM in a UTF-8 file: if two such files are merged, you end up with a BOM in the middle, which might break an application or cause an XML parser to throw an exception.
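A small sketch of the problem, with made-up names (MergedBom, withBom, concat) and two toy fragments standing in for real files. Java's UTF-8 decoder does not strip BOMs, so both of them survive as U+FEFF characters in the decoded text.

```java
import java.nio.charset.StandardCharsets;

final class MergedBom {

    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Joins two byte arrays, as a naive file merge would.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    // Simulates a UTF-8 file that was saved with a BOM.
    static byte[] withBom(String text) {
        return concat(UTF8_BOM, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] merged = concat(withBom("<a/>"), withBom("<b/>"));
        String decoded = new String(merged, StandardCharsets.UTF_8);
        // The first BOM sits at index 0; the second one is now stranded
        // in the middle of the text, right after "<a/>".
        System.out.println(decoded.indexOf('\uFEFF', 1)); // prints 5
    }
}
```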

-
Ahh. Interesting tip. I never thought of that. Luckily, merging XML files is not that common. – djangofan Aug 21 '12 at 17:48
-
You should never merge XML files as simple text files. Every XML file should start with a prolog. – Vity Jun 29 '19 at 09:33
OP:
Can anyone tell me a Java method that will detect the BOM?
org.apache.commons.io.input.BOMInputStream
Javadocs:
This class detects these bytes and, if required, can automatically skip them and return the subsequent byte as the first byte in the stream.
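If adding the Commons IO dependency is not an option, the skipping behaviour described in that Javadoc can be approximated with the JDK's java.io.PushbackInputStream. This is a stdlib-only sketch for the UTF-8 BOM, not the Commons IO implementation, and the names (Utf8BomSkipper, skipUtf8Bom, decodeUtf8) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8BomSkipper {

    // Wraps the stream so that a leading EF BB BF is consumed and any
    // other leading bytes are pushed back for the caller to read.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = pb.read(head, n, head.length - n);
            if (r < 0) break; // fewer than three bytes available
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pb;
    }

    // Convenience helper: decodes a byte array as UTF-8, minus any BOM.
    static String decodeUtf8(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int r;
            while ((r = in.read(buf)) != -1) {
                out.write(buf, 0, r);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(decodeUtf8(withBom)); // prints <a/>
    }
}
```

The returned stream can be passed straight to an XML parser or wrapped in an InputStreamReader, so the rest of the code never sees the BOM.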

-
I'm not sure how this might be helpful to answering the question "What is XML BOM and how do I detect it?" – Matt Jun 03 '14 at 19:04
-
@Matt - I copied the description from the Javadocs. Hope that helps. – Robert Fleming Jun 03 '14 at 23:52