
I am parsing an XML file using the Xerces SAX parser.
Is the XML declaration <?xml version="1.0" encoding="UTF-8"?> required?

Volker E.
eros
  • There is a difference between valid and well-formed documents. Which of those do you mean? – Felix Kling Aug 10 '11 at 07:47
  • I am receiving a prolog error/invalid UTF-8 encoding error. Then I found a BOM in the XML file, which the user had opened in Notepad (I can't avoid this). I am not sure whether I'm referring to valid or well-formed documents. I just need to avoid the errors, which is why I am creating a function that removes all bytes prior to "<". So I need to know whether the XML header declaration is required. What do you think, guys? – eros Aug 10 '11 at 08:03
  • Is there a Java class that removes the BOM, or a few leading bytes, from the XML file's InputStream? I am thinking of the skip method from FilterInputStream and PushbackInputStream, but I have no idea how to use it. – eros Aug 10 '11 at 08:27
  • @eros: "*i am not sure i'm referring to a valid or well-formed documents*" See [Well-formed vs Valid XML](http://stackoverflow.com/a/25830482/290085) for a concise explanation of the difference. – kjhughes Oct 17 '14 at 13:13

3 Answers


In XML 1.0, the XML Declaration is optional. See section 2.8 of the XML 1.0 Recommendation, where it says it "should" be used -- which means it is recommended, but not mandatory. In XML 1.1, however, the declaration is mandatory. See section 2.8 of the XML 1.1 Recommendation, where it says "MUST" be used. It even goes on to state that if the declaration is absent, that automatically implies the document is an XML 1.0 document.

Note that in an XML Declaration the encoding and standalone are both optional. Only the version is mandatory. Also, these are not attributes, so if they are present they must be in that order: version, followed by any encoding, followed by any standalone.

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" standalone="yes"?>
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
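A minimal sketch of how these rules can be checked with a SAX parser, assuming the JAXP parser bundled with the JDK (which is derived from Xerces). The class name DeclarationCheck and the helper isWellFormed are made up for illustration and are not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class DeclarationCheck {

    // Returns true if the SAX parser accepts the text as well-formed XML.
    // The text is encoded as UTF-8 before it is handed to the parser.
    static boolean isWellFormed(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            // DefaultHandler rethrows fatal (well-formedness) errors.
            parser.parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // No declaration at all: fine for XML 1.0.
        System.out.println(isWellFormed("<root/>"));
        // Version only.
        System.out.println(isWellFormed("<?xml version=\"1.0\"?><root/>"));
        // Version, encoding, standalone, in that order.
        System.out.println(isWellFormed(
            "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><root/>"));
        // Encoding before version: rejected, because the order is fixed.
        System.out.println(isWellFormed(
            "<?xml encoding=\"UTF-8\" version=\"1.0\"?><root/>"));
    }
}
```

The last call is the interesting one: because version, encoding and standalone are not ordinary attributes, swapping them makes the document ill-formed.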

If you don't specify the encoding in this way, XML parsers try to guess what encoding is being used. The XML 1.0 Recommendation describes one possible way character encoding can be autodetected. In practice, this is not much of a problem if the input is encoded as UTF-8, UTF-16 or US-ASCII. Autodetection doesn't work when it encounters 8-bit encodings that use characters outside the US-ASCII range (e.g. ISO 8859-1) -- avoid creating these if you can.
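To make the autodetection point concrete, here is a small sketch, again assuming the JDK's bundled Xerces-derived SAX parser. EncodingCheck and parses are hypothetical names chosen for this example. It feeds the same element to the parser in three encodings, with and without an encoding declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class EncodingCheck {

    // Returns true if the raw bytes parse as well-formed XML.
    static boolean parses(byte[] bytes) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Bad byte sequences may surface as an IOException subclass.
            return false;
        }
    }

    public static void main(String[] args) {
        String body = "<name>caf\u00e9</name>"; // contains one non-ASCII character
        String decl = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";

        // UTF-8 without a declaration: the default assumption is correct.
        System.out.println(parses(body.getBytes(StandardCharsets.UTF_8)));
        // UTF-16 with a BOM (Java's UTF_16 encoder writes one): detected from the BOM.
        System.out.println(parses(body.getBytes(StandardCharsets.UTF_16)));
        // ISO 8859-1 without a declaration: the bytes are not valid UTF-8.
        System.out.println(parses(body.getBytes(StandardCharsets.ISO_8859_1)));
        // ISO 8859-1 with a matching declaration: the parser is told what to use.
        System.out.println(parses((decl + body).getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Only the third case should fail, which is the situation the paragraph above warns about: an 8-bit encoding with non-ASCII characters and no declaration to name it.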

The standalone declaration indicates whether the XML document can be correctly processed without the DTD. People rarely use it. These days, it is a bad idea to design an XML format that is missing information without its DTD.

Update:

A "prolog error/invalid utf-8 encoding" error indicates that the actual data the parser found inside the file did not match the encoding that the XML declaration says it is. Or in some cases the data inside the file did not match the autodetected encoding.

Since your file contains a byte-order mark (BOM), it may well be in UTF-16 encoding, although Notepad also writes a BOM when it saves a file as UTF-8. I suspect that your declaration says <?xml version="1.0" encoding="UTF-8"?>, which is incorrect if the file has been changed into UTF-16 by Notepad. The simple solution is to remove the encoding and simply say <?xml version="1.0"?>. You could also edit it to say encoding="UTF-16", but that would be wrong for the original file (which wasn't in UTF-16) or if the file somehow gets changed back to UTF-8 or some other encoding.

Don't bother trying to remove the BOM -- that's not the cause of the problem. Using Notepad or WordPad to edit XML is the real problem!

Jeppe Stig Nielsen
Hoylen
  • My question was answered, but my follow-up question was not. Do I need to create another question for that, or could you please add it here? – eros Aug 10 '11 at 08:24
  • The BOM can be the cause of the problem. Some older XML parsers will not accept a BOM at the start of a UTF-8 document (it was designed for UTF-16, and only became acceptable with UTF-8 later). But it's unlikely to be a problem if you're using a recent version of Xerces. – Michael Kay Aug 10 '11 at 10:44
  • Also note that in the "Save As" dialog in Notepad you can choose what encoding to save your XML as. If you want to remove the BOM, just save as "ASCII" (assuming you're not using any Unicode characters). For the first 128 characters, ASCII and UTF-8 are identical. – BrainSlugs83 Sep 27 '13 at 08:47
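On the follow-up question in the comments about removing a BOM from an InputStream: if an older parser does reject a UTF-8 BOM, one possible approach is the PushbackInputStream idea eros mentions. The following is only a sketch under that assumption. BomSkipper and skipUtf8Bom are hypothetical names, and it handles only the three-byte UTF-8 BOM (EF BB BF), not UTF-16 BOMs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper class, for illustration only.
class BomSkipper {

    // Wraps the stream and consumes a leading UTF-8 BOM if one is present.
    // Any other leading bytes are pushed back so the caller sees them unchanged.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 or EOF.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: restore what was read
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println((char) clean.read()); // first byte after the BOM
    }
}
```

The returned stream can be passed to the parser in place of the original one. This is gentler than discarding every byte before the first "<", because it leaves a file without a BOM untouched.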

The XML declaration is optional, so your XML is well-formed without it. But it is recommended to use it so that parsers do not make wrong assumptions, specifically about the encoding used.

Aravind Yarram
  • Am I the only one that finds it bizarre that you tell XML parsers what encoding to use after they've already started decoding your document? I mean clearly, if it can parse that tag and understand what it says, then it has already figured out the correct encoding. I can't think of any legitimate use for the encoding attribute. – BrainSlugs83 Sep 27 '13 at 08:49
  • @BrainSlugs83 With no BOM, the encoding is specified to be 8-bit: either ASCII, UTF-8, or any of them old 8-bit national encodings. The XML declaration is all lower-half 8-bit, which is the same in all those encodings, and it conveys enough information to choose the upper half. Not the best design, but still better than guessing between, say, CP1251 and CP866, as was common for text files of them olden days. – Eugene Ryabtsev Oct 08 '15 at 06:57
  • But they should have gone clean and said XML is UTF-8 - end of story. – Lothar Jul 05 '16 at 14:10

It is only required if you aren't using the default values for version and encoding (which you are in that example).

Quentin