
Can an XML start with anything other than a < character?

This was a random thought I had while trying to work out how to distinguish a string containing an XML document from one containing a path to an XML file.

I believe the answer is no, but I'm looking to be certain.

kjhughes
Kilazur
  • It can start with a whitespace and still be valid. – baao Mar 14 '18 at 11:10
  • Note that adding whitespace to the start of an XML file can still invalidate it. In XML 1.0 an XML declaration is [optional](https://stackoverflow.com/a/7007781/446106), but if it has one, then there must not be any whitespace before it. – mwfearnley Feb 20 '23 at 12:41

2 Answers


Only a < or a whitespace character can begin a well-formed XML document.

The W3C XML Recommendation includes an EBNF grammar that definitively defines an XML document:

 [1] document ::= prolog element Misc*
[22] prolog   ::= XMLDecl? Misc* (doctypedecl Misc*)?
[23] XMLDecl  ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
[27] Misc     ::= Comment | PI | S
 [3] S        ::= (#x20 | #x9 | #xD | #xA)+

From these rules it follows that an XML document may start with a whitespace character or a < character from any one of the following constructs:

  • XML Declaration
  • Comment
  • PI
  • Doctype Declaration
  • Element

An XML document may start with no other character.
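
For the original use case (telling an XML string apart from a path), the rule above suggests a simple first-character check. The following is a minimal sketch, not a definitive test: the function name `looks_like_xml` is hypothetical, and the check is only a necessary condition for well-formedness, so a string that passes may still fail to parse, and some file systems allow paths that begin with `<`.

```python
def looks_like_xml(text: str) -> bool:
    """Heuristic: True if the first non-whitespace character is '<'."""
    # XML whitespace (production S) is only space, tab, CR, and LF,
    # so strip exactly those rather than calling a bare .lstrip().
    stripped = text.lstrip(" \t\r\n")
    return stripped.startswith("<")
```

A string that fails this check cannot be a well-formed XML document; a string that passes should still be handed to a real parser before being trusted.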

Notes:

  1. An implication of these rules is that if an XML document contains an XML declaration, it must appear at the top (or you could receive a somewhat cryptic error message). So, for XML documents with an XML declaration, the first character will have to be a < and cannot be whitespace.
  2. A BOM may appear at the beginning of an XML document entity to indicate the byte order of the character encoding being used. These bytes (two in UTF-16, three in UTF-8) are typically not considered to be part of the XML document itself but rather part of the storage unit of the physical structure supporting the XML document. A BOM, along with an XML declaration, assists XML processors in character encoding detection. [Suggestion for BOM mention thanks to JonHanna]
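
Both notes can be illustrated with Python's standard library. This is a sketch under stated assumptions: the helper names `decode_skipping_bom` and `is_well_formed` are hypothetical, the fallback to UTF-8 when no BOM is present is an assumption made for brevity, and a real XML processor would also consult the encoding declaration.

```python
import codecs
import xml.etree.ElementTree as ET

# Longest BOMs first, so UTF-32 LE is not mistaken for UTF-16 LE
# (the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_skipping_bom(data: bytes, default: str = "utf-8") -> str:
    """Drop a leading BOM, if any, and decode the remaining bytes."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # Assumption: without a BOM, treat the bytes as UTF-8.
    return data.decode(default)


def is_well_formed(text: str) -> bool:
    """True if the standard-library parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True
```

With these helpers, `decode_skipping_bom(codecs.BOM_UTF8 + b"<a/>")` yields text whose first character is `<`. The parser accepts `' <a/>'`, where whitespace precedes the root element, but rejects `' <?xml version="1.0"?><a/>'`, where whitespace precedes the XML declaration.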
kjhughes
  • To answer the question directly: The first *non-S* character must be a `<`. Correct? – Shnugo Mar 14 '18 at 11:24
  • @Shnugo: Correct. – kjhughes Mar 14 '18 at 11:44
  • I'd add that anything checking at a lower level that something might be XML should also look for BOM. Strictly the BOM isn't part of the text, so the above is all correct, but lower-level code needs to make sure it handles a BOM too. – Jon Hanna Mar 14 '18 at 12:27
  • @Shnugo technically correct, but S (whitespace) characters are only allowed before the `<` if there is no XML declaration. – mwfearnley Feb 20 '23 at 12:45

A well-formed XML document entity always has "<" as its first non-whitespace character.

A well-formed external general parsed entity need not start with "<".

So if by "a XML" you mean "a well-formed XML document entity", then the answer is "no".

Michael Kay