Parsing xml string to an xml document fails if the string begins with <?xml... ?> section

Question

I have an XML file beginning like this:

<?xml version="1.0" encoding="utf-8"?>
<Report xmlns:rd="http://schemas.microsoft.com/SQLServer/reporting/reportdesigner" xmlns="http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition">
  <DataSources>

When I run the following code:

byte[] fileContent = //gets bytes
string stringContent = Encoding.UTF8.GetString(fileContent);
XDocument xml = XDocument.Parse(stringContent);

I get the following XmlException:

Data at the root level is invalid. Line 1, position 1.

Cutting out the XML declaration (the version and encoding line) fixes the problem. Why? How do I process this XML correctly?

score 27 · Answer 1 · edited Jan 03 '17 at 10:36

My first thought was that a .NET string is always Unicode (UTF-16), which conflicts with the utf-8 encoding declared in the XML. It seems, though, that XDocument's parsing is quite forgiving with respect to this.

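The problem is actually related to the UTF-8 preamble/byte order mark (BOM), which is a three-byte signature optionally present at the start of a UTF-8 stream. These three bytes are a hint as to the encoding being used in the stream. Encoding.UTF8.GetString does not strip them, so the decoded string starts with an invisible U+FEFF character ahead of the XML declaration, and XDocument.Parse rejects it.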

You can determine the preamble of an encoding by calling the GetPreamble method on an instance of the System.Text.Encoding class. For example:

// returns { 0xEF, 0xBB, 0xBF }
byte[] preamble = Encoding.UTF8.GetPreamble();
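To make the effect on the string-based approach concrete, here is a minimal sketch, assuming the bytes were saved with a UTF-8 preamble. The class and method names are hypothetical and only serve the illustration; the point is that the preamble survives decoding as the character U+FEFF.

```csharp
using System.Linq;
using System.Text;

static class PreambleDemo
{
    // Hypothetical helper: prefixes UTF-8 encoded text with the UTF-8 preamble,
    // mimicking a file that was saved with a BOM.
    public static byte[] WithPreamble(string text)
    {
        return Encoding.UTF8.GetPreamble()
            .Concat(Encoding.UTF8.GetBytes(text))
            .ToArray();
    }

    // True if the decoded string still starts with the BOM character U+FEFF.
    public static bool DecodedStringStartsWithBom(byte[] fileContent)
    {
        string stringContent = Encoding.UTF8.GetString(fileContent);
        return stringContent.Length > 0 && stringContent[0] == '\uFEFF';
    }
}
```

Calling PreambleDemo.DecodedStringStartsWithBom(PreambleDemo.WithPreamble("<Report />")) is a quick way to check whether a given byte array would trip the parser at line 1, position 1.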

The preamble should be handled correctly by XmlTextReader, so simply load your XDocument from an XmlTextReader:

XDocument xml;
using (var xmlStream = new MemoryStream(fileContent))
using (var xmlReader = new XmlTextReader(xmlStream))
{
    xml = XDocument.Load(xmlReader);
}
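As an alternative sketch, and not part of the original answer, a StreamReader can also consume the preamble before the XML parser sees the text. The helper name LoadFromBytes is hypothetical, and the code assumes the content is UTF-8 as in the question.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

static class XmlBytesLoader
{
    // Hypothetical helper: parses XML held in a byte array that may or may not
    // start with a UTF-8 byte order mark.
    public static XDocument LoadFromBytes(byte[] fileContent)
    {
        using (var stream = new MemoryStream(fileContent))
        // The third argument asks StreamReader to detect and skip a byte order mark.
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return XDocument.Load(reader);
        }
    }
}
```

Either reader keeps the bytes out of an intermediate string, so the BOM never reaches XDocument as a character.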

edited Jan 03 '17 at 10:36

Ian Kemp

answered Jan 21 '10 at 18:04

Dave Cluderay

Note that the UTF-8 ‘pre-amble’ is a Microsoft invention that is not endorsed by any Unicode standard, unlike the normal UTF-16 BOMs. It should never be used on writing, though you will have to handle it on reading as you will often meet the pesky blighter in the wild. – bobince Jan 21 '10 at 22:09
@bobince - I agree (although it is allowed for by the Unicode standard, but its use is discouraged - see page 36 of http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf#G19273 for more information). – Dave Cluderay Jan 21 '10 at 22:38
I've amended the answer - see the last paragraph. – Dave Cluderay Jan 22 '10 at 08:39

stevehipwell · Accepted Answer · 2010-01-22T15:55:33.893

17

If you only have bytes, you could either load the bytes into a stream:

XmlDocument oXML;

using (MemoryStream oStream = new MemoryStream(oBytes))
{
  oXML = new XmlDocument();
  oXML.Load(oStream);
}

Or you could convert the bytes into a string (presuming that you know the encoding) before loading the XML:

string sXml;
XmlDocument oXml;

sXml = Encoding.UTF8.GetString(oBytes);
oXml = new XmlDocument();
oXml.LoadXml(sXml);

I've shown my example as .NET 2.0 compatible; if you're using .NET 3.5, you can use XDocument instead of XmlDocument.

Load the bytes into a stream:

XDocument oXML;

using (MemoryStream oStream = new MemoryStream(oBytes))
using (XmlTextReader oReader = new XmlTextReader(oStream))
{
  oXML = XDocument.Load(oReader);
}

Convert the bytes into a string:

string sXml;
XDocument oXml;

sXml = Encoding.UTF8.GetString(oBytes);
oXml = XDocument.Parse(sXml);
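If the bytes may start with a byte order mark, the string variant needs one extra step, because the decoded string then begins with U+FEFF. The following is a minimal sketch under that assumption; the class and method names are hypothetical, while the variable names follow the snippet above.

```csharp
using System.Text;
using System.Xml.Linq;

static class XmlStringParser
{
    // Hypothetical helper: decodes UTF-8 bytes and parses them as XML,
    // tolerating an optional leading byte order mark.
    public static XDocument ParseUtf8(byte[] oBytes)
    {
        string sXml = Encoding.UTF8.GetString(oBytes);

        // GetString keeps the BOM as the character U+FEFF; drop it if present.
        if (sXml.Length > 0 && sXml[0] == '\uFEFF')
        {
            sXml = sXml.Substring(1);
        }

        return XDocument.Parse(sXml);
    }
}
```

Checking only the first character leaves the rest of the document untouched, which is gentler than trimming everything up to the first '<'.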

edited Jan 22 '10 at 15:55

answered Jan 22 '10 at 09:12

stevehipwell

the problem is I need to use XDocument – agnieszka Jan 22 '10 at 14:46
@agnieszka - I've updated my answer to walk you through how to use the XDocument. – stevehipwell Jan 22 '10 at 15:56
The string has to be modified if the original `oBytes` contains a byte order mark sequence. I had to call `sXml = sXml.Substring(1);`, otherwise the error `Data at the root level is invalid. Line 1, position 1.` is thrown on `XDocument.Parse`. The BOM character is not visible, so it can be checked using `.WriteLine("first char '{0}'", sXml[0])` – oleksa Oct 22 '18 at 08:47

score 7 · Answer 3 · answered Jan 21 '10 at 18:00

Do you have a byte-order-mark (BOM) at the beginning of your XML, and does it match your encoding? If you chop out your header, you'll also chop out the BOM, and if the BOM was incorrect, subsequent parsing may then work.

You may need to inspect your document at the byte level to see the BOM.

answered Jan 21 '10 at 18:00

Brian Agnew

what is a byte-order-mark...? and how can I find out document's encoding? I just suspect it is utf-8 (read text is readable) – agnieszka Jan 21 '10 at 18:01
See the link I posted. It's a sequence of bytes *before* the header that acts as a directive to the encoding of the document. – Brian Agnew Jan 21 '10 at 18:02

score 7 · Answer 4 · answered Jan 21 '10 at 18:02

Why bother reading the file as a byte sequence and then converting it to a string when it is an XML file? Just let the framework do the loading for you and cope with the encodings:

var xml = XDocument.Load("test.xml");

answered Jan 21 '10 at 18:02

Darin Dimitrov

Because I don't get the xml from a path. I just have bytes content – agnieszka Jan 22 '10 at 07:46
And where are those bytes coming from? Database, network stream, ...? – Darin Dimitrov Jan 22 '10 at 09:55

score 2 · Answer 5 · edited Jul 29 '16 at 09:54

Try this:

// Forcefully skip anything before the first '<', such as a preamble/BOM character.
int startIndex = xmlString.IndexOf('<');
if (startIndex > 0)
{
    xmlString = xmlString.Remove(0, startIndex);
}

edited Jul 29 '16 at 09:54

Filburt

answered Jul 09 '13 at 15:38

eugene.sushilnikov

Would help if you explained that this was to forcefully skip the preamble/BOM. – binki Sep 13 '13 at 13:50

score 1 · Answer 6 · answered May 27 '21 at 09:55

I have also encountered this error because the source XML was a string that somehow contained some non-printable characters that seemed to break XmlDocument or XDocument parsing. Stripping them fixed the issue:

// Note: \p{C} also matches control characters such as tabs and line breaks.
string sanitized = Regex.Replace(part, @"\p{C}+", string.Empty);

Credit: C# regex to remove non - printable characters, and control characters, in a text that has a mix of many different languages, unicode letters
