(NB: the original question title was: What is the best way to load XML from a string with a document specification?)

I need to get the XML content of an ODT OpenDocument (LibreOffice) file into an XmlDocument object. The ODT is a zip archive, and I managed to extract the content.xml part as a byte array. Converting it to a string seems simple, but I was surprised to find that XmlDocument.LoadXml(string) does not accept a string that starts with an XML declaration line, like:

    <?xml version="1.0" encoding="UTF-8"?>
    <Offices id="0" enabled="false">
      <office />
    </Offices>

The exception is: Data at the root level is invalid. Line 1, position 1

I wonder if there is a library call to read such a string?

For now I use this function I improvised, but it feels unnecessarily complex to have to do stuff on the character level when handling xml documents:

    /// <summary>
    /// Convert an Xml document in a string, including document specification line(s),
    /// to an XmlDocument object
    /// </summary>
    /// <param name="XmlString"></param>
    /// <returns></returns>
    public static XmlDocument LoadXmlString(string XmlString)
    {
        XmlDocument XmlDoc = new XmlDocument();
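        // Skip everything up to the end of the XML declaration, if there is one.
        // IndexOf returns -1 when "?>" is absent; guard so the first character is not cut off.
        int declEnd = XmlString.IndexOf("?>", System.StringComparison.Ordinal);
        XmlDoc.LoadXml(declEnd < 0 ? XmlString : XmlString.Substring(declEnd + 2));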
        return XmlDoc;
    }

Is there a better way?

NB: I am aware of this earlier question, but it addresses the problem of parsing a string, and its solution is to convert the string to a byte array. In my case I should not be parsing a string at all: rather than converting the byte array to a string to begin with, I can skip that step and parse the byte array directly after unzipping the ODT.

Roland
  • It can. There is something wrong with your `xmlString`. I just tried your XML string in VS and it works – Habib Aug 21 '14 at 17:10
  • Have you tried Linq to XML? i.e. [XDocument.Parse](http://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument.parse(v=vs.110).aspx) – James Aug 21 '14 at 17:11
  • Not quite. The answer to the earlier question is fairly complicated, whereas the answer below is really useful to me. I would at least like the option to mark it as the answer or to comment on it, so please unlock this question. – Roland Aug 21 '14 at 17:22
  • @Roland, nobody has locked this question. You are better off deleting it, because there is nothing wrong with your code. I posted the sample in the answer section because I can't place that much code in the comments. – Habib Aug 21 '14 at 17:23

1 Answer

With the new, more precise question title, the answer can be very simple:

just convert the unzipped byte array to XML without converting to a string first.

Simple, and no risk of encoding issues.

The background is that the content.xml part of an ODT file is not a string, but an XML document stored as bytes. LibreOffice zipped the XML into the ODT archive without first converting it to a string. The unzipping function does not know what is in the zipped data; it just turns the compressed bytes into uncompressed bytes. The XmlDocument.Load() function does not need a string representation either: it works out from the data itself (the XML declaration at the start, and a byte order mark if one is present) which encoding to use when parsing the bytes as XML.
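
As a rough sketch of that idea, the following loads content.xml straight from the decompression stream, with no intermediate string or byte array. It assumes the built-in System.IO.Compression.ZipArchive is used for unzipping, which is my assumption and not necessarily the library used in the post; the class and method names are hypothetical. On .NET Framework this also needs a reference to the System.IO.Compression assembly.

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml;

public static class OdtContentLoader
{
    // Hypothetical helper: parses the "content.xml" entry of an ODT archive
    // directly from the zip entry's stream.
    public static XmlDocument LoadContentXml(Stream odtStream)
    {
        // leaveOpen: true so the caller keeps ownership of odtStream.
        using (var archive = new ZipArchive(odtStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("content.xml");
            if (entry == null)
                throw new InvalidDataException("The archive has no content.xml entry.");

            var doc = new XmlDocument();
            using (Stream entryStream = entry.Open())
            {
                // Load(Stream) determines the encoding from the bytes themselves.
                doc.Load(entryStream);
            }
            return doc;
        }
    }
}
```

A caller could open the file with File.OpenRead("document.odt") (a placeholder path) and pass the resulting stream to OdtContentLoader.LoadContentXml.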


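My original answer:

As I learned from the (deleted) post of Donal, the stated reason for the failure is that .NET strings are encoded as UTF-16 while the declaration specifies UTF-8. (As Habib's comment on the question shows, LoadXml() does accept the sample string with its encoding="UTF-8" declaration, so a more likely culprit is a byte order mark at the start of the byte array, which GetString() keeps as an invisible leading U+FEFF character; that would also explain "Line 1, position 1".) Either way, as I actually started from a byte array, I should NOT try to make a string with:

    string s = Encoding.UTF8.GetString(Bytes);

because the resulting string is not accepted by LoadXml().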

Instead I need Donal's solution code, simplified to:

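    // Requires: using System.IO; using System.Xml;
    public XmlDocument GetEntryXmlDoc(byte[] Bytes)
    {
        XmlDocument xmlDoc = new XmlDocument();
        // Load(Stream) picks up the encoding from the bytes, so no string conversion is needed.
        using (MemoryStream ms = new MemoryStream(Bytes))
        {
            xmlDoc.Load(ms);
        }
        return xmlDoc;
    }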
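
To compare the two routes side by side, here is a small sketch. It assumes, purely for illustration, that the byte array starts with a UTF-8 byte order mark; whether the actual content.xml bytes did is not known from the post. The class and method names are hypothetical.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class BomDemo
{
    // Illustrative input: UTF-8 byte order mark (EF BB BF) followed by the XML text.
    public static byte[] SampleBytesWithBom()
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] xml = Encoding.UTF8.GetBytes(
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
            "<Offices id=\"0\" enabled=\"false\"><office /></Offices>");
        byte[] all = new byte[bom.Length + xml.Length];
        bom.CopyTo(all, 0);
        xml.CopyTo(all, bom.Length);
        return all;
    }

    // Returns true if LoadXml accepts the string, false if it throws XmlException.
    public static bool TryLoadXml(string text)
    {
        try
        {
            new XmlDocument().LoadXml(text);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // The byte route: the stream overload handles encoding detection itself.
    public static XmlDocument LoadFromBytes(byte[] bytes)
    {
        var doc = new XmlDocument();
        using (var ms = new MemoryStream(bytes))
        {
            doc.Load(ms);
        }
        return doc;
    }
}
```

With these bytes, Encoding.UTF8.GetString(bytes) yields a string whose first character is U+FEFF, and TryLoadXml on that string should report failure, while LoadFromBytes parses the same data without complaint. Trimming the leading U+FEFF from the string is another way out, but loading the bytes directly avoids the question altogether.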

I would have liked to refer to the earlier post mentioned by others, but I could not easily find the answer to my problem there. That is my own fault, partly out of impatience, because I had just found the answer here.

Roland
  • Your answer would improve if you enclosed `MemoryStream` in a `using` block – Habib Aug 21 '14 at 17:45
  • @Habib Agreed, but that is perhaps off-topic. The issue of this question is the encoding issue, which is invisible in my original code posting. Donal's solution does not encode anything, just converts bytes to XML, where the bytes include the encoding details in the xml document specification line. – Roland Aug 21 '14 at 17:51
  • I can perhaps simplify one step more, because I obtain the bytes from a Stream from the CSharpZipLib, which I could input to Load() directly, but that is off-topic. – Roland Aug 21 '14 at 17:56
  • I just found that I can even simplify one step more, because I obtain the bytes from a Stream from the CSharpZipLib, which I could input to Load() directly. I am now even more grateful for Donal's (deleted) contribution. – Roland Aug 21 '14 at 18:08
  • That answer was taken from http://stackoverflow.com/questions/310669/why-does-c-sharp-xmldocument-loadxmlstring-fail-when-an-xml-header-is-included – Habib Aug 21 '14 at 18:10
  • It is a duplicate and I have decided to close it as dupe – Habib Aug 21 '14 at 18:11
  • @Habib Are you the same as "Donal"? Or can you close the answers of other posters? – Roland Aug 21 '14 at 18:30
  • No, I am not `Donal`, but I can see his/her answer and the comment under it. I can't close the answer; I can only vote to close questions. – Habib Aug 21 '14 at 18:34