I have the following XML (test example):

<?xml version="1.0" encoding="UTF-8"?><?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" >
<Styles>
<Style ss:ID="s21"><NumberFormat ss:Format="@"/></Style>
</Styles>
<Worksheet ss:Name="--">
<Table ss:ExpandedColumnCount="1" ss:ExpandedRowCount="1" x:FullColumns="1" x:FullRows="1" ss:StyleID="s21">
    <Column ss:StyleID="s21" ss:Width="184"/>
    <Row>   
        <Cell><ss:Data ss:Type="String">42</Data></Cell>
</Row></Table></Worksheet></Workbook>

When I try to read the file using `DataSet.ReadXml()`, the following exception is thrown: "The 'ss:Data' start tag on line 12 position 14 does not match the end tag of 'Data'. Line 12, position 43."
While all examples in W3C documentation show namespace-qualified end tags, MS Excel opens such file without any warnings.

Setting DataSet.Namespace = "ss"; doesn't change anything.

What can be done to read such file, preferably without adding extra libraries?

kjhughes
Abstraction
  • Should be: `</ss:Data>` – jdweng Aug 08 '17 at 16:39
  • @jdweng Yes, it seems that perfect XML shouldn't be like this. My question is: given this XML, without any freedom to change it, how should I parse it? Rewriting the text so that every end tag matches its start tag will probably work, but I wonder if there is an "easier" way. – Abstraction Aug 08 '17 at 16:51
  • @Abstraction: It's not just "perfect" XML that shouldn't be like this. *Any* XML *cannot* be like this, else it's not XML. – kjhughes Aug 08 '17 at 16:55

1 Answer

Yes, XML end tags must match XML start tags exactly, including any namespace prefixes.

From your question:

What can be done to read such file, preferably without adding extra libraries?

The XML must be repaired to be well-formed if it's to be parsed successfully using compliant XML tools. In particular, you must change the end tag to `</ss:Data>`, as @jdweng suggests in the comments.
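If the file really cannot be fixed at its source, one option is to repair the text before handing it to a parser. The following is only a minimal sketch in Python, not a .NET recipe: the helper names `repair_data_end_tags` and `cell_values` are hypothetical, the sample is a trimmed-down copy of the file from the question, and the repair assumes that every unprefixed `</Data>` end tag closes an element opened as `<ss:Data>`.

```python
import re
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"

# Trimmed-down version of the file from the question, with the same defect:
# the element is opened as <ss:Data> but closed as </Data>.
BROKEN = (
    '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" '
    'xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">'
    '<Worksheet ss:Name="--"><Table><Row>'
    '<Cell><ss:Data ss:Type="String">42</Data></Cell>'
    '</Row></Table></Worksheet></Workbook>'
)


def repair_data_end_tags(text):
    """Rewrite unprefixed </Data> end tags as </ss:Data>.

    Assumption: every Data element in the file is opened as <ss:Data>,
    so a plain text substitution cannot damage a correctly closed element.
    """
    return re.sub(r"</Data\s*>", "</ss:Data>", text)


def cell_values(text):
    """Repair the text, parse it, and return the text of every Data element."""
    root = ET.fromstring(repair_data_end_tags(text))
    return [data.text for data in root.iter("{%s}Data" % SS)]


print(cell_values(BROKEN))  # ['42']
```

The same pre-processing idea carries over to other platforms: read the file as text, apply the substitution, and only then pass the result to the XML reader. A blanket substitution like this is fragile, so it is worth checking that the file contains no legitimately unprefixed `Data` elements first.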

Per the W3C XML Recommendation, section 3.1:

[Definition: The end of every element that begins with a start-tag must be marked by an end-tag containing a name that echoes the element's type as given in the start-tag:]
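To see this rule enforced by a conforming parser, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. The helper name `is_well_formed` and the two one-line samples are made up for illustration; they reproduce only the `Cell` element from the question.

```python
import xml.etree.ElementTree as ET

NS_DECL = 'xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"'

# Same start tag in both samples; only the end tag differs.
mismatched = '<Cell %s><ss:Data ss:Type="String">42</Data></Cell>' % NS_DECL
matched = '<Cell %s><ss:Data ss:Type="String">42</ss:Data></Cell>' % NS_DECL


def is_well_formed(text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(mismatched))  # False
print(is_well_formed(matched))     # True
```

The end-tag name is compared with the start-tag name as written, prefix included, so it does not help that the default namespace and the `ss` prefix map to the same URI in the original file.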

From your question:

While all examples in W3C documentation show namespace-qualified end tags, MS Excel opens such file without any warnings.

Then MS Excel isn't processing the XML in a compliant manner and may well be missing other issues.

See also How to parse invalid (bad / not well-formed) XML?

kjhughes
  • Thank you. Mainly I was afraid that I was missing something like an `XmlReadMode` value while the file was actually well-formed XML. Now I evidently have to use option 3 of your answer to the linked question. Is there a direct quote that start and end tags must be either both prefixed or both unprefixed? The W3C page - https://www.w3.org/TR/REC-xml-names/ - gives the impression that one can expand `STag ETag` into `'<' QName (S Attribute)* S? '>' '</' QName S? '>'` and then into `'<' PrefixedName '>' '</' UnprefixedName '>'`. – Abstraction Aug 08 '17 at 17:19
  • Answer updated to show where in the XML Recommendation it says that end tag names must match start tag names. – kjhughes Aug 08 '17 at 17:28