I have this code for looping over reports in an XML file:

Dim xmlr As XDocument = XDocument.Load("Myfile.xml")
For Each report As XElement In xmlr.Descendants("Report")
    'Do stuff with report values
Next

This works, but I get an error if the file contains characters like ÅÄÖ. The XML document declares its encoding as UTF-8:

<?xml version="1.0" encoding="utf-8"?>

I found this post here, and tried with this code instead, but it does not help;

Dim xmlr As XDocument

Using oReader As StreamReader = New StreamReader("Myfile.xml", Encoding.GetEncoding("UTF-8"))
    xmlr = XDocument.Load(oReader)
End Using

Any suggestions?

gubbfett
  • Which error do you get exactly? If the document says it is UTF-8 in its declaration and it is really UTF-8 encoded then the XML parser won't complain. If you get an error then the umlauts are not properly UTF-8 encoded. So the workaround would be rather to try to use `Using oReader As StreamReader = New StreamReader("Myfile.xml", Encoding.Default)` which is usually an 8-bit Windows codepage like Windows-1252. – Martin Honnen Jun 11 '13 at 16:28
  • Of course the right approach is not to fix the parsing step but rather to make sure the code producing the XML is fixed to properly encode the document. – Martin Honnen Jun 11 '13 at 16:33
  • Try to parse with `New StreamReader("Myfile.xml", Encoding.Default)` or `New StreamReader("Myfile.xml", Encoding.GetEncoding(1252))`. Or fix the code producing the XML. – Martin Honnen Jun 11 '13 at 16:38
  • I asked about the XML file given to me, and the XML file is produced by another program where only these options are available: "Text OEM(DOS)", "Text ANSI (Windows)", "Text EBCDIC", "PDF". I asked and it's set to "ANSI". Encoding.Default and Encoding.GetEncoding(1252) give the same result. – gubbfett Jun 11 '13 at 16:40
  • Then check the file again at that line and position. I don't think the error is about the character, but rather about entity references. Does the file have a `<!DOCTYPE>` node at the beginning? Are there any entity references `&foo;` in the XML document? Make sure you look at the file with a raw text editor or the view-source option in the browser menu; don't rely on the browser display, as it might hide details. – Martin Honnen Jun 11 '13 at 16:56
  • I even tried to re-save the file from Notepad as UTF-8, and I still get this error. The XML looks like this ´ ´ – gubbfett Jun 11 '13 at 16:59
  • that is, no doctype... – gubbfett Jun 11 '13 at 16:59
  • I'll be damned, there's a "&". It actually has this value; "äöåü&'óâ-". It can't take & ? – gubbfett Jun 11 '13 at 17:02
  • Well, with XML any ampersand to be included literally needs to be escaped as `&amp;` because otherwise an ampersand starts a character or entity reference. You won't get far with using an XML parser to read that export; it is not XML if it fails to follow the well-formedness rules defined in the XML specification. – Martin Honnen Jun 11 '13 at 17:08
  • OK, so is there a way to load this file and tell it to escape this char on .Load, or do I have to loop through this file line by line first and escape/remove all ampersands? (As far as I know, the program that generates the XML has no option to escape this itself, and they have some names like "Father & Son Inc.", so there will be ampersands.) – gubbfett Jun 11 '13 at 17:13
  • The .NET framework has tools to read and write XML in System.Xml (and below). Your file is not XML and the classes and APIs in System.Xml won't help with that file format, other than telling you where the errors are. – Martin Honnen Jun 11 '13 at 17:25
  • Well, that explains it. Make it as an answer and i set it as the solution. :-) Thank you – gubbfett Jun 11 '13 at 17:37

1 Answer

Based on your comments, the input document you are trying to process is not well-formed XML, as it has unescaped ampersands (&) in element or attribute content. Because the ampersand in XML syntax starts a character reference (e.g. &#160;) or an entity reference (e.g. &lt;), it has to be escaped as &amp; if it should appear literally in content (e.g. <foo>a &amp; b</foo>). An alternative is a CDATA section: <foo><![CDATA[a & b]]></foo>.
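To make the three cases concrete, here is a minimal sketch (the module name `AmpersandDemo` is made up for illustration) that parses the escaped form, the CDATA form, and the bare ampersand with `XElement.Parse`:

```vb
Imports System
Imports System.Xml
Imports System.Xml.Linq

Module AmpersandDemo
    Sub Main()
        ' Escaped ampersand: well-formed; the element value is the literal text "a & b".
        Dim escaped As XElement = XElement.Parse("<foo>a &amp; b</foo>")
        Console.WriteLine(escaped.Value)

        ' CDATA section: the ampersand needs no escaping inside it.
        Dim cdata As XElement = XElement.Parse("<foo><![CDATA[a & b]]></foo>")
        Console.WriteLine(cdata.Value)

        ' Bare ampersand: not well-formed, so the parser throws an XmlException.
        Try
            XElement.Parse("<foo>a & b</foo>")
        Catch ex As XmlException
            Console.WriteLine("Not well-formed: " & ex.Message)
        End Try
    End Sub
End Module
```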

So the .NET framework's XML parser is doing the right thing by telling you that the input you are trying to parse is not well-formed XML and where the error is. That is all the APIs in System.Xml and below can do: they read and write well-formed XML. There is no API that tries to correct errors.
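If the exporting program cannot be fixed, the repair has to happen as plain text before the parser sees the file. The following is only a rough sketch of that idea, not a System.Xml feature: the names `LenientLoad`, `EscapeBareAmpersands` and `LoadLenient` are made up here, and the regular expression assumes that every ampersand which does not already start something shaped like a reference (`&name;`, `&#160;`, `&#xA0;`) is meant literally.

```vb
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions
Imports System.Xml.Linq

Module LenientLoad
    ' Hypothetical helper, not part of System.Xml.
    ' Matches "&" unless it already starts something shaped like a reference,
    ' e.g. &amp; or &#160; or &#xA0;
    Private ReadOnly BareAmpersand As New Regex("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

    Function EscapeBareAmpersands(text As String) As String
        Return BareAmpersand.Replace(text, "&amp;")
    End Function

    ' Reads the file as text in the given encoding, repairs it, then parses it.
    Function LoadLenient(path As String, enc As Encoding) As XDocument
        Dim raw As String = File.ReadAllText(path, enc)
        Return XDocument.Parse(EscapeBareAmpersands(raw))
    End Function
End Module
```

It could be called as `LoadLenient("Myfile.xml", Encoding.Default)`, passing whichever encoding the file really uses. The sketch has known limits: it would also rewrite ampersands inside CDATA sections, and it leaves undefined references such as `&foo;` untouched, so those still fail in the parser. It also does nothing about other well-formedness errors.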

Martin Honnen