I'm receiving data via an XML API and it's returning a node like the following:

<?xml version='1.0' encoding='utf-8' ?>

<location>
  <name>&Oslash;L Shop</name>
</location>

I have no control over the response, but when I try to load it into an XDocument, the load fails because of the invalid character.

Is there anything I can do to make this load properly? I want to keep the solution as general as possible, because other invalid characters may exist.

Thoughts?

aherrick
  • Honestly, you should ask the producer of the XML file to generate a valid XML file. You may succeed in patching the input, but this is not a viable solution. – Steve B Apr 22 '13 at 14:06
  • I agree. The encoding used is valid only in HTML, not in an XML file. This character should be encoded as, e.g., `&#216;`. – Tim S. Apr 22 '13 at 14:11
  • @SteveB I agree that the *real* solution here is to get the response fixed. However, I wouldn't go as far as saying it isn't a viable solution. It's pretty easy to unescape any invalid characters from the response before processing. In the future, **if** the 3rd party does fix the problem it just becomes a sanity check. It's also, technically, future proofing as they could also re-introduce issues which that check would catch. – James Apr 22 '13 at 14:11
  • It's pretty amazing that after so many years people *still* think they can produce valid XML when they really create text output that *looks* like XML... – Thorsten Dittmar Apr 22 '13 at 14:13
  • If the only invalid text will be HTML encodings where XML encodings should be used, perhaps you can search for those and replace them with the valid equivalents? – Tim S. Apr 22 '13 at 14:13

3 Answers

You can use an HTML parser, which is more tolerant of invalid input. For example, using HtmlAgilityPack, this code works without any problem:

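// Requires the HtmlAgilityPack package and `using System.Linq;` for First().
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(xml); // xml holds the raw response text
var name = doc.DocumentNode.Descendants("name").First().InnerText;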
I4V

You can't use a bare "&" symbol in the input text for XDocument.Parse. Replace it with "&amp;", like this:

<?xml version='1.0' encoding='utf-8' ?>

<location>
  <name>&amp;Oslash;L Shop</name>
</location>
Alex
  • This is probably not the correct result. I'd expect it should've been `&#216;L Shop` (216 is the decimal Unicode value for `Ø`, which HTML-encoded is `&Oslash;`) – Tim S. Apr 22 '13 at 14:16
  • `&Oslash;` is still an unknown entity in the XML specification – Steve B Apr 22 '13 at 14:45
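
To make the difference discussed in these comments concrete, here is a small sketch comparing how XDocument treats the three spellings. The class and method names (`EntityDemo`, `NameValue`, `FailsToParse`) are made up for illustration and are not part of any answer.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class EntityDemo
{
    // Parses a single-element document and returns the element's text value.
    public static string NameValue(string xml)
    {
        return XDocument.Parse(xml).Root.Value;
    }

    // Returns true when XDocument.Parse rejects the input with an XmlException.
    public static bool FailsToParse(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return false;
        }
        catch (XmlException)
        {
            return true;
        }
    }
}
```

With these helpers, `NameValue("<name>&#216;L Shop</name>")` gives `ØL Shop`, while `NameValue("<name>&amp;Oslash;L Shop</name>")` gives the literal text `&Oslash;L Shop`, and `FailsToParse("<name>&Oslash;L Shop</name>")` is true. So escaping the ampersand makes the document load, but the value is probably not what the producer meant.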

Why not just escape any invalid XML characters before you load the response into an XDocument? You could use a regex for this; it should be relatively straightforward.

See escape invalid XML characters in C#
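
As a rough sketch of that idea (not the code from the linked question), the following rewrites HTML named entities as numeric character references before parsing. It assumes `System.Net.WebUtility.HtmlDecode` recognizes the entity names the API emits; the `HtmlEntityFixer` class and its method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class HtmlEntityFixer
{
    // The five entities XML predefines; they are already valid and are left alone.
    private static readonly string[] XmlPredefined = { "amp", "lt", "gt", "quot", "apos" };

    // Matches named references such as &Oslash; (numeric ones like &#216; do not match).
    private static readonly Regex NamedEntity = new Regex("&([A-Za-z][A-Za-z0-9]*);");

    public static string ReplaceHtmlEntities(string xml)
    {
        return NamedEntity.Replace(xml, match =>
        {
            if (Array.IndexOf(XmlPredefined, match.Groups[1].Value) >= 0)
                return match.Value;

            string decoded = WebUtility.HtmlDecode(match.Value);
            if (decoded == match.Value)
                return match.Value; // not a name HtmlDecode knows; left untouched

            // Emit each decoded code point as a numeric character reference.
            var sb = new StringBuilder();
            for (int i = 0; i < decoded.Length; i++)
            {
                int codePoint = char.ConvertToUtf32(decoded, i);
                if (char.IsHighSurrogate(decoded[i])) i++; // skip the low surrogate
                sb.Append("&#").Append(codePoint).Append(';');
            }
            return sb.ToString();
        });
    }

    // Convenience wrapper: fix the entities, then parse.
    public static XDocument Load(string xml)
    {
        return XDocument.Parse(ReplaceHtmlEntities(xml));
    }
}
```

Calling `HtmlEntityFixer.Load(responseText)` on the sample response should yield a `name` element whose value is `ØL Shop`. Because this is a plain text substitution, it would also rewrite entity-like text inside CDATA sections or comments, and a stray `&` that is not part of an entity would still make the parse fail.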

James