2

I am working with an API and, for some crazy reason, the XML being returned has & characters that are not correctly escaped. This has left me in an annoying position: I get an exception when I try to use an XmlDocument to parse the XML string.

I can use replace to get rid of the characters, but this could lead to issues.

xml = xml.Replace("&amp;", "&").Replace("&", "&amp;");

The problem with this is that the XML may already contain other escaped values. A node like this will break the line of code above.

<node>Something & something &lt; annoying</node>

If I replace the & characters with &amp;, it will break &lt;. I can't use the same approach for &lt; as I did for &amp;, because it would also convert all of the < and > brackets that make up the tags themselves.
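To make the failure concrete, here is a minimal sketch of the double-escaping described above. The class and method names (`DoubleEscapeDemo`, `NaiveReplace`) are made up for illustration; only `string.Replace` comes from the question.

```csharp
using System;

public static class DoubleEscapeDemo
{
    // Hypothetical helper: unescape existing &amp; first, then escape every ampersand.
    public static string NaiveReplace(string xml)
    {
        return xml.Replace("&amp;", "&").Replace("&", "&amp;");
    }

    public static void Main()
    {
        string xml = "<node>Something & something &lt; annoying</node>";

        // The bare ampersand is fixed, but the one that starts &lt; is escaped too.
        Console.WriteLine(NaiveReplace(xml));
        // prints: <node>Something &amp; something &amp;lt; annoying</node>
    }
}
```

The result parses, but the text now reads "&lt;" literally instead of "<", so the data is silently changed.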

Here is a node that is giving trouble.

<CompanyName>Fire & Ice</CompanyName>
Dan Hastings

2 Answers

4

You can use a regex similar to the one in this related question. It essentially matches all unescaped ampersands (i.e. it will match &, but not &something;).

var xml = @"<node>Something & something &lt; annoying</node>";

var result = Regex.Replace(xml, @"&(?!\w*;)", "&amp;");

// output: <node>Something &amp; something &lt; annoying</node>
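As a hedged variation on the pattern above, the sketch below also leaves numeric character references such as &#38; and &#x26; alone (the original `\w*` lookahead would escape those a second time, because `#` is not a word character), and then loads the result into an XmlDocument to confirm it parses. The class and helper names are assumptions for illustration, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class EscapeAmpersands
{
    // Hypothetical helper: escape '&' unless it already starts a named (&lt;),
    // decimal (&#38;) or hexadecimal (&#x26;) reference.
    public static string EscapeBareAmpersands(string xml)
    {
        return Regex.Replace(xml, @"&(?!(?:\w+|#\d+|#x[0-9a-fA-F]+);)", "&amp;");
    }

    public static void Main()
    {
        string xml = "<CompanyName>Fire & Ice &#38; Co &lt;Ltd&gt;</CompanyName>";
        string fixedXml = EscapeBareAmpersands(xml);
        Console.WriteLine(fixedXml);
        // prints: <CompanyName>Fire &amp; Ice &#38; Co &lt;Ltd&gt;</CompanyName>

        // The repaired string now loads without an XmlException.
        var doc = new XmlDocument();
        doc.LoadXml(fixedXml);
        Console.WriteLine(doc.DocumentElement.InnerText);
        // prints: Fire & Ice & Co <Ltd>
    }
}
```

One limitation to keep in mind: text such as `AT&T; more` looks like an entity reference to any pattern of this kind, so it is left untouched and the parser will still reject it as an undeclared entity.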
Charles Mager
  • That does not work in all scenarios. Consider this: `&hello` will not be replaced with `&amp;hello` – fahadash Jun 28 '16 at 09:43
  • @fahadash yes, it will. The negative lookahead requires a word and a semi-colon, and `&hello` doesn't match that. – Charles Mager Jun 28 '16 at 09:44
  • @fahadash I'm not sure what you're saying. It's the *negative* lookahead that doesn't get matched, so it *will* be replaced. `&hello` will be replaced by `&amp;hello`. The former is invalid, so surely this is what is expected? – Charles Mager Jun 28 '16 at 10:21
  • Sorry, I meant to say `&hello;` (with the semicolon). We do want it to be `&amp;hello;`, although your solution is very close and might suffice for the OP. +1 – fahadash Jun 28 '16 at 10:42
  • @fahadash `&hello;` is a valid XML entity reference that can be handled as such by a DTD or tooling. I'm not sure you could say for sure you'd want it to be replaced. – Charles Mager Jun 28 '16 at 10:46
  • A DTD does not specify which characters are valid and which aren't. The XML specification does, and & should always be represented in the form of `&amp;`, regardless of whether or not modern parsers can handle it. Source: Chapter 1 of https://amzn.com/1118162137 – fahadash Jun 28 '16 at 10:50
  • @fahadash I'm not talking about *characters*, I'm talking about *entity references*. `&hello;` is a valid *entity reference*, and a DTD can specify what that maps to. See [this](http://www.w3schools.com/xml/xml_dtd_entities.asp) for some examples. – Charles Mager Jun 28 '16 at 10:54
-1

I recommend XElement. It is a useful object: XElement.Value will return the string that you want.

using System.Xml.Linq;
XElement y = new XElement("CompanyNames",
                new XElement("CompanyName", "Fire & Ice")
                );
foreach (var item in y.Elements("CompanyName"))
{
   Console.WriteLine(item.Value);
}  

Output will be "Fire & Ice"
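For comparison, here is a small sketch of how XElement behaves in the two directions. It assumes the input arrives as a raw string, as in the question; the class and method names are made up for illustration. Building an element from a string value escapes the ampersand on serialization, while parsing raw text with a bare ampersand throws just as XmlDocument does.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XElementDemo
{
    // Hypothetical helper: returns true when the raw text parses as XML.
    public static bool CanParse(string rawXml)
    {
        try
        {
            XElement.Parse(rawXml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Built from a string value: the '&' is escaped when the element is serialized.
        var built = new XElement("CompanyName", "Fire & Ice");
        Console.WriteLine(built.ToString()); // prints: <CompanyName>Fire &amp; Ice</CompanyName>
        Console.WriteLine(built.Value);      // prints: Fire & Ice

        // Raw text from the API still contains the bare '&', so parsing fails.
        Console.WriteLine(CanParse("<CompanyName>Fire & Ice</CompanyName>")); // prints: False
    }
}
```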

Mehmet