2

Regular expression to match ">", "<", "&" chars that appear inside XML nodes

I have an almost identical problem to this - however, I am using C#.

I'm not here to argue the validity of the XML.

What gets sent in is out of my control.

Input XML:

<PNODE>
  <CNODE>This string contains > and < and & chars.</CNODE>
</PNODE>

I need it to look like this:

<PNODE>
  <CNODE>This string contains &gt; and &lt; and &amp; chars.</CNODE>
</PNODE>

It looks like the guy found a solution for PHP, which doesn't help me.

I need to find a way to escape the &, > and < characters inside the node, but leave the tag declarations alone.

user234702
  • In addition - since there seems to be some confusion: `Replace`, `HtmlEncode`, `SecurityElement.Escape`, `XmlTextWriter`, etc. - none of these common methods will work, as they replace the node declaration tags as well. – user234702 Sep 15 '10 at 18:56

6 Answers

1

Check out Tidy.Net. It's a .Net implementation of Tidy.

Nathan Wheeler
  • While I'm sure I could find a .Net Tidy solution, which would hopefully work - is there another way to do this that doesn't require the Tidy Library or any other third party additions? – user234702 Sep 15 '10 at 18:59
  • You could build your own implementation. This poorly formatted XML won't load into an XmlDocument, so you'd have to load it as a string, and automate a process to determine which items should be replaced, and which are a valid part of the document. Theoretically, any & could be replaced, and any > without a < before it, and any < with another < after it. Where any parser is going to run into problems will be cases like: `This string contains and & chars.` where the `` would be missed. From that you MIGHT be able to catch that the isn't closed... – Nathan Wheeler Sep 15 '10 at 19:07
  • 1
    Simply put, I wouldn't try to reinvent the wheel. – Nathan Wheeler Sep 15 '10 at 19:08
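For anyone who still wants to avoid a third-party library, the heuristic described in the comments above can be roughed out with a regular expression. This is only a minimal sketch under stated assumptions, not a robust parser: `LooseXmlEscaper`, `EscapeTextContent`, and the `TagLike` pattern are hypothetical names made up for illustration. It assumes that anything shaped like `<name ...>`, `</name>`, or `<name/>` is a genuine tag and that everything else is text. Text that happens to look like a tag is therefore left unescaped, and entities that are already escaped (such as `&amp;`) will be escaped a second time.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class LooseXmlEscaper
{
    // Assumption: anything shaped like an opening, closing, or self-closing
    // tag is a real tag. A '<' followed by a space or a digit is not matched.
    static readonly Regex TagLike =
        new Regex(@"</?[A-Za-z_][\w.:-]*(?:\s[^<>]*)?/?>");

    // Copies tag-like matches through unchanged and escapes the text between them.
    public static string EscapeTextContent(string input)
    {
        var output = new StringBuilder();
        int position = 0;

        foreach (Match tag in TagLike.Matches(input))
        {
            // Text between the previous tag and this one gets escaped.
            output.Append(EscapeText(input.Substring(position, tag.Index - position)));
            output.Append(tag.Value);
            position = tag.Index + tag.Length;
        }

        // Trailing text after the last tag.
        output.Append(EscapeText(input.Substring(position)));
        return output.ToString();
    }

    // '&' must be replaced first so the entities added afterwards are not re-escaped.
    static string EscapeText(string text)
    {
        return text.Replace("&", "&amp;")
                   .Replace("<", "&lt;")
                   .Replace(">", "&gt;");
    }
}
```

Calling `LooseXmlEscaper.EscapeTextContent("<CNODE>This string contains > and < and & chars.</CNODE>")` returns `<CNODE>This string contains &gt; and &lt; and &amp; chars.</CNODE>`. The ambiguous cases mentioned in the comments above are exactly where this approach gives out, which is the argument for Tidy.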
0

There are a couple of .NET wrappers around the Tidy library:

http://users.rcn.com/creitzel/tidy.html#dotnet

http://www.codeproject.com/KB/mcpp/eftidynet.aspx

There is also a .NET port of Tidy.

Kevin LaBranche
  • While I'm sure I could find a .Net Tidy solution, which would hopefully work - is there another way to do this that doesn't require the Tidy Library or any other third party additions? – user234702 Sep 15 '10 at 18:49
  • As mentioned by @md5sum, although it's possible to do on your own, why reinvent the wheel? Do you have an underlying requirement that prevents you from using a third-party library or solution, especially one like Tidy.Net, which is open source? – Kevin LaBranche Sep 15 '10 at 19:26
0

Use HttpUtility:

HttpUtility.HtmlEncode("<text to Encode>");
mledbetter
0

You should have a look at SgmlReader:

http://developer.mindtouch.com/SgmlReader

It will give you exactly what you want :) I use it here: http://www.xmltools.dk/HtmlToXml. Try it :) (You can disable the html tag and the uppercase-to-lowercase tag conversion.)

Lasse Espeholt
0

I've always just used replace for XML (saves me having to bring in HTTP libraries):

string output = inputXml.Replace("&", "&amp;")     // must come first
                        .Replace("<", "&lt;")
                        .Replace(">", "&gt;")
                        .Replace("'", "&apos;")     // optional
                        .Replace("\"", "&quot;");   // optional
Zippit
  • If he were going to do this he should probably just use `HttpUtility.HtmlEncode`, but it's already been established that this won't work... – Abe Miessler Sep 15 '10 at 18:46
  • but this will affect your nodes as well. I missed that part of your question. – Zippit Sep 15 '10 at 18:48
  • inputXml is the *entire* xml. This will replace perfectly good XML along with the unwanted characters in the content. – Anthony Pegram Sep 15 '10 at 18:48
0

I'm not here to argue the validity of the XML.

As with that other question, the right answer is that what you got sent is not XML. It's a question of well-formedness, not a question of validity in the XML sense.
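The well-formedness point can be checked directly from C#. The following is a minimal sketch, assuming `System.Xml.Linq` is available; `WellFormedness` and `IsWellFormed` are hypothetical names used only for illustration. A conforming parser rejects a raw `<` or `&` in text content, while a lone `>` is tolerated.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class WellFormedness
{
    // Returns true only if the string parses as a well-formed XML document.
    public static bool IsWellFormed(string candidate)
    {
        try
        {
            XDocument.Parse(candidate);
            return true;
        }
        catch (XmlException)
        {
            // The parser throws XmlException on the first well-formedness error.
            return false;
        }
    }
}
```

With this helper, the input from the question is reported as not well-formed, and the desired output is reported as well-formed.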

What gets sent in is out of my control.

That may be true, but if someone sent you a quart of used motor oil and asked you to transform it into HTML, would you still accept it? Usually data interchange is done based on a contract (formal or informal), that the interchanged data will adhere to certain criteria. If it doesn't live up to the agreed-upon criteria, the data can be sent back, rejected.

If you're not requiring XML as input, this question is not about "<, & chars that appear inside XML nodes". Rather, it's about parsing SGML that looks a lot like XML, but which has < and & chars that appear in text content.

And to do that, Tidy.Net and SgmlReader are good solutions, as others have said.

LarsH