I have an XHTML string in which I want to replace the content of certain tags, for example:

<span tag="x">FOO</span> 
<span tag="y"> <b>bar</b> some random text <span>another span</span> </span>

I want to be able to find tag="x" and replace FOO with my own content, and find tag="y" and replace all of its inner content with my own content.

What is the best way to do this? I am thinking regex is definitely out of the question. Can XPath do this, or is it only for searching and not for manipulation?

mechanical_meat
Daveo

2 Answers

If you're sure the content is XHTML (i.e. well-formed XML) then XPath can certainly do it.

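// Requires: using System.Xml;
var doc = new XmlDocument();
doc.LoadXml("<span tag=..."); // must be well-formed XML with a single root element

// Select spans by attribute value: note the @ prefix and the quoted literal.
foreach (XmlNode node in doc.SelectNodes("//span[@tag='x']"))
{
    node.InnerXml = "New Content";
}
foreach (XmlNode node in doc.SelectNodes("//span[@tag='y']"))
{
    node.InnerXml = "Different Content";
}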
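
For completeness, here is a self-contained sketch of the same idea applied to the two-span fragment from the question. The `<root>` wrapper, the class name `XPathReplaceDemo`, and the helper `ReplaceSpanContent` are assumptions added for illustration. The wrapper is there because `LoadXml` needs exactly one top-level element. The tag value is pasted straight into the XPath expression, so this version is only suitable for trusted values without quote characters.

```csharp
using System;
using System.Xml;

public static class XPathReplaceDemo
{
    // Replaces the inner content of every <span> whose tag attribute equals tagValue.
    public static string ReplaceSpanContent(string xhtml, string tagValue, string newInnerXml)
    {
        var doc = new XmlDocument();
        // Hypothetical wrapper element: LoadXml requires a single root.
        doc.LoadXml("<root>" + xhtml + "</root>");

        foreach (XmlNode node in doc.SelectNodes("//span[@tag='" + tagValue + "']"))
        {
            // InnerXml is parsed as markup; use InnerText to insert escaped plain text.
            node.InnerXml = newInnerXml;
        }

        // Return the fragment without the wrapper element.
        return doc.DocumentElement.InnerXml;
    }

    public static void Main()
    {
        string input = "<span tag=\"x\">FOO</span>"
                     + "<span tag=\"y\"> <b>bar</b> some random text <span>another span</span> </span>";

        string output = ReplaceSpanContent(input, "x", "New Content");
        output = ReplaceSpanContent(output, "y", "<i>Different</i> Content");
        Console.WriteLine(output);
    }
}
```

Setting `InnerXml` throws an `XmlException` if the replacement string is not itself well-formed, which is worth catching when the new content comes from users.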
Dean Harding

You can surely do this using regular expressions (it is string manipulation, after all), but that may get a bit nasty, because HTML can be quite complicated. However, it is certainly a possible approach.

An alternative would be to parse the XHTML page into some structured hierarchy and then do the processing. The question is whether the pages are really valid XML. The XHTML specification requires that, but if you pick a random page from the internet that claims to be XHTML, you may run into trouble.

  • If no, then you need to parse them as HTML, which can be done using Html Agility Pack.
  • If yes, then you can treat it as XML and use standard .NET classes to parse it.
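
One way to decide between the two cases is simply to try parsing the string as XML. The sketch below is an illustration only: the class `MarkupCheck`, the method `IsWellFormedFragment`, and the `<root>` wrapper are assumptions, not part of any library. Note that named HTML entities such as `&nbsp;` are not declared in plain XML, so a fragment containing them will be reported as not well-formed.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class MarkupCheck
{
    // Returns true if the fragment parses as XML once wrapped in a single root element.
    public static bool IsWellFormedFragment(string markup)
    {
        try
        {
            // Hypothetical wrapper: a fragment may contain several top-level elements.
            XElement.Parse("<root>" + markup + "</root>");
            return true;
        }
        catch (XmlException)
        {
            // Unclosed tags, unquoted attributes, undeclared entities, etc.
            return false;
        }
    }
}
```

A caller could use the result to pick the XML route when it returns true and fall back to an HTML parser otherwise.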

The second case could be done using LINQ to XML like this:

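// Requires: using System.Linq; using System.Xml.Linq; (doc is an XDocument or XElement)
var xs = from span in doc.Descendants("span")
         let tag = span.Attribute("tag")
         where tag != null && tag.Value == "x"
         select span;
// Setting Value replaces all of the element's content with plain text.
foreach (var x in xs.ToList()) x.Value = "BAR!";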
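
Setting `Value` inserts plain text, so for the tag="y" case, where the new inner content may itself contain markup, something like the following sketch could be used instead. The class `LinqReplaceDemo`, the method `ReplaceInner`, and the `<root>` wrappers are assumptions made for the example. Matching on `Name.LocalName` is a deliberate choice so that spans are still found when the document declares the XHTML namespace.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class LinqReplaceDemo
{
    // Replaces the child nodes of every <span tag="tagValue"> with the parsed newInnerXml.
    public static string ReplaceInner(string xhtml, string tagValue, string newInnerXml)
    {
        // Hypothetical wrappers: XElement.Parse needs a single top-level element.
        XElement root = XElement.Parse("<root>" + xhtml + "</root>");
        XElement replacement = XElement.Parse("<root>" + newInnerXml + "</root>");

        var spans = root.Descendants()
            .Where(e => e.Name.LocalName == "span"
                     && (string)e.Attribute("tag") == tagValue)
            .ToList(); // snapshot the matches before mutating the tree

        foreach (XElement span in spans)
        {
            // Nodes that already have a parent are cloned when added elsewhere,
            // so the same replacement can be reused for every match.
            span.ReplaceNodes(replacement.Nodes());
        }

        // Serialize the children of the wrapper without adding indentation.
        return string.Concat(root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Calling `LinqReplaceDemo.ReplaceInner(input, "y", "<i>new</i> content")` would leave other spans untouched and swap only the content of the matching ones.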

The obvious benefit is that this is much more readable and maintainable than a solution based on regular expressions. Html Agility Pack provides a similar API (although I'm not familiar enough with it to write a sample).

Tomas Petricek
  • [No](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). **You CAN'T do it with regular expressions**. – SLaks May 24 '10 at 01:35
  • This has to be linked when HTML and RegEx are mentioned in the same answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Nick Craver May 24 '10 at 01:37
  • Hehe, great reference :-), but there _are_ cases where I would use regular expressions (if it wasn't _really_ XML and I needed a quick hack rather than solid solution). The title should really be **You'll burn in hell if you do it using regular expressions**. To me, "can't" and "regular expressions" in one sentence suggests that there should be a proof ;-) – Tomas Petricek May 24 '10 at 01:39
  • @John Saunders: I see that he means "XHTML", but this is the world of so called "web standards". – Tomas Petricek May 24 '10 at 01:42
  • @Tomas: I think there's a fair chance that something calling itself XHTML will at some point be consumed by an XML parser, which, if it's not valid XML, will tell you. I see no reason to confuse readers by suggesting there are valid times to use regular expressions when parsing XHTML. – John Saunders May 24 '10 at 01:45
  • Yes I give one vote for Tomas as it is a valid point the file may not be valid XML ( I will have to double check this as it is user provided content from ckEditor) Thanks for providing the LINQ code sample and showing me about Html Agility Pack. Thank you, – Daveo May 24 '10 at 03:44