1

I have HTML in a string that looks like this:

<div id="control">
    <a href="/xx/x">y</a>
    <ul>
        <li><a href="/C003Q/x" class="dw">x</a></li>
        <li><a href="/C003R/xx" class="dw">xx</a></li>
        <li><a href="/C003S/xxx" class="dw">xxx</a></li>
    </ul>
</div>

I would like to change this to the following:

<div id="control">
    <a data-href="/xx/x" ><span>y</span></a>
    <ul>
        <li><a data-href="/C003Q/x" class="dw"><span>x</span></a></li>
        <li><a data-href="/C003R/xx" class="dw"><span>xx</span></a></li>
        <li><a data-href="/C003S/xxx" class="dw"><span>xxx</span></a></li>
    </ul>
</div>

I have heard about regex, but I am not sure how to use it to change the content inside the anchor tags and rename href at the same time. Would I need to apply a regex twice? Can I change the inside of the <a ... >...</a> using regex, or is there an easier way with C#?

marc_s
  • You can treat the HTML as XML and use XmlReader to edit the text in the elements. Take a look at the XmlDocument class. – Stian Standahl Dec 20 '12 at 10:03
  • You could have a look at "how can I remove with specific tags from html" (http://stackoverflow.com/questions/13955247/how-can-i-remove-with-specific-tags-from-html/13955316#comment19261877_13955316) and at "How to use HTML Agility pack" (http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack). – czioutas Dec 20 '12 at 10:05
  • @StianStandahl - That falls over if the HTML is not valid XML (no root element, elements that are _valid_ HTML, such as `<br>`, etc.) – Oded Dec 20 '12 at 10:06
  • Why not use XDocument instead of XmlDocument? If the html is well formed you can use either of those –  Dec 20 '12 at 10:09
  • @Oded - For HTML 4, yes. With HTML 5 it should work, I think. But I didn't consider the unclosed elements. Good catch :) – Stian Standahl Dec 20 '12 at 10:10
  • I strongly recommend manipulating HTML with something more powerful than just strings and regular expressions, e.g. http://htmlagilitypack.codeplex.com/ – Salaros Dec 20 '12 at 10:16
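
The XDocument idea from the comments could look roughly like the sketch below. It only works under the assumption raised above, namely that the HTML is well-formed XML with a single root element (which is true for the snippet in the question). The class and method names (XmlAnchorRewriter, Rewrite) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlAnchorRewriter
{
    // Assumption: the input is well-formed XML with a single root element.
    // XDocument.Parse throws an XmlException otherwise (e.g. for an unclosed <br>).
    public static string Rewrite(string html)
    {
        // PreserveWhitespace keeps the original indentation intact.
        XDocument doc = XDocument.Parse(html, LoadOptions.PreserveWhitespace);

        // ToList() snapshots the anchors before the tree is modified.
        foreach (XElement a in doc.Descendants("a").ToList())
        {
            // Rebuild the attribute list so "href" becomes "data-href"
            // while every other attribute keeps its name and position.
            var attrs = a.Attributes()
                .Select(at => new XAttribute(
                    at.Name == "href" ? (XName)"data-href" : at.Name,
                    at.Value))
                .ToList();
            a.ReplaceAttributes(attrs);

            // Wrap the anchor's existing child nodes in a new <span>.
            var span = new XElement("span", a.Nodes());
            a.ReplaceNodes(span);
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Calling XmlAnchorRewriter.Rewrite(html) on the string from the question should rewrite all four anchors, including the first one that has no class attribute. For HTML that is not valid XML, an HTML parser is the safer choice.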

3 Answers

2

Regex is, in general, not suitable for parsing HTML. The exception is well-known and well-structured HTML (i.e. you know exactly what you are trying to parse).

There are HTML parsers that you can use: the HTML Agility Pack is a popular option, and there is also CsQuery.


What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).


CsQuery - C# jQuery Port for .NET 4

CsQuery is a jQuery port for .NET 4. It implements all CSS2 & CSS3 selectors, all the DOM manipulation methods of jQuery, and some of the utility methods. The majority of the jQuery test suite (as of 1.6.2) has been ported to C#.
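
As a rough illustration of the parser approach, here is a minimal sketch using the HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (HapAnchorRewriter, Rewrite) are hypothetical. Because the attribute is removed and re-added, data-href may end up after class rather than before it, which does not matter to a browser.

```csharp
using System;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

public static class HapAnchorRewriter
{
    public static string Rewrite(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return html;
        }

        foreach (HtmlNode a in anchors)
        {
            // Move the value from href to data-href.
            string href = a.GetAttributeValue("href", "");
            a.Attributes.Remove("href");
            a.SetAttributeValue("data-href", href);

            // Wrap whatever the anchor currently contains in a <span>.
            a.InnerHtml = "<span>" + a.InnerHtml + "</span>";
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Unlike an XML-based approach, this does not require the input to be well-formed, so fragments without a root element or with unclosed tags should still be handled.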

Community
Oded
1

You can use a regular expression replace. Use parentheses to capture values in the text that you match, and use $1, $2, etc. to insert those values into the replacement string:

str = Regex.Replace(
  str,
  "<a href=\"(.+?)\" class=\"dw\">(.+?)</a>",
  "<a data-href=\"$1\" class=\"dw\"><span>$2</span></a>"
);

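Note: If the HTML code doesn't have that exact same form, the replace won't work. If, for example, there is another attribute in the anchor tag, or if the attribute order is reversed, the pattern won't match. For the same reason, the first anchor in the question, which has no class attribute, is left unchanged by this pattern.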
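
A slightly more tolerant variant, sketched below, captures whatever follows the href attribute instead of requiring class="dw", so it also handles anchors with no class. It still assumes that href is the first attribute, that it uses double quotes, and that anchors are not nested or split across lines; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexAnchorRewriter
{
    public static string Rewrite(string html)
    {
        // $1 = href value, $2 = any remaining attributes (may be empty),
        // $3 = the anchor's content (lazy, so it stops at the first </a>).
        return Regex.Replace(
            html,
            "<a href=\"([^\"]*)\"([^>]*)>(.*?)</a>",
            "<a data-href=\"$1\"$2><span>$3</span></a>");
    }
}
```

For example, RegexAnchorRewriter.Rewrite("<a href=\"/xx/x\">y</a>") yields <a data-href="/xx/x"><span>y</span></a>, and an anchor with class="dw" keeps its class attribute after data-href.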

Matthew Strawbridge
Guffa
0

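If you don't want to use a Regex, you could chain plain string replacements. This relies on the exact markup shown in the question: only the class="dw" anchors get an opening <span>, while every </a> gets a closing </span>, so the first anchor (which has no class) ends up unbalanced and needs separate handling.

string newString = oldString.Replace("<a href=", "<a data-href=")   // rename the attribute
                            .Replace("dw\">", "dw\"><span>")        // open the span after class="dw">
                            .Replace("</a", "</span></a");          // close the span before every </a>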
Thorsten Dittmar