HTML to RichTextBox as Plaintext with Hyperlinks

Question

Having read so much about not using regexes to strip HTML, I am wondering how to get some links into my RichTextBox without all the messy HTML that also comes with the content I download from a newspaper site.

What I have: HTML from a newspaper website.

What I want: The article as plain text in a RichTextBox, but with links (that is, replacing <a href="foo">bar</a> with <Hyperlink NavigateUri="foo">bar</Hyperlink>).

HtmlAgilityPack gives me HtmlNode.InnerText (stripped of all HTML tags) and HtmlNode.InnerHtml (with all tags). I can get the URL and text of the link(s) with articlenode.SelectNodes(".//a"), but how should I know where to insert them in the plain text of HtmlNode.InnerText?

Any hint would be appreciated.

score 0 · Accepted Answer · answered Jun 03 '13 at 13:55

Here is how you can do it (with a sample console app but the idea is the same for Silverlight):

Let's suppose you have this HTML:

<html>
<head></head>
<body>
Link 1: <a href="foo1">bar</a>
Link 2: <a href="foo2">bar2</a>
</body>
</html>

Then this code:

HtmlDocument doc = new HtmlDocument();
doc.Load(myFileHtm); // myFileHtm: path to the HTML file

// SelectNodes returns null (not an empty collection) when nothing matches
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
if (links != null)
{
    foreach (HtmlNode node in links)
    {
        // replace the <a> element in the DOM at the exact same place
        // by a deep clone with a different name
        HtmlNode newNode = node.ParentNode.ReplaceChild(node.CloneNode("Hyperlink", true), node);

        // move the link target from href to NavigateUri
        newNode.SetAttributeValue("NavigateUri", newNode.GetAttributeValue("href", null));
        newNode.Attributes.Remove("href");
    }
}
doc.Save(Console.Out);

will output this:

<html>
<head></head>
<body>
Link 1: <hyperlink navigateuri="foo1">bar</hyperlink>
Link 2: <hyperlink navigateuri="foo2">bar2</hyperlink>
</body>
</html>
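
The question's remaining problem, knowing where each link belongs in the plain text, can also be handled by walking the DOM in document order instead of combining InnerText with SelectNodes. What follows is only a minimal sketch, assuming HtmlAgilityPack is referenced. Segment and ArticleFlattener are hypothetical names, not part of the library, and turning each segment into a Run or Hyperlink inline is left to the caller.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper type: one piece of article text; Uri is null for plain text.
public sealed class Segment
{
    public string Text { get; private set; }
    public string Uri { get; private set; }

    public Segment(string text, string uri)
    {
        Text = text;
        Uri = uri;
    }
}

public static class ArticleFlattener
{
    // Walks the tree in document order, so links stay where they were in the text.
    public static List<Segment> Flatten(HtmlNode root)
    {
        var segments = new List<Segment>();
        Collect(root, segments);
        return segments;
    }

    private static void Collect(HtmlNode node, List<Segment> segments)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Plain text: decode entities such as &amp; before display.
                segments.Add(new Segment(HtmlEntity.DeEntitize(child.InnerText), null));
            }
            else if (child.NodeType == HtmlNodeType.Element && child.Name == "a")
            {
                // Link: keep its text and target, drop any markup nested inside it.
                segments.Add(new Segment(
                    HtmlEntity.DeEntitize(child.InnerText),
                    child.GetAttributeValue("href", null)));
            }
            else if (child.NodeType == HtmlNodeType.Element
                     && child.Name != "script" && child.Name != "style")
            {
                // Any other element (p, div, ul, img...): ignore the tag, keep its content.
                Collect(child, segments);
            }
        }
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<body><p>Link 1: <a href=\"foo1\">bar</a></p></body>");
        foreach (Segment s in Flatten(doc.DocumentNode))
        {
            Console.WriteLine(s.Uri == null
                ? "text: " + s.Text
                : "link: " + s.Text + " -> " + s.Uri);
        }
    }
}
```

In a RichTextBox, a segment with a null Uri would presumably become a Run and any other segment a Hyperlink whose NavigateUri is built from Uri. Whitespace-only text nodes are kept as they are, so the caller may want to trim or collapse them.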

Nice! This works, thank you. But I still have to strip all other HTML tags (img, ul, li, p, div...) from my text. The regex `<[^a].*?>` matches every HTML tag except links, but I also have to keep `</a>`. I don't know how to get the OR operator in there to make it match every `<.*>` except `<a>` OR `</a>`. - baumschubser, Jun 04 '13 at 00:10
The answer to this question, btw, would be `<(?!a|/a)[^>]+>`. I figured it all out now. - baumschubser, Jun 05 '13 at 22:18
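
For reference, here is a small sketch of how the negative-lookahead pattern from the last comment behaves, with the character class written out as `[^>]+`. The lookahead `(?!a|/a)` spares every tag whose name merely starts with "a" (for example abbr or article), so a second variant with a word boundary is shown as well. TagStripper and the tighter pattern are assumptions added for illustration, not something from the thread, and regex-based stripping remains fragile on real-world HTML.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class TagStripper
{
    // Pattern from the comment: removes every tag except those starting with "a" or "/a".
    private static readonly Regex FromComment = new Regex(@"<(?!a|/a)[^>]+>");

    // Assumed tighter variant: \b spares only real <a ...> and </a> tags.
    private static readonly Regex Tighter =
        new Regex(@"<(?!/?a\b)[^>]+>", RegexOptions.IgnoreCase);

    public static string StripLoose(string html)
    {
        return FromComment.Replace(html, "");
    }

    public static string StripTight(string html)
    {
        return Tighter.Replace(html, "");
    }

    public static void Main()
    {
        string html = "<div><p>Read <a href=\"foo\">this</a> <abbr>now</abbr></p></div>";

        // prints: Read <a href="foo">this</a> <abbr>now</abbr>
        Console.WriteLine(StripLoose(html));

        // prints: Read <a href="foo">this</a> now
        Console.WriteLine(StripTight(html));
    }
}
```

Both variants still stop at the first `>` after the opening `<`, so a tag with `>` inside an attribute value would be cut in the wrong place.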
