
Using C# - WinForms

I have a valid HTML string which may or may not contain various HTML elements such as <a>.

I need to search this HTML and highlight certain keywords - the highlighting is done by adding a <span> around the text with inline styling. I should not be doing this for <a> tags, or any other HTML tag that isn't actually visible to the user.

e.g. currently I am doing this:

html = html.Replace(phraseToCount, "<span style=\"background: #FF0000; color: #FFFFFF; font-weight: bold;\">" + phraseToCount + "</span>");

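This kind of works, but it breaks <a> tags because the keyword is also replaced inside attribute values such as href. So in the example below, only the first instance of the word cereal (the one in the visible text) should end up with a <span> around it: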

<p>To view more types of cereal click <a href="http://www.cereal.com">here</a>.</p>
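
To make the problem concrete, here is a minimal sketch of the current approach applied to the example above. The class and method names (`NaiveReplaceDemo`, `NaiveHighlight`) are made up for illustration; only the `Replace` call comes from my actual code.

```csharp
using System;

static class NaiveReplaceDemo
{
    // Hypothetical helper wrapping the Replace call shown earlier.
    public static string NaiveHighlight(string html, string phraseToCount)
    {
        // Plain string replacement cannot tell visible text from markup,
        // so every occurrence is wrapped, including those inside attributes.
        return html.Replace(
            phraseToCount,
            "<span style=\"background: #FF0000; color: #FFFFFF; font-weight: bold;\">"
                + phraseToCount + "</span>");
    }

    static void Main()
    {
        string html = "<p>To view more types of cereal click "
                    + "<a href=\"http://www.cereal.com\">here</a>.</p>";

        // The href value now contains a <span>, which breaks the link.
        Console.WriteLine(NaiveHighlight(html, "cereal"));
    }
}
```

The second occurrence sits inside the href attribute, so the resulting markup is no longer a usable link.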

How could I do this?

EDIT - more info.

This will be running in a WinForms app, as the best way to get the HTML is using the WebBrowser control. I will be scraping web pages and highlighting various words.

Percy

2 Answers


You're handling HTML as plain text. You don't want that. You only want to search through the "InnerText" of your HTML elements, as in <p attribute="value">innertext</p>. Not through tags, comments, styles and script and whatever else can be included in your document.

In order to do that properly, you need to parse the HTML, and then obtain all elements' InnerTexts and do your logic on that.

In fact, InnerText is a simplification: when you have an element like <p>FooBar<span>BarBaz</span></p> where "Baz" is to be replaced, then you need to actually recursively iterate all the nodes in the DOM, and only replace text nodes, because writing into the InnerText property will remove all child nodes.

For how to do that, you'd want to use a library. You don't want to build an HTML parser on your own. See for example "C#: HtmlAgilityPack extract inner text", "Extracting Inner text from HTML BODY node with Html Agility Pack", "How can i parse InnerText of <option> tag with HtmlAgilityPack?", "Parsing HTML with CSQuery", "HtmlAgilityPack - get all nodes in a document" and so on.

The most relevant one seems to be "How can I retrieve all the text nodes of a HTMLDocument in the fastest way in C#?":

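// Requires: using System.Linq; using HtmlAgilityPack;
HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes("//text()");

// SelectNodes returns null when no node matches, so guard before iterating.
if (coll != null)
{
    foreach (HtmlTextNode node in coll.Cast<HtmlTextNode>())
    {
        node.Text = node.Text.Replace(...);
    }
}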
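
Since the goal is to wrap matches in a <span> and to leave links alone, one possible approach is to split each text node around the matches and insert real <span> elements between the pieces. The following is a rough sketch under some assumptions, not a definitive implementation: it assumes the HtmlAgilityPack package is referenced, the names `KeywordHighlighter`, `Highlight` and `SkippedTags` are hypothetical, and the list of skipped tags is my guess at "not visible or not wanted" and should be adjusted.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class KeywordHighlighter
{
    // Assumed list of elements whose text is left untouched; adjust as needed.
    static readonly string[] SkippedTags = { "a", "script", "style", "head", "title", "textarea" };

    const string SpanStyle = "background: #FF0000; color: #FFFFFF; font-weight: bold;";

    public static string Highlight(string html, string phrase)
    {
        if (string.IsNullOrEmpty(phrase)) return html;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null) return html;

        // Copy to a list first, because the tree is modified inside the loop.
        foreach (var node in textNodes.OfType<HtmlTextNode>().ToList())
        {
            // Skip text that sits anywhere inside one of the excluded elements.
            if (node.Ancestors().Any(a => SkippedTags.Contains(a.Name))) continue;

            string text = node.Text; // raw text, entities are not decoded
            if (text.IndexOf(phrase, StringComparison.Ordinal) < 0) continue;

            HtmlNode parent = node.ParentNode;
            int pos = 0, hit;
            while ((hit = text.IndexOf(phrase, pos, StringComparison.Ordinal)) >= 0)
            {
                // Text before the match stays a plain text node.
                if (hit > pos)
                    parent.InsertBefore(doc.CreateTextNode(text.Substring(pos, hit - pos)), node);

                // The match itself becomes a styled <span> element.
                HtmlNode span = doc.CreateElement("span");
                span.SetAttributeValue("style", SpanStyle);
                span.AppendChild(doc.CreateTextNode(phrase));
                parent.InsertBefore(span, node);

                pos = hit + phrase.Length;
            }

            // Remaining text after the last match, then drop the original node.
            if (pos < text.Length)
                parent.InsertBefore(doc.CreateTextNode(text.Substring(pos)), node);
            node.Remove();
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `KeywordHighlighter.Highlight(html, "cereal")` on the example from the question should leave the href untouched and wrap only the occurrence in the paragraph text. Note that the match is case-sensitive and runs against the raw text, so a phrase containing characters that are stored as entities (such as &amp;) would need extra handling.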
Jinjinov
CodeCaster
  • If the innerText is replaced with an innerText with all modifications done, the other elements inside it will be lost, right? – wingerse Apr 29 '16 at 14:54
  • @Emperor if you mean `<p>FooBar<span>BarBaz</span></p>` where `"Baz"` is to be replaced, then yes, you need to actually iterate the child nodes recursively and only replace `text` nodes. – CodeCaster Apr 29 '16 at 14:56
  • Oh, the `text` nodes. Thanks – wingerse Apr 29 '16 at 14:59

Here's how you would do what @CodeCaster suggested, using CSQuery:

string str = "<p>To view more types of cereal click <a href=\"http://www.cereal.com\">here cereal</a>.</p>";
var cq = CQ.Create(str);
foreach (IDomElement node in cq.Elements)
{
    PerformActionOnTextNodeRecursively(node, domNode => domNode.NodeValue = domNode.NodeValue.Replace("cereal", "<span>cereal</span>"));
}
Console.WriteLine(cq.Render());


private static void PerformActionOnTextNodeRecursively(IDomNode node, Action<IDomNode> action)
{
    foreach (var childNode in node.ChildNodes)
    {
        if (childNode.NodeType == NodeType.TEXT_NODE)
        {
            action(childNode);
        }
        else
        {
            PerformActionOnTextNodeRecursively(childNode, action);
        }
    }
}

Hope it helps.

wingerse