How can I wrap a <span> around matched words in HTML without breaking the HTML

Question

Using C# - WinForms

I have a valid HTML string which may or may not contain various HTML elements such as <a>.

I need to search this HTML and highlight certain keywords. The highlighting is done by adding a <span> around the text with inline styling. I should not be doing this for <a> tags, or any other HTML tag that isn't actually visible to the user.

e.g. currently I am doing this:

html = html.Replace(phraseToCount, "<span style=\"background: #FF0000; color: #FFFFFF; font-weight: bold;\">" + phraseToCount + "</span>");

This kind of works, but it breaks <a> tags. So in the example below, only the 1st instance of the word cereal should end up with a <span> around it:

<p>To view more types of cereal click <a href="http://www.cereal.com">here</a>.</p>
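To make the breakage concrete, here is a minimal sketch that runs the same kind of Replace call on the sample above. The class and method names (NaiveReplaceDemo, Highlight) are made up for illustration.

```csharp
using System;

public static class NaiveReplaceDemo
{
    // Same inline style as in the Replace call shown earlier.
    private const string OpenTag =
        "<span style=\"background: #FF0000; color: #FFFFFF; font-weight: bold;\">";

    public static string Highlight(string html, string phrase)
    {
        // Plain string replacement cannot tell visible text from attribute values.
        return html.Replace(phrase, OpenTag + phrase + "</span>");
    }

    public static void Main()
    {
        string html = "<p>To view more types of cereal click " +
                      "<a href=\"http://www.cereal.com\">here</a>.</p>";

        // Both occurrences are wrapped, including the one inside the href.
        Console.WriteLine(Highlight(html, "cereal"));
    }
}
```

The second match lands inside the href attribute, so the attribute value now starts with http://www.<span and the link is no longer valid markup.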

How could I do this?

EDIT - more info.

This will be running in a Winforms app as the best way to get the HTML is using the WebBrowser control - I will be scraping web pages and highlighting various words.

You shouldn't really be messing around with raw HTML in C#. Let the view do that logic. — ManoDestra, Apr 29 '16 at 14:35
Do you have to do this server side? Can you use client script? This would be pretty easy with jQuery. — squillman, Apr 29 '16 at 14:35
There are html parsing libraries out there. I personally use CSQuery which is a jquery port — wingerse, Apr 29 '16 at 14:35
Has to be done using WinForms unfortunately. I've added that to the question. thanks. — Percy, Apr 29 '16 at 14:35
Look up `HtmlAgilityPack`. That should help you do anything you could possibly want to do with HTML server side. — ScarletMerlin, Apr 29 '16 at 14:37

score 5 · Accepted Answer · edited Feb 08 '23 at 16:54

You're handling HTML as plain text, and you don't want that. You only want to search the "InnerText" of your HTML elements, that is, the text between the tags. You don't want to search the tags themselves, comments, styles, scripts, or anything else the document can contain.

In order to do that properly, you need to parse the HTML, and then obtain all elements' InnerTexts and do your logic on that.

In fact, InnerText is a simplification. When an element mixes text with child elements (so that its combined text reads FooBarBarBaz) and "Baz" is to be replaced, you need to recursively iterate all the nodes in the DOM and replace only text nodes, because writing to the InnerText property removes all child nodes.
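The recursive walk described above could look roughly like this with HtmlAgilityPack. This is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced, the helper name ForEachTextNode is hypothetical, and the sample markup is invented for the demonstration.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be referenced

public static class TextNodeWalker
{
    // Hypothetical helper: visits every text node below 'node', depth first.
    public static void ForEachTextNode(HtmlNode node, Action<HtmlTextNode> action)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child is HtmlTextNode textNode)
            {
                action(textNode);               // leaf: plain text
            }
            else
            {
                ForEachTextNode(child, action); // element: keep descending
            }
        }
    }

    public static string ReplaceInText(string html, string oldValue, string newValue)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Only text nodes are rewritten, so child elements stay in place.
        ForEachTextNode(doc.DocumentNode, t => t.Text = t.Text.Replace(oldValue, newValue));

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Invented sample: text interleaved with a child element.
        Console.WriteLine(ReplaceInText("<p>Foo <b>Bar</b> Baz</p>", "Baz", "Qux"));
    }
}
```

Because each text node is edited individually, the surrounding element structure is never rebuilt, which is the difference from assigning to a parent's inner text.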

For how to do that, you'd want to use a library. You don't want to build an HTML parser on your own. See for example C#: HtmlAgilityPack extract inner text, Extracting Inner text from HTML BODY node with Html Agility Pack, How can i parse InnerText of <option> tag with HtmlAgilityPack?, Parsing HTML with CSQuery, HtmlAgilityPack - get all nodes in a document and so on.

The most relevant one seems to be How can I retrieve all the text nodes of a HTMLDocument in the fastest way in C#?:

// SelectNodes returns null when nothing matches, so guard before iterating.
HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes("//text()");

if (coll != null)
{
    foreach (HtmlTextNode node in coll.Cast<HtmlTextNode>())
    {
        node.Text = node.Text.Replace(...);
    }
}
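Put together for the question's case, a fuller sketch might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class name KeywordHighlighter and the list of elements to skip are assumptions made for illustration, not part of the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be referenced

public static class KeywordHighlighter
{
    // Assumed, non-exhaustive list of elements whose text the user never sees.
    private static readonly string[] SkippedElements = { "script", "style", "head" };

    private const string OpenTag =
        "<span style=\"background: #FF0000; color: #FFFFFF; font-weight: bold;\">";

    public static string Highlight(string html, string phrase)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the document has no text nodes.
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
        {
            return html;
        }

        foreach (HtmlTextNode node in textNodes.OfType<HtmlTextNode>())
        {
            // Skip text that lives inside script, style, or head.
            if (node.Ancestors().Any(a => SkippedElements.Contains(a.Name)))
            {
                continue;
            }

            // Attribute values such as href are not text nodes, so they are never touched.
            node.Text = node.Text.Replace(phrase, OpenTag + phrase + "</span>");
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        string html = "<p>To view more types of cereal click " +
                      "<a href=\"http://www.cereal.com\">here</a>.</p>";
        Console.WriteLine(Highlight(html, "cereal"));
    }
}
```

One caveat with this approach: the Text property holds raw markup, so a phrase containing characters such as & or < would need to be HTML-encoded before matching and wrapping. The replacement is also case-sensitive, like the original Replace call.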

If the innerText is replaced with an innerText with all modifications done, the other elements inside it will be lost, right? — wingerse, Apr 29 '16 at 14:54
@Emperor if you mean `FooBarBarBaz` where `"Baz"` is to be replaced, then yes, you need to actually iterate the child nodes recursively and only replace `text` nodes. — CodeCaster, Apr 29 '16 at 14:56

score 0 · Answer 2 · answered Apr 29 '16 at 16:04

Here's how you would do what @CodeCaster suggested, using CsQuery:

string str = "<p>To view more types of cereal click <a href=\"http://www.cereal.com\">here cereal</a>.</p>";
var cq = CQ.Create(str);
foreach (IDomElement node in cq.Elements)
{
    PerformActionOnTextNodeRecursively(node, domNode => domNode.NodeValue = domNode.NodeValue.Replace("cereal", "<span>cereal</span>"));
}
Console.WriteLine(cq.Render());


// Walks the subtree below 'node' and invokes 'action' on every text node it finds.
private static void PerformActionOnTextNodeRecursively(IDomNode node, Action<IDomNode> action)
{
    foreach (var childNode in node.ChildNodes)
    {
        if (childNode.NodeType == NodeType.TEXT_NODE)
        {
            action(childNode);
        }
        else
        {
            PerformActionOnTextNodeRecursively(childNode, action);
        }
    }
}

Hope it helps.
