highlight words in html using regex in C#

Question

I found this article on Stack Overflow:

highlight words in html using regex & javascript - almost there

Using the article above, I am trying to highlight HTML text on the server using C#. The code is shown below:

string replacePattern = "$1<span style=\"background-color:yellow\">$2</span>";
string searchPattern = String.Format("(?<=^|>)(.*?)({0})(?=.*?<|$)", searchString.Trim());
content = Regex.Replace(content, searchPattern, replacePattern, RegexOptions.IgnoreCase);

The code seems to work great except when trying to highlight a word that is contained in an image source:

Search Keyword:

ABC

Search Text:

<div><img src="/site/folder/ABC.PNG" /><br />ABC</div>

The result will highlight both the text and the image name.

Any help would be greatly appreciated.

[Obligatory link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Use a DOM parser like [HTMLAgilityPack](http://htmlagilitypack.codeplex.com/). — Sam, May 19 '14 at 19:41
@Sam I am only looking to highlight text within HTML. Don't you think parsing would be overkill? — malkassem, May 19 '14 at 19:45
Nope. How are you supposed to know how "far up" you need to go in the HTML to highlight the image and title? How are you supposed to make sure you do not put an inline `span` around a block `div`? HTML is not a regular language, and irregular things *will* happen eventually. Make it more robust and parse it. However, depending on how you render your templates, maybe you can highlight while rendering the HTML, rather than after the fact. — Sam, May 19 '14 at 19:50

mrk · Accepted Answer · 2014-05-19T22:11:12.857

I'll offer up a solution, but I agree that using regex alone to parse HTML can end up not being worth the effort. That said, you know more about your problem space than the rest of us, so if the HTML you're highlighting is under your control, you may be able to test enough of your domain to achieve what you want with regexes.

My solution changes the regex you've supplied to take this approach:

Match and capture into $1 the > char, followed by a non-greedy run of chars not in the set [<>]
Match and capture your keyword into $2
Match and capture into $3 a non-greedy run of chars not in the set [<>], followed by the < char

Caveats:

well-formed HTML works best; if this HTML is user-generated content (UGC), then good luck, you should've used an HTML parser :)
this would highlight content within <textarea>...</textarea>
this would highlight content within <script>...</script>

Note that you could expand the capture on the left-hand side to also capture the tag name, and then conditionally skip the replacement for a set of tags such as textarea and script.

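using System;
using System.Text.RegularExpressions;

string searchString = "ABC";
string content = "<div><img src='/site/folder/ABC.PNG' /><br />ABC</div>";
// $1 = ">" plus any text before the keyword, $2 = the keyword, $3 = text after it plus "<"
string replacePattern = "$1<span style=\"background-color:yellow\">$2</span>$3";
// Regex.Escape keeps characters such as "." or "+" in the keyword from being read as regex syntax
string searchPattern = String.Format("(>[^<>]*?)({0})([^<>]*?<)", Regex.Escape(searchString.Trim()));
content = Regex.Replace(content, searchPattern, replacePattern, RegexOptions.IgnoreCase);
Console.WriteLine(content);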
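
The note about capturing the tag name on the left-hand side could be sketched roughly as follows. This is only one possible reading of that idea, not a tested drop-in: the class name `TagAwareHighlighter`, the method name `Highlight`, and the skip list are hypothetical, and the pattern differs from the one above in two assumed ways: the keyword is passed through `Regex.Escape`, and the trailing `<` is checked with a lookahead instead of being captured, so that the next tag is still available to the following match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagAwareHighlighter
{
    // Hypothetical skip list: text directly following these opening tags is left alone.
    private static readonly HashSet<string> SkippedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "textarea" };

    public static string Highlight(string content, string searchString)
    {
        // Group 1 = name of the tag before the text ("/" prefix for closing tags)
        // Group 2 = rest of that tag, its ">", and any text before the keyword
        // Group 3 = the keyword, escaped so regex metacharacters match literally
        // The lookahead requires a later "<" without consuming it.
        string pattern = String.Format(
            @"<(/?\w+)\b([^<>]*>[^<>]*?)({0})(?=[^<>]*<)",
            Regex.Escape(searchString.Trim()));

        return Regex.Replace(content, pattern, m =>
        {
            // Text that follows <script ...> or <textarea ...> is returned unchanged.
            if (SkippedTags.Contains(m.Groups[1].Value))
                return m.Value;

            return "<" + m.Groups[1].Value + m.Groups[2].Value
                 + "<span style=\"background-color:yellow\">"
                 + m.Groups[3].Value + "</span>";
        }, RegexOptions.IgnoreCase);
    }
}
```

Calling `TagAwareHighlighter.Highlight(content, "ABC")` on the sample HTML should wrap only the visible ABC and leave the image path alone. The same limits as before still apply: only the first occurrence in each run of text is highlighted, and a script body that itself contains `<` or `>` can still confuse the pattern, which is where an HTML parser becomes the safer choice.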

That was a great solution... Thank you! I have neither script tags nor textarea tags in my HTML, so I should not have to worry about that. — malkassem, May 20 '14 at 12:44
