Regular Expression Negative Lookahead/Lookbehind to Exclude HTML from Find-and-Replace

Question

I have a feature on my site where the search query is highlighted in the search results. However, some of the fields that the site searches through have HTML in them. For example, let's say I had a search result consisting of <span>Hello all</span>. If the user searched for the letter a, I want the code to return <span>Hello <mark>a</mark>ll</span> instead of the messy <sp<mark>a</mark>n>Hello <mark>a</mark>ll</sp<mark>a</mark>n> that it would return now.

I know that I can use negative lookbehinds and lookaheads in preg_replace() to exclude any instances where the a is between a < and >. But how do I do that? Regular expressions are one of my weaknesses and I can't seem to come up with any that work.

So far, what I've got is this:

$return = preg_replace("/(?<!\<[a-z\s]+?)$match(?!\>[a-z\s]+?)/i", '<mark>'.$match.'</mark>', $result);

But it doesn't seem to work. Any help?

**Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. — Andy Lester, Mar 20 '13 at 14:47
I see. How would I use a parsing module to do a find-and-replace on just the text within a string of HTML? — TerranRich, Mar 20 '13 at 14:49
Possible solution using DomDocument: http://stackoverflow.com/questions/9335689/dom-parser-to-highlight-keywords-not-working — SDC, Mar 20 '13 at 15:07
@SDC, that solution works beautifully! Thanks! Would you like to post it as an answer so I can give you credit? — TerranRich, Mar 20 '13 at 16:06
@TerranRich - well, I don't really deserve the credit, but I'll post a brief answer anyway. :-) hold on..... — SDC, Mar 20 '13 at 16:25
Okay. posted an answer; feel free to accept it. But please upvote the answer in the linked question as well; he's the one who did the hard work. — SDC, Mar 20 '13 at 16:28

MikeM · Answer 1 · 2013-03-20T15:18:23.953

1

If you do want to use regular expressions, a simple negative look-ahead is all that is required (assuming well-formed markup with no < or > within or between the tags):

$return = preg_replace("/$match(?![^<>]*>)/i", '<mark>$0</mark>', $result);

Any special regular expression characters in $match will need to be properly escaped.
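
As a minimal sketch of that escaping step: preg_quote() can be applied to the search term before it goes into the pattern. The function name highlight_outside_tags is made up for illustration, and the type hints assume PHP 7 or later. The same caveat about well-formed markup still applies.

```php
<?php
// Hypothetical helper; the name is illustrative and not from the answer above.
function highlight_outside_tags(string $html, string $term): string
{
    if ($term === '') {
        return $html; // an empty term would match everywhere, so skip it
    }

    // preg_quote() escapes regex metacharacters; passing '/' as the second
    // argument also escapes the delimiter used in the pattern below.
    $quoted = preg_quote($term, '/');

    // Match the term only when it is NOT followed by a '>' with no '<' or '>'
    // in between, i.e. only when the match is not inside a tag.
    return preg_replace("/$quoted(?![^<>]*>)/i", '<mark>$0</mark>', $html);
}

echo highlight_outside_tags('<span>Hello all</span>', 'a'), "\n";
// prints: <span>Hello <mark>a</mark>ll</span>
```

Without the preg_quote() call, a search term such as + or ( would either change the meaning of the pattern or make preg_replace() fail to compile it.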

edited Mar 20 '13 at 15:18

answered Mar 20 '13 at 15:10

MikeM


score 1 · Accepted Answer · edited May 23 '17 at 11:50

It's considered bad practice to use regex to parse a complex language like HTML. With sufficient skill and patience, and an advanced regex engine, it may be possible, but the potential pitfalls are huge and the performance is unlikely to be good.

A better solution is to use a DOM parser such as PHP's built-in DOMDocument class.

A good example of this can be found in the answer to this related SO question: http://stackoverflow.com/questions/9335689/dom-parser-to-highlight-keywords-not-working
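
For a rough idea of what the DOMDocument approach can look like, here is a sketch that only touches text nodes, so tag names and attributes are never matched. It is not the code from the linked answer. The function name highlight_text_nodes and the hl-root wrapper id are invented for this example, and it assumes PHP 7 or later, the dom extension, and UTF-8 input.

```php
<?php
// Hypothetical helper: wraps matches of $term in <mark>, in text nodes only.
function highlight_text_nodes(string $html, string $term): string
{
    if ($term === '') {
        return $html;
    }

    $doc  = new DOMDocument();
    $prev = libxml_use_internal_errors(true); // silence warnings on HTML5 tags
    // Wrap the fragment in a known root element so it can be found again;
    // the XML declaration is a common hint so the input is read as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="hl-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($prev);

    $xpath   = new DOMXPath($doc);
    $root    = $xpath->query('//div[@id="hl-root"]')->item(0);
    $pattern = '/(' . preg_quote($term, '/') . ')/iu';

    // Copy the node list to an array first, because the loop edits the tree.
    foreach (iterator_to_array($xpath->query('.//text()', $root)) as $textNode) {
        // With DELIM_CAPTURE, odd indexes hold matches, even indexes the rest.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // no match in this text node
        }

        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $mark = $doc->createElement('mark');
                $mark->appendChild($doc->createTextNode($part));
                $fragment->appendChild($mark);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the wrapper, not the whole document.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlight_text_nodes('<span>Hello all</span>', 'a'), "\n";
```

Because the parser decides what is markup and what is text, this keeps working when attributes contain the search term or when the HTML is not as tidy as expected, at the cost of the parser possibly normalizing the markup it writes back out.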

Hope that helps.

edited May 23 '17 at 11:50

Community


answered Mar 20 '13 at 16:27

SDC


This definitely helped. I re-wrote the Search functionality to parse the HTML instead of just plain ol' `preg_replace()`ing keywords to surround them with `span` tags. Thank you so much! – TerranRich Mar 20 '13 at 17:28
