
We have a regex that wraps keywords in a <strong> tag unless the keyword is already inside one of a fixed set of tags. This has always worked nicely...

foreach ($keywords as $keyword) {
    $str = preg_replace("/(?!(?:[^<]+>|[^>]+(<\/strong>|<\/a>|<\/b>|<\/i>|<\/u>|<\/em>)))\b(" . preg_quote($keyword, "/") . ")\b/is", "<strong>\\2</strong>", $str, 1);
}

So if the keyword is test, this would change:

A test line

to:

A <strong>test</strong> line

but this would not change:

<a href="">A test line</a>

As you can see, the list of closing tags we want it to ignore is in the regex.

We have encountered a problem with a string that looks like:

<a href="">A test <em>line</em></a>

It's not recognising the closing </a> or </em> for that matter, so it's coming out as...

<a href="">A <strong>test</strong> <em>line</em></a>

Which we don't want it to do. Can anyone see a fix for this? (And yes, I am aware of the "don't parse HTML with regex" rule, so posting links to that infamous post is not an answer ;-))
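
For reference, the three cases above can be reproduced with a small sketch. The wrapper name wrap_keywords is an assumption made for illustration; the pattern is the one from the loop above. The nested case seems to slip through because the [^>]+ run inside the lookahead stops at the > that closes <em>, so it never reaches </em> or </a>.

```php
<?php
// Hypothetical wrapper around the loop from the question (the name is assumed).
function wrap_keywords(string $str, array $keywords): string
{
    foreach ($keywords as $keyword) {
        $str = preg_replace(
            "/(?!(?:[^<]+>|[^>]+(<\/strong>|<\/a>|<\/b>|<\/i>|<\/u>|<\/em>)))\b(" . preg_quote($keyword, "/") . ")\b/is",
            "<strong>\\2</strong>",
            $str,
            1
        );
    }
    return $str;
}

// Plain text: the keyword is wrapped.
$plain = wrap_keywords('A test line', ['test']);
// Keyword directly followed by a listed closing tag: left alone.
$linked = wrap_keywords('<a href="">A test line</a>', ['test']);
// Another tag sits between the keyword and the closing tags: wrapped anyway.
$nested = wrap_keywords('<a href="">A test <em>line</em></a>', ['test']);

echo $plain, "\n", $linked, "\n", $nested, "\n";
```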

fire
  • Sorry, but... If you are aware of it, why are you STILL parsing HTML with Regex? ;-) – Daniel Hilgarth Jul 08 '11 at 08:44
  • I'm aware that you're aware of the don't parse HTML with regex rule so I will post the link to the infamous post anyway: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Roberto Aloi Jul 08 '11 at 08:57

1 Answer


The following regex tries to match the keyword test when it is not enclosed by any of the a, b, i, u, em, or strong tags.

Regex

/^.*?(?!<(a|b|i|u|em|strong).*?>.*?)\btest\b(?!.*?<\/\1>)/i

Test

A test line                          => MATCH
<a href="">A test line</a>           => NO MATCH
<a href="">A test <em>line</em></a>  => NO MATCH

Discussion

^.*?(?!<(a|b|i|u|em|strong).*?>.*?)   => The keyword `test' must not be preceded by
                                         any of the listed opening tags followed by any characters
\btest\b                              => The keyword we want to match
(?!.*?<\/\1>)                         => The keyword `test' must not be followed by
                                         the closing tag of the tag opened previously

Tip

You can extend the regexp to multiple keywords (kw1, kw2, kw3 in the example below) like this:

/^.*?(?!<(a|b|i|u|em|strong).*?>.*?)\b(?:kw1|kw2|kw3)\b(?!.*?<\/\1>)/i
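
If the keywords come from an array, the alternation can be built rather than typed by hand. The following is a minimal sketch; keyword_alternation is a hypothetical helper name, and the simple \b...\b pattern around it is only there to show the result being used.

```php
<?php
// Hypothetical helper (not part of the answer): builds a non-capturing
// alternation such as (?:kw1|kw2|kw3) from a list of keywords.
function keyword_alternation(array $keywords, string $delimiter = '/'): string
{
    $quoted = array_map(
        function ($keyword) use ($delimiter) {
            // Escape regex metacharacters and the pattern delimiter.
            return preg_quote($keyword, $delimiter);
        },
        $keywords
    );
    return '(?:' . implode('|', $quoted) . ')';
}

$alternation = keyword_alternation(['kw1', 'kw2', 'kw3']);
$pattern = '/\b' . $alternation . '\b/i';

// Case-insensitive match of any of the keywords as a whole word.
$found = preg_match($pattern, 'text with KW2 inside', $m);
echo $alternation, "\n";
```

Passing each keyword through preg_quote keeps a keyword such as a.b from being read as a pattern.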

Warning

This regex actually works on the provided test but not in all cases.
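
For inputs where a single regex is not enough, one possible alternative is to split the string into tags and text and keep a counter of open tags. This is a different technique from the regex above, shown only as a sketch: wrap_outside_tags and its $skip parameter are hypothetical names, and it assumes reasonably well-formed markup (no > inside attribute values, no self-closing forms of the skipped tags).

```php
<?php
// Sketch only: wrap_outside_tags() and $skip are hypothetical names.
function wrap_outside_tags(string $html, array $keywords, array $skip = ['a', 'b', 'i', 'u', 'em', 'strong']): string
{
    foreach ($keywords as $keyword) {
        // Odd-indexed tokens are tags, even-indexed tokens are the text between them.
        $tokens = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
        $depth = 0; // number of currently open tags from $skip

        foreach ($tokens as $i => $token) {
            if ($i % 2 === 1) {
                // Track opening and closing tags that should shield their content.
                if (preg_match('/^<(\/?)([a-z][a-z0-9]*)\b/i', $token, $m)
                    && in_array(strtolower($m[2]), $skip, true)) {
                    $depth = max(0, $depth + ($m[1] === '/' ? -1 : 1));
                }
                continue;
            }
            if ($depth > 0) {
                continue; // text inside a skipped tag stays untouched
            }
            $tokens[$i] = preg_replace(
                '/\b(' . preg_quote($keyword, '/') . ')\b/i',
                '<strong>$1</strong>',
                $token,
                1,
                $count
            );
            if ($count > 0) {
                break; // only the first occurrence is wrapped, as in the question
            }
        }
        $html = implode('', $tokens);
    }
    return $html;
}

echo wrap_outside_tags('<a href="">A test <em>line</em></a>', ['test']), "\n";
```

Because strong is in the skip list, a keyword wrapped in an earlier pass is not wrapped again by a later keyword.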

Stephan
  • `$str = preg_replace('/(?<!<(a|b|i|u|em|strong).*?>.*?)\btest\b(?!.*?\1>)/i', "\\2", "A test line", 1);` Gives error: Unknown modifier '\' – fire Jul 08 '11 at 10:56
  • I have changed the regexp to fit the limitations of PHP's regex flavor. – Stephan Jul 08 '11 at 11:05
  • I must admit regex is not well suited to this case. Anyway, those kinds of challenges are always fun. – Stephan Jul 08 '11 at 11:10
  • Indeed.. now the issue is that you don't see the text before the matching word... `$str = preg_replace("/^.*?(?!<(a|b|i|u|em|strong).*?>.*?)\b(test)\b(?!.*?<\/\1>)/is", "\\2", "A test line", 1);` – fire Jul 08 '11 at 11:12
  • Try this : `$str = preg_replace("/^(.*?)(?!<(a|b|i|u|em|strong).*?>.*?)\b(test)\b(?!.*?<\/\2>)/is", "\\1\\3", "A test line", 1);` – Stephan Jul 08 '11 at 11:17