How to ignore regex matches wrapped by a particular string?

Question

I had a great idea for some functionality on a project, and I've tried to implement it to the best of my ability, but I need a little help achieving the desired effect. The page in question is: http://dev.favorcollective.com/guidelines/ (just to provide some context)

I'm using PHP's preg_replace to go through a particular page's contents (one giant string), search for glossary terms, and wrap each term in a bit of HTML that enables dynamic glossary-definition tooltips.

Here is my current code:

function annotate($content)
{
    global $glossary_terms;
    $search =  array();
    $replace = array();
    $count=1;

    foreach ($glossary_terms as $term):
        array_push($search,'/\b('.preg_quote($term['term'],'/').')[?=a-zA-Z]*/i');
        $id = "annotation-".$count;
        $replacement = '<a href="'.get_bloginfo('url').'/glossary#'.preg_replace( '/\s+/', '', $term['term']).'" class="annotation" rel="'.$id.'">'.$term['term'].'</a><span id="'.$id.'" style="display:none;"><span class="term">'.$term['term'].'</span><span class="definition">'.$term['def'].'</span></span>';
         array_push($replace,(string)$replacement);

         $count++;

    endforeach;

    return preg_replace($search, $replace, $content);
}

• But what if I want to ignore matches inside of <h#> </h#> tags?

• I also have a particular string within which I do not want a specific term to match. For example, I want the word "proficiency" to match any time it is NOT used in the context of "ACTFL Proficiency Guidelines". How would I go about adding exceptions to my regular expression? Is that even an option?

• Finally, how can I return the matched text as a variable? Currently, when I match a term ending in 's' or 'ing' (on purpose), my script prints the glossary term rather than the original string that was matched (i.e. it's replacing "descriptions" with "description"). Is there any way to do that?

Welcome to SO! Please read [this introductory article](http://stackoverflow.com/a/1732454/596781) on processing HTML with regular expressions. - Kerrek SB, Dec 15 '11 at 17:46
Can you or someone else provide an example of what I'm trying to achieve using a PHP HTML parser? Should I revise my question? I never knew regex was so limited. I was under the impression that it was the be-all and end-all of programming. The holy grail. - Jake Downs, Dec 15 '11 at 18:06
There is no holy grail for programming. I don't think that you should revise this question in a way that completely changes the scope, because there is already a pretty decent answer. Make a new question to ask for a parser example. - JosephRuby, Dec 18 '11 at 08:17

score 3 · Answer 1 · edited Dec 18 '11 at 02:16

Not a PHP guy (C#), but here goes. I assume that:

'/\b('.preg_quote($term['term'],'/').')[?=a-zA-Z]*/i' will map to this far more readable pattern:

/\b(ESCAPED_TERM)[?=a-zA-Z]*/i

As for excluding <h#> tags, regex is OK only if you can assume your data is the simple, non-nested case: <h#>TERM</h#>. If you can, you can use a negative lookahead assertion that rejects a term sitting directly before the closing tag:

/\b(ESCAPED_TERM)(?!<\/h\d>)[?=a-zA-Z]*/i

You can use a lookahead with a lookbehind to handle your special case:

/\b(ESCAPED_TERM|(?<!ACTFL )Proficiency(?!\sGuidelines))(?!<\/h\d>)[?=a-zA-Z]*/i
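To make the lookaround pattern concrete, here is a minimal PHP sketch. It assumes the lookahead is meant to test for the closing tag (</h#>), and it uses a hypothetical helper name (wrap_terms) and a shortened replacement markup that are not part of the original code. The $0 in the replacement stands for the full matched text.

```php
<?php
// Hypothetical helper: wraps "proficiency" unless it appears as part of
// "ACTFL Proficiency Guidelines" or directly before a closing heading tag.
function wrap_terms(string $content): string
{
    $pattern = '/\b((?<!ACTFL )proficiency(?!\sGuidelines))(?!<\/h\d>)[?=a-zA-Z]*/i';

    // $0 expands to the whole match, so the original casing is kept.
    return preg_replace($pattern, '<a class="annotation">$0</a>', $content);
}

echo wrap_terms('The ACTFL Proficiency Guidelines describe proficiency levels.'), "\n";
// The ACTFL Proficiency Guidelines describe <a class="annotation">proficiency</a> levels.

echo wrap_terms('<h2>Proficiency</h2>'), "\n";
// <h2>Proficiency</h2>
```

Note that this only covers the simple case described above: a heading such as <h2>Proficiency levels</h2> would still be annotated, because the term is not immediately followed by the closing tag.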

Note: if you have a bunch of these special cases, PHP's PCRE functions support the x (extended) modifier, which ignores unescaped whitespace in the pattern and lets you put each token on its own line.
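As a rough illustration of laying a pattern out over several lines, the sketch below uses PCRE's x modifier in PHP. The function name (annotate_extended), the sample term "description", and the bracket replacement are assumptions made for the example. Under x, a literal space has to be escaped, and # starts a comment that runs to the end of the line, so the comments must not contain the / delimiter.

```php
<?php
// Hypothetical helper: the one-line pattern rewritten with the x modifier.
function annotate_extended(string $content): string
{
    $pattern = '/
        \b
        (?:
            description                # an ordinary glossary term (illustrative)
          |
            (?<!ACTFL\ )proficiency    # "proficiency", unless preceded by "ACTFL "
            (?!\sGuidelines)           # and unless followed by " Guidelines"
        )
        (?!<\/h\d>)                    # skip a term directly before a closing heading tag
        [a-zA-Z]*                      # optional suffix such as "s" or "ing"
    /ix';

    // Brackets stand in for the real tooltip markup; $0 is the full match.
    return preg_replace($pattern, '[$0]', $content);
}

echo annotate_extended('ACTFL Proficiency Guidelines; proficiency and descriptions'), "\n";
// ACTFL Proficiency Guidelines; [proficiency] and [descriptions]
```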

score 0 · Answer 2 · answered Dec 15 '11 at 20:31

Regular expressions are awesome, wonderful, magical. But everything has its limits.

That's why it's nice to have a language like PHP to provide the extra functionality. :)

Can you strip out headers with a non-greedy regexp?

$content = preg_replace('/<h[1-6]>.*?<\/h[1-6]>/sim', "", $content);

If non-greedy evaluations aren't working, what about just assuming that there won't be any other HTML inside your headers?

$content = preg_replace('/<h[1-6]>[^<]*<\/h[1-6]>/im', "", $content);
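Both patterns above delete the headings from $content. If the headings should stay in the output and merely be skipped, one possible variation (not what the snippets above do) is to split the content on heading blocks and annotate only the pieces in between. The sketch below assumes non-nested headings; the helper name annotate_outside_headings and the callback parameter are hypothetical.

```php
<?php
// Hypothetical helper: runs $annotate on everything except <h1>..<h6> blocks.
function annotate_outside_headings(string $content, callable $annotate): string
{
    // PREG_SPLIT_DELIM_CAPTURE keeps the captured heading blocks in the result:
    // even indexes hold ordinary text, odd indexes hold the headings.
    $parts = preg_split(
        '/(<h[1-6]\b[^>]*>.*?<\/h[1-6]>)/is',
        $content,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    if ($parts === false) {
        return $content; // regex error: leave the content unchanged
    }

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = $annotate($part);
        }
    }

    return implode('', $parts);
}

// Usage with a stand-in annotator that just brackets the word "term".
$mark = function (string $text): string {
    return str_replace('term', '[term]', $text);
};
echo annotate_outside_headings('<h2>A term</h2><p>A term here</p>', $mark), "\n";
// <h2>A term</h2><p>A [term] here</p>
```

In the question's code, the callback would be the existing annotate() function.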

Also, you might want to use sprintf to simplify your replacement:

/*
  1  get_bloginfo('url')
  2  preg_replace( '/\s+/', '', $term['term'])
  3  $id
  4  $term['term']
  5  $term['def']
*/
$rfmt = '<a href="%1$s/glossary#%2$s" class="annotation" rel="%3$s">%4$s</a><span id="%3$s" style="display:none;"><span class="term">%4$s</span><span class="definition">%5$s</span></span>';

...

$replacement = sprintf($rfmt, get_bloginfo('url'), preg_replace( '/\s+/', '', $term['term']), $id, $term['term'], $term['def'] );
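On the question of keeping the matched text ("descriptions" rather than "description"): preg_replace() expands $0 in the replacement string to the full match, and sprintf() leaves a literal $0 alone because it only interprets % sequences. The sketch below combines the two. It is an assumption-laden example: the helper name annotate_term is made up, the markup is shortened, and the base URL is passed in as a parameter instead of calling get_bloginfo('url').

```php
<?php
// Hypothetical helper: annotates one term and keeps the text that was matched.
function annotate_term(string $content, string $term, string $baseUrl, string $id): string
{
    // Same pattern shape as in the question: the term plus an optional suffix.
    $search = '/\b(' . preg_quote($term, '/') . ')[?=a-zA-Z]*/i';

    // %1$s..%3$s are filled in by sprintf(); the literal $0 passes through
    // sprintf() untouched and is expanded by preg_replace() afterwards.
    $rfmt = '<a href="%1$s/glossary#%2$s" class="annotation" rel="%3$s">$0</a>';
    $replacement = sprintf($rfmt, $baseUrl, preg_replace('/\s+/', '', $term), $id);

    return preg_replace($search, $replacement, $content);
}

echo annotate_term('Read the descriptions.', 'description', 'http://example.com', 'annotation-1'), "\n";
// Read the <a href="http://example.com/glossary#description" class="annotation" rel="annotation-1">descriptions</a>.
```

One caveat: anything interpolated into the replacement, such as a definition, is also scanned for $n and \n references by preg_replace(), so text containing those sequences would need escaping, or preg_replace_callback() could be used instead.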
