1

I'm trying to write a script that parses a block of HTML and matches words against a given glossary of terms. If it finds a match, it wraps the term in <a class="tooltip"></a> and provides a definition.

It's working okay -- except for two major shortcomings:

  1. It matches text that is in attributes
  2. It matches text that is already in an <a> tag, creating a nested link.

Is there any way to have my regular expression match only words that are not in attributes, and not in <a> tags?

Here's the code I'm using, in case it's relevant:

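$search = array();
foreach (Glossary::map() as $term => $def) {
  // Escape regex metacharacters (and the "/" delimiter) that a term may contain.
  $search[] = '/\b(' . preg_quote($term, '/') . ')\b/i';
  self::$lookup[strtoupper($term)] = $def;
}

// One pattern per term; the callback looks up the definition for each match.
return preg_replace_callback($search, array($this, 'replace'), $this->content);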
Gumbo
Aaron
  • 13
    Here comes the "Don't do that with a regex" answers... – Ben S Dec 08 '09 at 19:24
  • Edit: That should read "not in A tags".. the HTML got stripped out. It's okay if the text appears in any tag other than an anchor tag. – Aaron Dec 08 '09 at 19:24
  • I fixed up the code blocks. When you have inline HTML that you want to have show up, surround it with backticks: ` – Ben S Dec 08 '09 at 19:25
  • 2
    Don’t do that with a regex. Use some markup to mark the terms and just replace the marked terms with your links (with a parser). – Gumbo Dec 08 '09 at 19:27
  • And to actually answer the question, NO there's no such regular expression. – falstro Dec 08 '09 at 19:30

3 Answers

5

"Don't do that with a regex."

Use an HTML parser, then apply a regex to the contents of HTML elements as it identifies them. That will allow you to easily operate on lots of different variants of HTML structure, valid and otherwise, without a lot of cruft and hard-to-maintain regular expressions.
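As a rough illustration of that approach, here is a minimal sketch using PHP's bundled DOM extension. It is not the asker's code: the function name linkGlossaryTerms, the wrapper div with id "glossary-root", and the choice to carry the definition in a title attribute are all assumptions made for the example, and $glossary stands in for whatever Glossary::map() returns. Character-encoding handling is left out, so treat it as a starting point for ASCII input only.

```php
<?php
// Sketch only: wrap glossary terms found in text nodes, skipping attribute
// values and anything already inside an <a> element.
// Assumptions: $glossary maps term => definition (a stand-in for
// Glossary::map()), and the definition is carried in a title attribute.
function linkGlossaryTerms($html, array $glossary)
{
    if (!$glossary) {
        return $html;
    }
    // Same idea as the question's lookup table: upper-cased term => definition.
    $lookup = array_change_key_case($glossary, CASE_UPPER);
    $quoted = array_map(function ($term) {
        return preg_quote($term, '/');
    }, array_keys($glossary));
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy markup quietly
    $doc->loadHTML('<div id="glossary-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="glossary-root"]')->item(0);
    // text() never selects attribute values; the predicate drops text in links.
    $textNodes = $xpath->query('.//text()[not(ancestor::a)]', $root);

    foreach (iterator_to_array($textNodes) as $node) {
        // With DELIM_CAPTURE, odd indexes hold the matched terms.
        $pieces = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($pieces) < 2) {
            continue; // no glossary term in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($pieces as $i => $piece) {
            if ($piece === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('class', 'tooltip');
                $a->setAttribute('title', $lookup[strtoupper($piece)]);
                $a->appendChild($doc->createTextNode($piece));
                $fragment->appendChild($a);
            } else {
                $fragment->appendChild($doc->createTextNode($piece));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper, not the html/body shell.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Example input: "regex" appears in an attribute, in plain text, and in a link.
$glossary = array('regex' => 'A pattern describing a set of strings.');
$html = '<p title="regex">A regex is handy. See <a href="/regex">regex docs</a>.</p>';
echo linkGlossaryTerms($html, $glossary), "\n";
```

The regex still does the word matching, but it only ever sees the text of nodes the parser has already classified, so attribute values and existing links are never candidates.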

See the related Stack Overflow question "Robust and Mature HTML Parser for PHP".

Tim Sylvester
  • 5
    How did you link to another question on StackOverflow regarding this issue and not link to this one: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454? – jason Dec 08 '09 at 19:32
  • How did you link to anything without linking to: http://www.codinghorror.com/blog/archives/001311.html – Ben S Dec 08 '09 at 19:34
  • 1
    @Jason Because amusing as it may be, it doesn't actually help the OP accomplish anything. – Tim Sylvester Dec 09 '09 at 01:01
3

Personally, I prefer this answer.

Lee
0

HTML parsing is an interesting research topic. What do you mean by HTML? There are standards (quite a few), and there are web pages. Most researchers do not use regular expressions to parse HTML.

Stephan Eggermont