
I need a regular expression that finds the target word or words in HTML (so in amongst tags) but NOT inside an anchor or script tag. I have experimented for ages and came up with this:

(?!<(script|a).*?>)(\btype 2 diabetes\b)(?!<\/(a|script)>)

assuming in this case that the target to replace is "type 2 diabetes".

I thought this would be a common question, but all the references I found are about matching parts of an anchor, not about text that is outside anchor and script tags altogether but in amongst them and other tags.

This is a piece of test data. I have used both http://regexpal.com/ and http://gskinner.com/RegExr/ with the above expression and the test data below. Try as I might, I just cannot exclude the occurrences inside the anchor or script tags without also excluding the occurrence between the sets of anchor or script tags.

In the test data below only "type 2 diabetes" inside the

<p></p>

should be caught.

<a href="https://www.testsite.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
<p>type 2 Diabetes</p>
<a id="logo" href="https://www.help-diabetes.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
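To see why the expression above excludes nothing, it can help to run it against the test data. The sketch below does that with Python's `re` module (an assumption, since the question does not name a language) and case-insensitive matching. The first negative lookahead is tested at the position where the word starts, where the next character is `t` and never `<`, so it always succeeds. The second is tested immediately after the word, where `</h2>` or `</p>` follows rather than `</a>`, so it succeeds too.

```python
import re

html = '''<a href="https://www.testsite.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
<p>type 2 Diabetes</p>
<a id="logo" href="https://www.help-diabetes.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
'''

# The expression from the question, unchanged apart from the ignore-case flag.
pattern = re.compile(
    r'(?!<(script|a).*?>)(\btype 2 diabetes\b)(?!<\/(a|script)>)',
    re.IGNORECASE,
)

# Group 2 is the target phrase; collect every place it was found.
matches = [m.group(2) for m in pattern.finditer(html)]
print(len(matches))  # all three occurrences match, including the two inside <a>
```

Lookarounds only inspect the text adjacent to the current position, so they cannot tell whether an opening `<a>` appeared somewhere earlier in the document.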
Andy Lester
Phil
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – 000 Jun 11 '13 at 14:34
  • Are you trying to write one of these things? http://support.cdn.mozilla.net/media/uploads/images/2012-04-16-09-27-42-9cf425.png – 000 Jun 11 '13 at 14:36
  • **Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/ for examples of how to properly parse HTML with modules that have already been written, tested and debugged. – Andy Lester Jun 11 '13 at 14:56
  • The reason I am trying to do it is because we are using a thing called "no numbers re replacer" in a joomla site to add glossary terms on the fly. The Doctors can edit the articles separately and when rereplacer, using regular expressions, replaces the word with some mark up script to make a hover over tool tip. It works great apart from when a tool tip pops up over a link or gets rereplaced in some javascript on the page! I totally agree that html parsing would be better but the only option he allows is reg expression. Sad face – Phil Jun 12 '13 at 10:18

2 Answers

0

To replace a target word only where it occurs outside a and script tags, the pattern must first try to match these tags (and their content) before it tries the target words. Example:

$subject = <<<LOD
<a href="https://www.testsite.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
<p>type 2 Diabetes</p>
<a id="logo" href="https://www.help-diabetes.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
LOD;

$targets = array('type 2 diabetes', 'scarlet fever', 'bubonic plague');

$pattern = '~<(a|script)\b.+?</\1>|\b(?>' . implode('|', $targets) . ')\b~si';

$result = preg_replace_callback($pattern,
    function ($m) { return (isset($m[1])) ? $m[0] : '!!!rabbit!!!'; },
    $subject);

echo htmlspecialchars($result);

The callback function returns the a or script tag as is when the first capture group is set; otherwise it returns the replacement string.
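For readers more at home in Python, the same skip-then-match idea can be sketched with `re.sub` and a callback. This is only an illustration of the technique described above, not a port of any particular tool; the function name `replace_outside_links` is hypothetical, and `re.escape` is an addition so that target phrases containing regex metacharacters are treated literally.

```python
import re

def replace_outside_links(html, targets, replacement):
    """Replace target phrases unless they sit inside <a>...</a> or <script>...</script>."""
    alternation = "|".join(re.escape(t) for t in targets)
    # First alternative: a whole <a> or <script> element (group 1 is the tag name).
    # Second alternative: one of the target phrases as whole words.
    pattern = re.compile(
        r"<(a|script)\b.+?</\1>|\b(?:" + alternation + r")\b",
        re.IGNORECASE | re.DOTALL,
    )

    def callback(m):
        # Group 1 is set only when a whole element was consumed: return it unchanged.
        return m.group(0) if m.group(1) else replacement

    return pattern.sub(callback, html)

sample = ('<a href="#">type 2 diabetes</a>'
          '<p>Type 2 Diabetes</p>'
          '<script>var s = "type 2 diabetes";</script>')
print(replace_outside_links(sample, ["type 2 diabetes"], "!!!rabbit!!!"))
```

Because the element alternative is tried first and consumes the whole element, the regex engine never gets a chance to start a match on the phrase inside it. Like the original, this does not cope with nested or malformed tags.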

Note that if you want a specific replacement for each target word, you can use an associative array:

$corr = array( 'type 2 diabetes' => 'marmot',
               'scarlet fever'   => 'nutria',
               'bubonic plague'  => 'weasel'  );

$pattern = '~<(a|script)\b.+?</\1>|\b(?>'
         . implode('|', array_keys($corr)) . ')\b~si';

$result = preg_replace_callback($pattern,
    function ($m) use ($corr) {
        return (isset($m[1])) ? $m[0] : $corr[strtolower($m[0])];
    },
    $subject);
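A per-term lookup of this kind might look as follows in Python. The names `GLOSSARY`, `build_pattern` and `replace_terms` are hypothetical, and the dictionary keys are assumed to be lower case so that the case-insensitive match can be looked up after calling `lower()`. Sorting the terms longest first is an extra precaution, not part of the answer above, so that a shorter term cannot shadow a longer one that begins with it.

```python
import re

# Hypothetical glossary; keys must be lower case for the lookup below.
GLOSSARY = {
    "type 2 diabetes": "marmot",
    "scarlet fever": "nutria",
    "bubonic plague": "weasel",
}

def build_pattern(terms):
    # Longest terms first, each escaped so it is matched literally.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(
        r"<(a|script)\b.+?</\1>|\b(?:" + alternation + r")\b",
        re.IGNORECASE | re.DOTALL,
    )

def replace_terms(html, glossary=GLOSSARY):
    pattern = build_pattern(glossary)

    def callback(m):
        if m.group(1):                      # a whole <a>/<script> element: keep it
            return m.group(0)
        return glossary[m.group(0).lower()]  # otherwise look up the matched term

    return pattern.sub(callback, html)

print(replace_terms('<p>Scarlet Fever and bubonic plague</p>'
                    '<a href="#">scarlet fever</a>'))
```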

Keep in mind that the best way to deal with HTML is to use the DOM.

Casimir et Hippolyte
  • This is working the same way as mine but still seems to match in scripts and anchors. I totally get where you are coming from and can see the logic, but alas it is still matching in anchors and scripts. Thanks for having a go though. – Phil Jun 12 '13 at 10:24
0

Do not use regex for this problem. Use an HTML parser. Here is a solution in Python with BeautifulSoup:

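import re

# BeautifulSoup 4 (package "bs4"); the older BS3 import was
# "from BeautifulSoup import BeautifulSoup".
from bs4 import BeautifulSoup

with open('Path/to/file', 'r') as content_file:
    content = content_file.read()

soup = BeautifulSoup(content, 'html.parser')

# Text nodes have no tag name of their own, so check the enclosing tags:
# keep a match only if none of its ancestors is an <a> or <script>.
term = re.compile(r'type 2 diabetes', re.IGNORECASE)
matches = [el for el in soup.find_all(string=term)
           if el.find_parent(['a', 'script']) is None]

# now you can modify the matched elements

with open('Path/to/file.modified', 'w') as output_file:
    output_file.write(str(soup))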
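If installing BeautifulSoup is not an option, a similar result can be sketched with the standard library's `html.parser`. Note that this swaps the tree-based approach above for a streaming one: the parser re-emits the document as it reads it and rewrites text only while no `a` or `script` element is open. The class name `GlossaryRewriter` and the function `rewrite` are hypothetical, and the sketch ignores processing instructions and normalises end tags to `</tag>`.

```python
import re
from html.parser import HTMLParser

class GlossaryRewriter(HTMLParser):
    SKIP = {"a", "script"}

    def __init__(self, pattern, replacement):
        # convert_charrefs=False keeps entities such as &amp; intact on output.
        super().__init__(convert_charrefs=False)
        self.pattern = pattern
        self.replacement = replacement
        self.depth = 0   # number of currently open <a>/<script> elements
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1
        self.out.append(self.get_starttag_text())  # raw tag text, attributes untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <img ... />

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:  # only rewrite text outside <a> and <script>
            data = self.pattern.sub(lambda m: self.replacement, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def rewrite(html, term, replacement):
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    parser = GlossaryRewriter(pattern, replacement)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)

doc = ('<a href="#"><h2>type 2 Diabetes</h2></a>'
       '<p>type 2 Diabetes &amp; you</p>'
       '<script>var t = "type 2 diabetes";</script>')
print(rewrite(doc, "type 2 diabetes", "GLOSSARY"))
```

Tracking a depth counter instead of a single flag means nested elements inside a link, such as the `<h2>` in the test data, are still treated as part of the link.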
000
  • Ah, I missed the tag. Then use this. Whatever, it's the same. http://php.net/manual/en/book.dom.php – 000 Jun 11 '13 at 14:44