find word but not in a link

Question

I need a regular expression that finds the target word or words in HTML (so in amongst tags) but NOT inside an anchor or script tag. I have experimented for ages and came up with this:

(?!<(script|a).*?>)(\btype 2 diabetes\b)(?!<\/(a|script)>)

assuming, in this case, that the target to replace is "type 2 diabetes".

I thought this would be a common question, but all the references I found are about matching parts of an anchor, not about text that is outside any anchor or script tag but in amongst them and other tags.

Below is a piece of test data. I have used both http://regexpal.com/ and http://gskinner.com/RegExr/ with the above expression and this test data; try as I might, I just cannot exclude the text in the anchor or script tags without also excluding the text between sets of anchor or script tags.

In the test data below, only the "type 2 diabetes" inside the <p></p> should be caught.

<a href="https://www.testsite.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
<p>type 2 Diabetes</p>
<a id="logo" href="https://www.help-diabetes.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — 000, Jun 11 '13 at 14:34
Are you trying to write one of these things? http://support.cdn.mozilla.net/media/uploads/images/2012-04-16-09-27-42-9cf425.png — 000, Jun 11 '13 at 14:36
**Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/ for examples of how to properly parse HTML with modules that have already been written, tested and debugged. — Andy Lester, Jun 11 '13 at 14:56
The reason I am trying to do it is that we are using a thing called "no numbers re replacer" in a Joomla site to add glossary terms on the fly. The doctors can edit the articles separately, and rereplacer then uses regular expressions to replace the word with some markup and script that makes a hover-over tooltip. It works great, apart from when a tooltip pops up over a link or the word gets replaced inside some JavaScript on the page! I totally agree that HTML parsing would be better, but the only option he allows is a regular expression. Sad face — Phil, Jun 12 '13 at 10:18

Casimir et Hippolyte · Answer 1 · 2013-06-12T17:23:49.197

To replace a target word only when it occurs outside a and script tags, you must try to match these tags (and their content) before the target words. Example:

$subject = <<<LOD
<a href="https://www.testsite.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
<p>type 2 Diabetes</p>
<a id="logo" href="https://www.help-diabetes.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
LOD;

$targets = array('type 2 diabetes', 'scarlet fever', 'bubonic plague');

$pattern = '~<(a|script)\b.+?</\1>|\b(?>' . implode('|', $targets) . ')\b~si';

$result = preg_replace_callback($pattern,
    function ($m) { return (isset($m[1])) ? $m[0] : '!!!rabbit!!!'; },
    $subject);

echo htmlspecialchars($result);

The callback function returns the a or script element unchanged when the first capture group is set; otherwise it returns the replacement string.
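
For readers who do not use PHP, the same "match the element first, then the target" idea can be sketched in Python. This is only an illustration of the alternation trick described above, not part of the original answer; the shortened sample markup and the names `page`, `targets` and `replace` are assumptions made for the example.

```python
import re

# Sample input, shortened from the question's test data, plus one script element.
page = """<a href="https://www.testsite.org.uk">
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
<p>type 2 Diabetes</p>
<script>var s = "type 2 diabetes";</script>"""

targets = ["type 2 diabetes", "scarlet fever", "bubonic plague"]

# First alternative: a whole <a>...</a> or <script>...</script> element
# (group 1 captures the tag name).
# Second alternative: one of the target phrases, on word boundaries.
pattern = re.compile(
    r"<(a|script)\b.+?</\1>|\b(?:" + "|".join(map(re.escape, targets)) + r")\b",
    re.IGNORECASE | re.DOTALL,
)

def replace(match):
    # Group 1 is None unless the first alternative (a whole element) matched.
    if match.group(1) is not None:
        return match.group(0)   # leave the element untouched
    return "!!!rabbit!!!"       # placeholder replacement, as in the PHP example

result = pattern.sub(replace, page)
print(result)
```

Because the regex engine consumes a whole a or script element before it can reach a target phrase inside it, only occurrences outside those elements are replaced. The sketch still assumes well-behaved markup: nested anchors, or a literal `</a>` inside a script string, would end the skipped region too early.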

Note that if you want a specific replacement for each target word, you can use an associative array:

$corr = array( 'type 2 diabetes' => 'marmot',
               'scarlet fever'   => 'nutria',
               'bubonic plague'  => 'weasel'  );

$pattern = '~<(a|script)\b.+?</\1>|\b(?>'
         . implode('|', array_keys($corr)) . ')\b~si';

$result = preg_replace_callback($pattern,
    function ($m) use ($corr) {
        return (isset($m[1])) ? $m[0] : $corr[strtolower($m[0])];
    },
    $subject);
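
The per-term lookup also fits the glossary tooltip use case mentioned in the comments. The following Python sketch assumes a hypothetical `glossary` mapping with placeholder definitions and a hypothetical `<span class="glossary">` markup; neither comes from the site in question.

```python
import html
import re

# Hypothetical glossary: lowercase term -> tooltip text (placeholder definitions).
glossary = {
    "type 2 diabetes": "placeholder definition for type 2 diabetes",
    "scarlet fever": "placeholder definition for scarlet fever",
}

# Same alternation as above: whole <a>/<script> elements first, then the terms.
pattern = re.compile(
    r"<(a|script)\b.+?</\1>|\b(?:" + "|".join(map(re.escape, glossary)) + r")\b",
    re.IGNORECASE | re.DOTALL,
)

def add_tooltip(match):
    if match.group(1) is not None:
        return match.group(0)          # inside <a> or <script>: keep as is
    term = match.group(0)              # keep the author's original casing
    tip = glossary[term.lower()]       # look up by the lowercased term
    return '<span class="glossary" title="{}">{}</span>'.format(
        html.escape(tip, quote=True), term
    )

sample = '<a href="/x">About type 2 Diabetes</a><p>Living with Type 2 diabetes</p>'
marked = pattern.sub(add_tooltip, sample)
print(marked)
```

Lowercasing the matched text before the lookup mirrors the `strtolower($m[0])` call in the PHP version, so the keys of the mapping must be written in lowercase.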

Keep in mind that the best way to deal with HTML is to use the DOM.
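
As a rough illustration of the parser-based route, here is a Python sketch built on the standard library's event-based `html.parser.HTMLParser`. It is not a full DOM, and the class name `TermReplacer` and the helper `replace_outside_links` are hypothetical. It re-emits the markup it reads and replaces the phrase only in text that is not inside an a or script element.

```python
import re
from html.parser import HTMLParser

class TermReplacer(HTMLParser):
    """Re-emits the HTML it is fed, replacing a phrase only outside <a>/<script>."""

    SKIP = {"a", "script"}

    def __init__(self, phrase, replacement):
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        self.replacement = replacement
        self.depth = 0   # number of skipped elements we are currently inside
        self.out = []    # pieces of the rewritten document

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1
        self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if self.depth == 0:
            # A function replacement keeps backslashes in the text literal.
            data = self.pattern.sub(lambda m: self.replacement, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&" + name + ";")

    def handle_charref(self, name):
        self.out.append("&#" + name + ";")

    def handle_comment(self, data):
        self.out.append("<!--" + data + "-->")

    def handle_decl(self, decl):
        self.out.append("<!" + decl + ">")

def replace_outside_links(markup, phrase, replacement):
    parser = TermReplacer(phrase, replacement)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)

sample = ('<a href="https://www.testsite.org.uk"><h2>Healthy Living for People '
          'with type 2 Diabetes</h2></a>\n<p>type 2 Diabetes</p>')
rewritten = replace_outside_links(sample, "type 2 diabetes", "!!!rabbit!!!")
print(rewritten)
```

The depth counter is what lets the heading nested inside the anchor be skipped. The output is not guaranteed to be byte-identical to the input elsewhere: end tags are written in lowercase and processing instructions are not copied in this sketch.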

This works the same way as mine but still seems to match in scripts and anchors. I totally get where you are coming from and can see the logic, but alas it's still matching in anchors and scripts. Thanks for having a go though. — Phil, Jun 12 '13 at 10:24

score 0 · Answer 2 · edited May 23 '17 at 12:05

Do not use regex for this problem. Use an HTML parser. Here is a solution in Python with BeautifulSoup:

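import re

from bs4 import BeautifulSoup  # BeautifulSoup 4 package name

with open('Path/to/file', 'r') as content_file:
    content = content_file.read()

soup = BeautifulSoup(content, 'html.parser')

# Text nodes that contain the phrase (case-insensitive), skipping any node
# that sits anywhere inside an <a> or <script> element.
matches = [el for el in soup.find_all(string=re.compile(r'type 2 diabetes', re.I))
           if el.find_parent(['a', 'script']) is None]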

# now you can modify the matched elements

with open('Path/to/file.modified', 'w') as output_file:
    output_file.write(str(soup))

000 · answered Jun 11 '13 at 14:43

Ah, I missed the tag. Then use this. Whatever, it's the same. http://php.net/manual/en/book.dom.php – 000 Jun 11 '13 at 14:44
