Regex to match words or phrases in string but NOT match if part of a URL or inside tags. (php)

Question

I am aware that regex is not ideal for use with HTML strings, and I have looked at the PHP Simple HTML DOM Parser, but I still believe this is the way to go. All the HTML tags will be generated by my forum software, so they will be consistent and valid HTML.

What I am trying to do is make a plugin that will find a list of keywords (or phrases) in a string of HTML and replace them with a link I specify. For example, if someone types:

I use Amazon for that.

it would replace it with:

I use <a href="http://www.amazon.com">Amazon</a> for that.

The problem, of course, is that if "amazon" is in the URL it would also get replaced. I solved that issue with a callback function found on this site, slightly modified.

But I still have an issue: it also replaces words between opening and closing tags.

<a href="http://www.amazon.com">My Amazon Link</a>

It will match the "Amazon" in "My Amazon Link".

What I really need is a regex that matches, say, "amazon" anywhere except between <a href and </a>.

Any ideas?

do a search this question has already been answered a gazillion times — Lawrence Cherone, May 15 '11 at 15:46
fwiw I did search and I looked at every suggestion I got when I put in the subject. I was probably phrasing it poorly but I was searching for 2 days. — Joe D., May 15 '11 at 17:59
Just a follow-up. In testing I found that if someone had included an image tag with an Amazon source, the word inside the tag would also be converted to a link. I modified the regex to ignore tags too (well, XHTML image tags, in fact all XHTML tags): (?![^<]*(</a>|" />)) — Joe D., May 17 '11 at 12:06

score 9 · Accepted Answer · answered May 15 '11 at 16:06

Using the DOM would certainly be preferable.

However, you might get away with this:

$result = preg_replace('%Amazon(?![^<]*</a>)%i', '<a href="http://www.amazon.com">Amazon</a>', $subject);

It matches Amazon only if

it's not followed by a closing </a> tag,
it's not itself part of a tag,
there are no intervening tags, i.e. it will be thrown off if tags can be nested inside <a> tags.

It will therefore change this:

I use Amazon for that.
I use <a href="http://www.amazon.com">Amazon</a> for that.
<a href="http://www.amazon.com">My Amazon Link</a>
It will match the "Amazon" in "My Amazon Link"

into this:

I use <a href="http://www.amazon.com">Amazon</a> for that.
I use <a href="http://www.amazon.com">Amazon</a> for that.
<a href="http://www.amazon.com">My Amazon Link</a>
It will match the "<a href="http://www.amazon.com">Amazon</a>" in "My <a href="http://www.amazon.com">Amazon</a> Link"
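
Since the question works from a list of keywords rather than a single word, the same lookahead can be applied to several keywords at once. This is only a minimal sketch: the function name linkKeywords, the $links map and the second keyword are hypothetical and not from the original post, and the map keys are assumed to be lower-case. It inherits the limitations listed above, so it is still thrown off by tags nested inside <a> elements.

```php
<?php
// Hypothetical keyword => URL map (keys must be lower-case); not from the original post.
$links = [
    'amazon'         => 'http://www.amazon.com',
    'stack overflow' => 'https://stackoverflow.com',
];

function linkKeywords(string $html, array $links): string
{
    // Longest keywords first, so a phrase wins over a shorter keyword it contains.
    $keywords = array_keys($links);
    usort($keywords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    // Escape each keyword for use inside a %-delimited pattern and join with pipes.
    $quoted = array_map(function ($k) {
        return preg_quote($k, '%');
    }, $keywords);

    // Same lookahead as above: reject a keyword when only non-"<" characters
    // separate it from a closing </a>. \b restricts matches to whole words.
    $pattern = '%\b(' . implode('|', $quoted) . ')\b(?![^<]*</a>)%i';

    return preg_replace_callback($pattern, function (array $m) use ($links) {
        $url = $links[strtolower($m[1])];
        // Keep the keyword exactly as the user typed it.
        return '<a href="' . htmlspecialchars($url) . '">' . $m[1] . '</a>';
    }, $html);
}

echo linkKeywords(
    'I use Amazon and Stack Overflow. <a href="http://www.amazon.com">My Amazon Link</a>',
    $links
), "\n";
```

Using a callback instead of a fixed replacement string lets each keyword pick its own URL while preserving the original capitalization of the matched text.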

This is actually working perfectly for me. Big Thanks. I do hope to learn the DOM soon but I was pretty sure I could "get away" with a regex for now. Thanks to everyone else too. @anubhava I tried your code first but it was still interfering with existing tags. — Joe D., May 15 '11 at 17:29

score 7 · Answer 2 · answered May 15 '11 at 16:12

Don't do this. You cannot reliably do this with Regex, no matter how consistent your HTML is.

Something like this should work, however:

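<?php
$dom = new DOMDocument;
$dom->load('test.xml'); // the markup to process, loaded from a file
$x = new DOMXPath($dom);

// Text nodes that contain 'Amazon' (case-sensitive) and are not inside an <a> element
$nodes = $x->query("//text()[contains(., 'Amazon')][not(ancestor::a)]");

foreach ($nodes as $node) {
    while (false !== strpos($node->nodeValue, 'Amazon')) {
        // Split so that $word starts at the keyword, then cut it off after 6 characters ('Amazon')
        $word = $node->splitText(strpos($node->nodeValue, 'Amazon'));
        $after = $word->splitText(6);

        $link = $dom->createElement('a');
        $link->setAttribute('href', 'http://www.amazon.com');

        // Put the link where the keyword was, then move the keyword text inside it
        $word->parentNode->replaceChild($link, $word);
        $link->appendChild($word);

        // Keep scanning the text that follows the keyword
        $node = $after;
    }
}

$html = $dom->saveHTML();
echo $html;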

It's verbose, but it will actually work.

lonesomeday

When I get some time I'm going to play with this to learn the DOM. Off hand I was really using an array of strings in place of "Amazon" so I was leaning more towards the regex functions I knew would work. But thank you for your time, it won't go to waste. :) – Joe D. May 15 '11 at 17:38
Yeah, this is a nicer solution, but like Joe D., I need to match an array of keywords (currently being imploded with pipes into a regex). So the regex solution above is working out for the time being, but I would be interested to know if there's a way to do that with the DOM method. – Skwerl Mar 04 '12 at 01:48
@KevinCogill Yes – this would be trivial to implement. You'd have to loop through all the text nodes, not just the ones containing `Amazon`, and alter the `while` loop to check for more than one thing. This shouldn't be too difficult. – lonesomeday Mar 04 '12 at 07:47
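
The multi-keyword variant described in the last comment could look roughly like the sketch below: it visits every text node outside <a> elements and checks each one against all keywords. The function name linkKeywordsDom, the $links map and the second keyword are hypothetical. It also swaps in preg_split and a document fragment where the answer uses a splitText loop, and it assumes a reasonably recent PHP and libxml (for the LIBXML_HTML_* flags) plus plain ASCII input, since loadHTML needs extra encoding handling otherwise.

```php
<?php
// Hypothetical keyword => URL map with lower-case keys; not part of the original thread.
$links = [
    'amazon' => 'http://www.amazon.com',
    'ebay'   => 'http://www.ebay.com',
];

function linkKeywordsDom(string $html, array $links): string
{
    $dom = new DOMDocument;
    // Wrap the fragment in a <div> so the parsed document has a single known root.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);

    // One alternation for all keywords; the capture group makes preg_split keep the matches.
    $quoted = array_map(function ($k) {
        return preg_quote($k, '~');
    }, array_keys($links));
    $pattern = '~\b(' . implode('|', $quoted) . ')\b~i';

    // Every text node that is not inside an <a> element, copied to an array before editing.
    $nodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($nodes as $node) {
        // Even indexes hold plain text, odd indexes hold the matched keywords.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no keyword in this text node
        }

        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $dom->createElement('a');
                $a->setAttribute('href', $links[strtolower($part)]);
                $a->appendChild($dom->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        // Swap the original text node for the text/link sequence.
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize the children of the wrapper, leaving the wrapper itself out.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo linkKeywordsDom(
    'I use Amazon and eBay. <a href="http://www.amazon.com">My Amazon Link</a>',
    $links
), "\n";
```

Because the XPath query already excludes anything with an <a> ancestor, nested tags inside links are not a problem here, which is the main advantage over the lookahead regex.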

score 3 · Answer 3 · answered May 15 '11 at 16:05

Try this here

Amazon(?![^<]*</a>)

This searches for Amazon, and the negative lookahead ensures that no closing </a> tag follows it. Inside the lookahead I only allow characters that are not <, so that it cannot run past an opening tag by accident.

http://regexr.com

stema

score 1 · Answer 4 · edited May 23 '17 at 12:19

Joe, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a general question about how to exclude patterns in regex.)

With all the disclaimers about using regex to parse html, here is a simple way to do it.

Here's our simple regex:

<a.*?</a>(*SKIP)(*F)|amazon

The left side of the alternation matches complete <a... </a> tags, then deliberately fails. The right side matches amazon, and we know this is the right amazon because it was not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):

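<?php
$target = "word1 <a stuff amazon> </a> word2 amazon";
// (?i) makes the whole pattern case-insensitive
$regex = "~(?i)<a.*?</a>(*SKIP)(*F)|amazon~";
$repl = '<a href="http://www.amazon.com">Amazon</a>';
$new = preg_replace($regex, $repl, $target);
// htmlentities() only so that the result is displayed as text in a browser
echo htmlentities($new);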
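
The same skip-and-fail idea can be stretched to cover the asker's follow-up about keywords inside other tags such as <img>, by adding one more branch that consumes any single tag. This is only a sketch under stated assumptions: the image URL in $target is made up for illustration, attribute values are assumed not to contain a literal >, and the s flag and the \b word boundaries are additions that are not in the answer's pattern.

```php
<?php
// Branch 1 consumes a whole <a ...>...</a> element, branch 2 consumes any single
// tag (for example <img ... />). Both are then discarded with (*SKIP)(*F), so
// only branch 3 can ever produce a match.
$regex = '~<a\b.*?</a>(*SKIP)(*F)|<[^>]*>(*SKIP)(*F)|\bamazon\b~is';
$repl  = '<a href="http://www.amazon.com">Amazon</a>';

// Hypothetical input: a keyword in plain text, in an <img> tag, and in an existing link.
$target = 'amazon <img src="http://img.amazon.com/x.png" alt="amazon" /> '
        . '<a href="http://www.amazon.com">My Amazon Link</a> amazon';

echo preg_replace($regex, $repl, $target), "\n";
```

Only the first and last amazon in $target sit outside any tag or link, so those are the only two candidates the third branch gets to see.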

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

score 1 · Answer 5 · answered May 15 '11 at 15:51

Unfortunately I think the logic you need is still more complex than text pattern matching :-/

I know it's not the answer you want to hear, but you'll probably get better results with a DOM model.

Here's a discussion of this topic elsewhere: http://coderzone.org/forum/index.php?topic=84.0

Is it possible to just run the filter once, so you don't end up with dupes? Or could the original corpus also include links?

score 0 · Answer 6 · answered Dec 07 '17 at 08:51

An improvement: it should link only when "Amazon" is a whole word, not when it is part of a longer word such as AmazonWorld.

$result = preg_replace('%\bAmazon(?![^<]*</a>)\b%i', '<a href="http://www.amazon.com">Amazon</a>', $subject);

Vikram

score 0 · Answer 7 · answered May 15 '11 at 15:55

Use this code:

$p = '~((<a\s)(?(2)[^>]*?>))?(amazon)~smi';

$str = '<a href="http://www.amazon.com">Amazon</a>';

$s = preg_replace($p, "$1My $3 Link", $str);
var_dump($s);

OUTPUT

string(50) "<a href="http://www.amazon.com">My Amazon Link</a>"

anubhava
