
I have a regular expression that replaces a phrase only when the phrase is not inside an HTML anchor element or an IMG tag. For this example, the phrase being searched for is "hello world".

The .NET regular expression is:

(?<!<a [^<]+)(?<!<img [^<]+)(?<=[ ,.;!]+)hello world(?=[ ,.;&!]+)(?!!.*</a>)

For example, the regular expression should match "hello world" in a phrase like

"one two three hello world four five"

But it shouldn't match "hello world" in a phrase like

"one two three <a href='index.html'> hello world </a> four five"

or

"one two three <img alt='hello world' />four five"

This is related to an earlier question from when I was originally developing the .NET version: "Regular expression that doesn't match a string if it's the text within an html anchor tag".

Any guidance on how to convert this to a PHP regex would be very much appreciated.

Rich
  • This regular expression shouldn't match anything in .NET either. A closing tag has a slash (`</a>`), but your expression looks for a backslash. – Daniel Hilgarth Oct 07 '13 at 13:39
  • And what is the problem now? – Daniel Hilgarth Oct 07 '13 at 13:44
  • It's not looking for the closing '>' for the image tag? – Rich Oct 07 '13 at 13:48
  • And what does that have to do with .NET vs PHP? Your regular expression simply doesn't contain a part that would look for something like that... – Daniel Hilgarth Oct 07 '13 at 13:50
  • @DanielHilgarth: There is variable-length negative lookbehind. That's supported in .NET but not in PCRE. – Jon Oct 07 '13 at 13:50
  • @Jon what sort of php compatible technique should I be looking for if variable-length negative look behind isn't supported? – Rich Oct 07 '13 at 13:58
  • @Rich: There is no one-size-fits-all solution. Frankly this is borderline ["now you have two problems"](http://regex.info/blog/2006-09-15/247) territory. See if it's possible to write the regex so that it matches what should be rejected instead of accepted, and have the `\K` [escape sequence](http://php.net/manual/en/regexp.reference.escape.php) in mind. – Jon Oct 07 '13 at 14:11
  • @Jon thanks for the advice and the \K tip. I'll try your approach and post the solution here once I've arrived at something acceptable. – Rich Oct 07 '13 at 14:22
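
As a rough illustration of the two points raised in the comments, the sketch below first checks that the .NET pattern does not compile under PHP's PCRE functions (because of the unbounded lookbehinds), then tries the "match what should be rejected" idea with `preg_replace_callback`. The names `DOTNET_PATTERN`, `compilesInPcre` and `replaceOutsideTags` are hypothetical, the `\b` word boundaries are a simplification of the original punctuation checks, and the tag matching assumes attribute values never contain `>`.

```php
<?php
// The .NET pattern from the question, wrapped in ~ delimiters for PHP.
const DOTNET_PATTERN = '~(?<!<a [^<]+)(?<!<img [^<]+)(?<=[ ,.;!]+)hello world(?=[ ,.;&!]+)(?!!.*</a>)~';

// Hypothetical helper: preg_match() returns false (and raises a warning,
// suppressed here with @) when the pattern cannot be compiled.
function compilesInPcre(string $pattern): bool
{
    return @preg_match($pattern, '') !== false;
}

// Hypothetical helper: match the things to leave alone first, so the phrase
// alternative can only match outside them.
function replaceOutsideTags(string $html, string $phrase, string $replacement): string
{
    $pattern = '~<a\b[^>]*>.*?</a>'   // a whole anchor element
             . '|<[^>]*>'             // any other tag, attributes included
             . '|\b' . preg_quote($phrase, '~') . '\b~is';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($replacement): string {
            // Matches starting with "<" are tags or anchors: return them unchanged.
            return $m[0][0] === '<' ? $m[0] : $replacement;
        },
        $html
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}
```

Because the callback hands every tag and anchor back untouched, no lookbehind is needed at all; the trade-off is that this is still a regex over HTML and will misbehave on unusual markup.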

1 Answer


Note: Regular expressions are not a reliable way to parse HTML tags.

For `a` or `img` tags specifically, you could use the following.

(?!<(?:a|img)[^>]*?>)\bhello world\b(?![^<]*?(?:</a>|>))

See live demo
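
A minimal sketch of how this first pattern might be used from PHP, run against sample strings modeled on the question. The `~` delimiters avoid having to escape the `/` in `</a>`, the `i` flag is the case-insensitive tweak mentioned in the comment below, and the function name `replacePhrase` and the replacement text `REPLACED` are placeholders, not part of the original answer.

```php
<?php
// The answer's first pattern, with the i flag added for case-insensitive matching.
const FIRST_PATTERN = '~(?!<(?:a|img)[^>]*?>)\bhello world\b(?![^<]*?(?:</a>|>))~i';

// Hypothetical wrapper around preg_replace(); returns null if PCRE reports an error.
function replacePhrase(string $text, string $replacement = 'REPLACED'): ?string
{
    return preg_replace(FIRST_PATTERN, $replacement, $text);
}

// Plain text: the phrase is replaced.
echo replacePhrase("one two three hello world four five"), "\n";

// Inside an anchor element: left unchanged.
echo replacePhrase("one two three <a href='index.html'> hello world </a> four five"), "\n";

// Inside an img attribute: left unchanged.
echo replacePhrase("one two three <img alt='hello world' />four five"), "\n";
```

Note that the trailing lookahead rejects any occurrence followed by a `>` before the next `<`, so plain text such as "hello world > 3" is also left unchanged.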

For anything inside a tag or directly before any closing tag, you could try this.

(?!<[^>]*?>)\bhello world\b(?![^<]*?(?:</[^/]*>|>))

See live demo
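
To show where the two patterns differ, here is a small sketch that runs both against text wrapped in a `<b>` element. The constant names, the helper `applyPattern` and the replacement text `REPLACED` are assumptions made for illustration.

```php
<?php
// Both patterns from the answer, wrapped in ~ delimiters.
const ANCHOR_IMG_PATTERN = '~(?!<(?:a|img)[^>]*?>)\bhello world\b(?![^<]*?(?:</a>|>))~';
const ANY_TAG_PATTERN    = '~(?!<[^>]*?>)\bhello world\b(?![^<]*?(?:</[^/]*>|>))~';

// Hypothetical helper: apply one of the patterns with a placeholder replacement.
function applyPattern(string $pattern, string $text): ?string
{
    return preg_replace($pattern, 'REPLACED', $text);
}

$bold = "one <b>hello world</b> two";

// The first pattern only guards against </a>, so the phrase inside <b> is replaced.
echo applyPattern(ANCHOR_IMG_PATTERN, $bold), "\n";

// The second pattern treats any closing tag as a guard, so the text is left unchanged.
echo applyPattern(ANY_TAG_PATTERN, $bold), "\n";
```

The second pattern only looks ahead to the next `<`, so a phrase followed by an opening tag, as in `<p>hello world <b>x</b></p>`, is still replaced even though it sits inside the `<p>` element.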

hwnd
  • I've used your first regex successfully with a minor tweak for case insensitive matching. Thanks for your help and thanks for the live demos, I will be using that site the next time I have the misfortune to have to write a regex. – Rich Oct 08 '13 at 19:14