17

I have the following string:

Lorem ipsum Test dolor sit amet, consetetur sadipscing elitr, sed diam nonumy <a href="http://Test.com/url">Test</a> eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd sed Test dolores et ea rebum. Stet clita kasd gubergren, no sea <a href="http://url.com">Test xyz</a> takimata sanctus est Lorem ipsum dolor sit amet.

Now I would like to replace the string 'Test' only where it appears outside of tags, not inside a tag or between an opening and closing tag (e.g. replace it with '1234'):

Lorem ipsum 1234 dolor sit amet, consetetur sadipscing elitr, sed diam nonumy <a href="http://Test.com/url">Test</a> eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd sed 1234 dolores et ea rebum. Stet clita kasd gubergren, no sea <a href="http://url.com">Test xyz</a> takimata sanctus est Lorem ipsum dolor sit amet.

I started with this regex: (?!<a[^>]*>)(Test)([^<])(?!</a>)

But two problems are not solved:

  1. The text 'Test' also gets replaced inside tags (e.g. <a href="http://Test.com/url">)
  2. If the text between the tags does not exactly match the search text, it is replaced as well (e.g. <a href="http://url">Test xyz</a>)

I hope someone has a solution to solve this problem.

zb226
Weri

6 Answers

26

Answer

Use

(Test)(?!(.(?!<a))*</a>)

Explanation

Let me remind you of the meaning of some symbols:

1) ?! is a negative lookahead. For example, r(?!d) selects every r that is not directly followed by a d.

2) A negative lookahead consumes no characters itself, so it should be attached to something it constrains. On its own, (?!d) only matches empty positions that are not followed by a d.

3) A ? after a quantifier makes it lazy. For example, .+E would select from

123EEE

the whole string 123EEE. However, .+?E selects as few characters as needed (.+? is the lazy form of .+), so it would only select 123E.
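Points 1) and 3) can be checked quickly. This is only a small sketch using Python's re module; the choice of Python and the sample word "reorder" are my assumptions, not part of the original answer.

```python
import re

# 1) Negative lookahead: find every "r" that is not directly followed by "d".
#    "reorder" is an illustrative sample word (an assumption).
lookahead_hits = [m.start() for m in re.finditer(r"r(?!d)", "reorder")]

# 3) Greedy vs. lazy quantifier on the string from the text above.
greedy = re.match(r".+E", "123EEE").group()
lazy = re.match(r".+?E", "123EEE").group()

print(lookahead_hits)  # positions of the matching "r" characters
print(greedy, lazy)
```

The r in the middle of "reorder" is skipped because a d follows it directly; the other two are matched.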

Answer:

protist's answer suggests (?!<a[^>]*?>)(Test)(?![^<]*?</a>). Let me first explain how to make this shorter.

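As mentioned in 2), the lookahead before the match has no effect here: at the position where Test starts, the next character is T, so the text there can never begin with <a. The following is therefore equivalent to protist's answer: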

(Test)(?![^<]*?</a>)

Also, since [^<] already excludes <, the lazy ? is superfluous, so it is also equivalent to

(Test)(?![^<]*</a>)

This selects every Test that is not followed by </a> with no < in between. This is why a Test that appears before or after any <a ...> .. </a> is replaced.
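To see that behaviour on a shortened version of the string from the question, here is a hedged sketch in Python (the language and the variable names are assumptions; the regex is the one above):

```python
import re

# Shortened version of the string from the question.
text = ('Lorem Test dolor <a href="http://Test.com/url">Test</a> sed Test '
        'dolores <a href="http://url.com">Test xyz</a> amet')

# "Test" not followed by "</a>" without a "<" in between.
pattern = re.compile(r"(Test)(?![^<]*</a>)")

result = pattern.sub("1234", text)
print(result)
```

Only the two free-standing occurrences change; the Test in the href and the link texts stay as they are.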

However, note that

Lorem Test dolor <a href="http://Test.com/url">Test <strong>dolor</strong></a> eirmod

would be changed to

Lorem 1234 dolor <a href="http://1234.com/url">1234 <strong>dolor</strong></a> eirmod 

To handle that case, you could change the regex to

(Test)(?!(.(?!<a))*</a>)

which does the following:

Select every word Test that is not followed by a string ***</a> where each character in *** is not followed by <a.

Note that the dot . is important (see 2)).
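The difference between the two patterns shows up on the example with the nested <strong>. A small sketch, again assuming Python's re module and made-up variable names:

```python
import re

text = ('Lorem Test dolor <a href="http://Test.com/url">'
        'Test <strong>dolor</strong></a> eirmod')

# Shorter pattern: gives up at the first "<", so the nested <strong> breaks it.
simple = re.compile(r"(Test)(?![^<]*</a>)")

# Longer pattern: walks over any character that is not directly followed
# by "<a", so it can look past <strong>...</strong> up to the closing </a>.
tempered = re.compile(r"(Test)(?!(.(?!<a))*</a>)")

simple_result = simple.sub("1234", text)
tempered_result = tempered.sub("1234", text)
print(simple_result)
print(tempered_result)
```

Note that the dot does not match line breaks by default, so for multi-line input the pattern would probably need the DOTALL (s) flag.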

Note that a lazy match like (Test)(?!(.(?!<a))*?</a>) is not needed because nested links are illegal in HTML4 and HTML5 (something like <a href="#">..<a href="#">...</a>..</a>).

protist said

Also, using regexes on raw HTML is not recommended.

I agree with that. One issue is that the result is wrong when an opening or closing tag is missing. For example, the regex above would change

Lorem Test dolor Test <strong>dolor</strong></a> eirmod Test dolores sea Test takimata

to

Lorem Test dolor Test <strong>dolor</strong></a> eirmod 1234 dolores sea 1234 takimata
Adam
13
(?!<a[^>]*?>)(Test)(?![^<]*?</a>)

Same as zb226's answer, but optimized with a lazy match.

Also, using regexes on raw HTML is not recommended.

protist
6

This should do the trick:

(?!<a[^>]*>)(Test)(?![^<]*</a>)

Try it yourself on regexr.

Follow-up: As Adam explains above, the first part has no effect and can be dropped entirely:

(Test)(?![^<]*</a>)
zb226
  • It is meaningless to put a lookahead before the match – Adam Apr 17 '18 at 22:43
  • @Adam That's of course correct, thanks for the heads up :) – zb226 Apr 17 '18 at 23:04
  • This is not working on regexpr.com Why was this accepted? – Rualark Mar 20 '20 at 09:34
  • @Rualark: a) This answer is not accepted and b) I don't know about the gravity of the fact that it is "not working" on some regex-page I've never heard of, and which instantly trips my company's web firewall for malicious content. – zb226 Mar 23 '20 at 14:13
  • @Adam Upon revisiting this, I find that the lookahead at the beginning of the pattern is indeed crucial. That's weird because I remember testing your claim back then, and it held true! Going to try to come up with an explanation for that. – zb226 Mar 23 '20 at 14:25
  • @zb226 could you provide a minimal example where it does not work? – Adam Mar 23 '20 at 15:32
  • @zb226 Sorry, I was talking about https://regexr.com/ and this answer is not accepted, seems to be not working. – Rualark Mar 24 '20 at 00:34
3

Resurrecting this ancient question because it had a simple solution that wasn't mentioned.

With all the disclaimers about using regex to parse html, here is a simple way to do it.

Method for Perl / PCRE

<a[^>]*>[^<]*<\/a(*SKIP)(*F)|Test

General Solution

<a[^>]*>[^<]*<\/a|(Test)

In this version, the text to be replaced is captured in Group 1 and the replacement is performed by a simple callback or lambda.
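A minimal sketch of such a callback with Python's re module. The language choice, the function name replace_outside_links and the sample string are assumptions for illustration, not part of the original answer:

```python
import re

# Left branch consumes a whole <a ...>...</a> link (minus the final ">"),
# right branch captures a free-standing "Test" in group 1.
PATTERN = re.compile(r"<a[^>]*>[^<]*</a|(Test)")

def replace_outside_links(text, replacement="1234"):
    def callback(match):
        # Group 1 is None when the left (link) branch matched:
        # return the link unchanged in that case.
        if match.group(1) is None:
            return match.group(0)
        return replacement
    return PATTERN.sub(callback, text)

sample = ('Lorem Test dolor <a href="http://Test.com/url">Test</a> sed Test '
          'dolores <a href="http://url.com">Test xyz</a> amet')
print(replace_outside_links(sample))
```

The trick is that the link branch "eats" everything that should stay untouched, so the second branch only ever sees text outside of links.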

Reference

  1. How to match pattern except in situations s1, s2, s3
  2. For code implementation see the code samples in How to match a pattern unless...
zx81
  • The most important part for me was to know `$replaced = preg_replace_callback( $regex, function($m) { if(empty($m[1])) return $m[0]; else return "Superman";}, $subject);`. So I need to return `m[0]` if `m[1]` is empty. Really nice to know. Thank you! – mgutt Apr 04 '15 at 14:03
0

Adapting the proposed solution by @protist, in this case searching for a phrase and excluding any matches inside of a script tag:

(?!<script[^>]*?>)(\bTest Phrase\b)(?![^<]*?<\/script>)

Demo

The answer provided by Adam, although more concise, takes longer to execute. This can be checked by editing the demo linked above.

  • What question are you answering? – Toto Jun 05 '19 at 16:48
  • The original question mentions "_when text is between specific tag_", my answer only broadens the solution in the event that someone needs to match against a phrase instead of a single word. – Benny Paulino Jun 05 '19 at 16:56
0

in_short

For the nested <a> case:

(?<tagWrap><a>(?<m>(\g<tagWrap>)|.)*?<\/a>)(*SKIP)(*FAIL)|(Test)

details

for excluding HTML <a> (nested)

  • regex

    • ((.)(?!(.(?!<a))*<\/a>)) (not_good-in_nest_case)
    • (?!<a[^>]*?>)(.)(?![^<]*?<\/a>) (not_good-in_nest_case)
    • (?<!<a>(.(?!<\/a>))*?). (not_good-in_nest_case)
    • <a[^>]*>[^<]*<\/a(*SKIP)(*F)|. (not_good-in_nest_case)
    • (?<tagWrap><a>(?<m>(\g<tagWrap>)|.)*?<\/a>)(*SKIP)(*FAIL)|.
      <- <a>(?<m>(?R)|(?:.(?!<a>|<\/a>))*.)*?<\/a> (working) (PCRE)
  • flag: gms

  • sample text (case when <a> is nested)

    this Test this
    <a>this Test this
    <a>this Test this</a>
    this Test this</a>
    
    this Test this
    <a>this Test this
    <a>this Test this</a>
    this Test this</a>
    
    this Test this
    <a>this Test this
    <a>this Test <a>this <em>Test</em> this</a>this</a> more <a>this Test this</a>
    this Test this</a>this Test this
    
  • explain:

    • (?<tagWrap><a>(?<m>(\g<tagWrap>)|.)*?<\/a>)(*SKIP)(*FAIL)|.
      -- matches every character except (i.e. skips) the ones inside (?<tagWrap><a>(?<m>(\g<tagWrap>)|.)*?<\/a>)

    • (?<tagWrap><a>(?<m>(\g<tagWrap>)|.)*?<\/a>)
      -- matches every <a>XXXXX</a>, including nested ones

    • (?<m>(\g<tagWrap>)|.)*?
      -- matches XXXXX inside the <a>XXXXX</a>; (\g<tagWrap>) tries to recurse whenever possible

      -- i.e.:
      (\g<tagWrap>) tries to recurse and match a nested <a> whenever possible;
      if that succeeds, it goes into another level of recursion;
      if it fails, |. matches the current character, which does not start an <a>;

      *? makes sure that:

      1. the recursion (\g<tagWrap>) is attempted for every single character inside XXXXX;
      2. |. matches all the remaining characters inside XXXXX (when no recursion is needed);
        (XXXXX of <a>XXXXX</a>, excluding the <a> and </a> tags themselves:
        <a> is matched by the beginning of the next recursion,
        </a> is matched by the ending of that recursion);
  • minor:

    • the use of (.(?!<a>))* to match everything up to (and break at) <a> is a useful trick.
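For engines without recursion or (*SKIP)(*FAIL), the same result on the nested sample can be approximated without a recursive regex. The sketch below is a different technique (a depth counter over the tags, written in Python); the function name and the choice of Python are my assumptions, and it is not the PCRE pattern from this answer.

```python
import re

# Split on opening and closing <a> tags; the capturing group keeps
# the tags themselves in the result of split().
A_TAG = re.compile(r"(<a\b[^>]*>|</a>)")

def replace_outside_nested_links(text, needle="Test", replacement="1234"):
    depth = 0  # how many <a> elements are currently open
    out = []
    for token in A_TAG.split(text):
        if token == "</a>":
            depth = max(depth - 1, 0)  # tolerate a stray closing tag
            out.append(token)
        elif A_TAG.fullmatch(token):
            depth += 1
            out.append(token)
        elif depth == 0:
            out.append(token.replace(needle, replacement))
        else:
            out.append(token)  # text inside a link stays untouched
    return "".join(out)

sample = ("this Test this\n<a>this Test this\n<a>this Test this</a>\n"
          "this Test this</a>\nthis Test this")
print(replace_outside_nested_links(sample))
```

Only the first and the last line of the sample are outside every <a>, so only those two occurrences are replaced.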
for excluding html comment <!-- -->

  • regex

    • .(?!(.(?!<!--))*-->) (not_good)
    • (?<!<!--.*?)(.)|(.)(?!.*?-->) (not_working)
    • (?<!<!--(.(?!-->))*?). (working) (Javascript regex implementation)
  • flag: gms

  • sample text (includes the case where a comment is nested and malformed)

    this Test this
    <!--this Test this
    <!--this Test this-->
    this Test this-->
    
    this Test this
    <!--this Test this
    <!--this Test this-->
    this Test this-->
    
  • explain:

    • (?<!<!--(.(?!-->))*?).
      -- match every character outside the html comment (but the <!-- & --> openings still get matched...)

    • <!--(.(?!-->))*? matches, starting from <!--, the 1st character, then the 1st + 2nd, then the 1st + 2nd + 3rd, and so on, for as long as .(?!-->) holds, i.e. until right before the closest -->.

Nor.Z