Regex replace text but exclude when text is between specific tag

Question

I have the following string:

Lorem ipsum Test dolor sit amet, consetetur sadipscing elitr, sed diam nonumy <a href="http://Test.com/url">Test</a> eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd sed Test dolores et ea rebum. Stet clita kasd gubergren, no sea <a href="http://url.com">Test xyz</a> takimata sanctus est Lorem ipsum dolor sit amet.

Now I would like to replace the string 'Test' where it appears outside of tags, but not inside a tag or between an opening and closing tag (e.g. replace it with '1234'):

Lorem ipsum 1234 dolor sit amet, consetetur sadipscing elitr, sed diam nonumy <a href="http://Test.com/url">Test</a> eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd sed 1234 dolores et ea rebum. Stet clita kasd gubergren, no sea <a href="http://url.com">Test xyz</a> takimata sanctus est Lorem ipsum dolor sit amet.

I started with this regex: (?!<a[^>]*>)(Test)([^<])(?!</a>)

But two problems remain unsolved:

The text 'Test' is also replaced inside tags (e.g. <a href="http://Test.com/url">)
If the text between the tags does not exactly match the search text, it is also replaced (e.g. <a href="http://url">Test xyz</a>)

I hope someone has a solution to solve this problem.

Adam · Answer 1 · 2018-11-14T09:12:27.313

Answer

Use

(Test)(?!(.(?!<a))*</a>)

Explanation

Let me remind you of the meaning of some symbols:

1) ?! is a negative lookahead. For example, r(?!d) selects every r that is not directly followed by a d.
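To make point 1) concrete, here is a minimal sketch. Python's re module is my choice for illustration (the thread does not name a language), and the sample text is made up:

```python
import re

# r(?!d) matches an "r" only when the next character is not "d".
# An "r" at the very end of the string also matches, because nothing follows it.
sample = "card rare order"
marked = re.sub(r"r(?!d)", "R", sample)
print(marked)
```

Only the r characters that are not immediately followed by d are upper-cased; the ones in "card" and "order" before the d stay as they are.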

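2) Therefore a negative lookahead should follow something that it restricts. Placed directly in front of a literal it is meaningless: in (?!d)Test the next character has to be T, so it can never be d.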

3) The ? can be used as a lazy match. For example .+E would select from

123EEE

the whole string 123EEE. However, .+?E selects as few characters (the .+? part) as needed, so it would only select 123E.
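The greedy versus lazy difference from point 3) can be checked directly. A small sketch, again assuming Python's re module:

```python
import re

s = "123EEE"

# Greedy: .+ grabs as much as possible, then backs off just enough to find an E.
greedy = re.match(r".+E", s).group()

# Lazy: .+? grabs as little as possible and stops at the first E it can reach.
lazy = re.match(r".+?E", s).group()

print(greedy, lazy)
```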

Answer:

protist's answer is that you should use (?!<a[^>]*?>)(Test)(?![^<]*?</a>). Let me first explain how to make this shorter.

As mentioned in 2), it is meaningless to put a lookahead before the match. So the following is equivalent to protist's answer:

(Test)(?![^<]*?</a>)

Also, since < is excluded by [^<], the lazy quantifier ? is superfluous, so it's also equivalent to

(Test)(?![^<]*</a>)

This selects all Test that are not followed by an </a> without the symbol < in between. This is why Test which appears before or after any <a ...> .. </a> will be replaced.
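A quick check of this shortened pattern, sketched in Python with an abbreviated version of the string from the question (the abbreviation and the variable names are mine):

```python
import re

text = ('Lorem ipsum Test dolor <a href="http://Test.com/url">Test</a> eirmod '
        'sed Test dolores no sea <a href="http://url.com">Test xyz</a> takimata')

# Replace every Test that is not followed by </a> without a < in between.
result = re.sub(r'(Test)(?![^<]*</a>)', '1234', text)
print(result)
```

On this input only the two occurrences outside the links are replaced; the one in the href and the two link texts are left alone.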

However, note that

Lorem Test dolor <a href="http://Test.com/url">Test <strong>dolor</strong></a> eirmod

would be changed to

Lorem 1234 dolor <a href="http://1234.com/url">1234 <strong>dolor</strong></a> eirmod

In order to catch that you could change your regex to

(Test)(?!(.(?!<a))*</a>)

which does the following:

Select every word Test that is not followed by a string ***</a> where each character in *** is not followed by <a.

Note that the dot . is important (see 2)).

Note that a lazy match like (Test)(?!(.(?!<a))*?</a>) is not needed, because nested links are invalid in HTML4 and HTML5 (something like <a href="#">..<a href="#">...</a>..</a>).
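The difference between the two patterns on the nested-tag example can be sketched as follows. This assumes Python's re module, where . does not match newlines unless re.DOTALL is passed, so the sketch only covers single-line input:

```python
import re

text = ('Lorem Test dolor <a href="http://Test.com/url">'
        'Test <strong>dolor</strong></a> eirmod')

# Shorter pattern: the <strong> tag sits between Test and </a>,
# so the lookahead never sees </a> and every Test is replaced.
simple = re.sub(r'(Test)(?![^<]*</a>)', '1234', text)

# Pattern from this answer: any characters may sit between Test and </a>,
# as long as none of them is followed by <a.
tempered = re.sub(r'(Test)(?!(.(?!<a))*</a>)', '1234', text)

print(simple)
print(tempered)
```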

protist said

Also, using regexes on raw HTML is not recommended.

I agree with that. One problem is that things break if a tag is not properly opened or closed. For example, the regex above would change

Lorem Test dolor Test <strong>dolor</strong></a> eirmod Test dolores sea Test takimata

to

Lorem Test dolor Test <strong>dolor</strong></a> eirmod 1234 dolores sea 1234 takimata
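The unclosed-tag case can be reproduced with a small Python sketch. The input string below is an assumption, chosen so that it is consistent with the output shown above:

```python
import re

# A stray </a> with no opening <a ...> before it (assumed input).
broken = ('Lorem Test dolor Test <strong>dolor</strong></a> '
          'eirmod Test dolores sea Test takimata')

# Everything before the stray </a> is treated as if it were inside a link.
out = re.sub(r'(Test)(?!(.(?!<a))*</a>)', '1234', broken)
print(out)
```

The first two occurrences are skipped even though no link was ever opened, which is the kind of surprise a real HTML parser would avoid.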

That is a solution and a great explanation. You're the winner!!! – Elbert Villarreal, Nov 27 '20 at 22:58
Does not work for input "TestTest" - the first "Test" is not replaced. If you put a space character after the initial Test ("Test )|(?= — xhafan, Apr 19 '22 at 11:35

score 13 · Accepted Answer · answered Sep 19 '12 at 11:48

(?!<a[^>]*?>)(Test)(?![^<]*?</a>)

same as zb226, but optimized with a lazy match

Also, using regexes on raw HTML is not recommended.

protist

I also added \b to match word boundaries: (?!<a[^>]*?>)(\bTest\b)(?![^<]*?</a>) – Weri Sep 19 '12 at 12:34
That should give the regex optimizer more to work with. It also should not adversely affect your matches, as long as `_Test_, _Test, or Test_` are not in your document (and assuming you would not care to match them if they were). – protist Sep 19 '12 at 13:10
The lookahead before Test and the lazy match are meaningless. See my answer. – Adam Oct 25 '17 at 16:38
This is not working on regexpr.com Why was this accepted? – Rualark Mar 20 '20 at 09:34

zb226 · Answer 3 · 2020-03-23T23:59:16.513

6

This should do the trick:

(?!<a[^>]*>)(Test)(?![^<]*</a>)

Try it yourself on regexr.

Follow-up: As Adam explains above, the first part has no effect and can be dropped entirely:

(Test)(?![^<]*</a>)

edited Mar 23 '20 at 23:59

answered Sep 19 '12 at 11:24

It is meaningless to put a lookahead before the match – Adam Apr 17 '18 at 22:43
@Adam That's of course correct, thanks for the heads up :) – zb226 Apr 17 '18 at 23:04
This is not working on regexpr.com Why was this accepted? – Rualark Mar 20 '20 at 09:34
@Rualark: a) This answer is not accepted and b) I don't know about the gravity of the fact that it is "not working" on some regex-page I've never heard of, and which instantly trips my company's web firewall for malicious content. – zb226 Mar 23 '20 at 14:13
@Adam Upon revisiting this, I find that the lookahead at the beginning of the pattern is indeed crucial. That's weird because I remember testing your claim back then, and it held true! Going to try to come up with an explanation for that. – zb226 Mar 23 '20 at 14:25
@zb226 could you provide a minimal example where it does not work? – Adam Mar 23 '20 at 15:32
@zb226 Sorry, I was talking about https://regexr.com/ and this answer is not accepted, seems to be not working. – Rualark Mar 24 '20 at 00:34

score 3 · Answer 4 · edited May 23 '17 at 12:25

Resurrecting this ancient question because it had a simple solution that wasn't mentioned.

With all the disclaimers about using regex to parse html, here is a simple way to do it.

Method for Perl / PCRE

<a[^>]*>[^<]*<\/a(*SKIP)(*F)|Test

General Solution

<a[^>]*>[^<]*<\/a|(Test)

In this version, the text to be replaced is captured in Group 1 and the replacement is performed by a simple callback or lambda.
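As a rough illustration of the callback approach, here is a sketch in Python, where re.sub accepts a function as the replacement. The function name replace_outside_links and the shortened sample string are hypothetical:

```python
import re

# Left alternative: consume a whole <a ...>text</a link (nothing is captured).
# Right alternative: capture a Test that was not swallowed by the left one.
PATTERN = re.compile(r'<a[^>]*>[^<]*</a|(Test)')

def replace_outside_links(text, replacement='1234'):
    # Group 1 is only set when the right alternative matched;
    # otherwise return the matched link unchanged.
    return PATTERN.sub(
        lambda m: replacement if m.group(1) else m.group(0), text)

sample = ('Lorem ipsum Test dolor <a href="http://Test.com/url">Test</a> eirmod '
          'sed Test dolores no sea <a href="http://url.com">Test xyz</a> takimata')
replaced = replace_outside_links(sample)
print(replaced)
```

Like the pattern it is built on, this sketch assumes the link text contains no further tags, since [^<]* stops at the first <.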

Reference

How to match pattern except in situations s1, s2, s3
For code implementation see the code samples in How to match a pattern unless...


answered May 15 '14 at 00:06

zx81

The most important part for me was to know `$replaced = preg_replace_callback( $regex, function($m) { if(empty($m[1])) return $m[0]; else return "Superman";}, $subject);`. So I need to return `m[0]` if `m[1]` is empty. Really nice to know. Thank you! – mgutt Apr 04 '15 at 14:03

Benny Paulino · Answer 5 · 2019-06-05T17:09:18.437

0

Adapting the proposed solution by @protist, in this case searching for a phrase and excluding any matches inside of a script tag:

(?!<script[^>]*?>)(\bTest Phrase\b)(?![^<]*?<\/script>)

Demo

The answer provided by Adam, although more concise, takes longer to execute. You can check this by editing the demo mentioned above.

edited Jun 05 '19 at 17:09

answered Jun 05 '19 at 16:46

What question are you answering? – Toto Jun 05 '19 at 16:48
The original question mentions "_when text is between specific tag_", my answer only broadens the solution in the event that someone needs to match against a phrase instead of a single word. – Benny Paulino Jun 05 '19 at 16:56

Nor.Z · Answer 6 · 2023-03-15T16:13:21.293

in_short

For the nested <a> case:

(?<tagWrap><a>(?<m>(\g<tagWrap>)|.)*?<\/a>)(*SKIP)(*FAIL)|(Test)

details

for excluding html `<a>` (nest)

<< not_good-in_nest_case

<< working-in_nest_case

regex
- ((.)(?!(.(?!<a))*<\/a>)) (not_good-in_nest_case)
- (?!<a[^>]*?>)(.)(?![^<]*?<\/a>) (not_good-in_nest_case)
- (?<!<a>(.(?!<\/a>))*?). (not_good-in_nest_case)
- <a[^>]*>[^<]*<\/a(*SKIP)(*F)|. (not_good-in_nest_case)
- (?<tagWrap><a>(?<m>(\g<tagWrap>)|.)*?<\/a>)(*SKIP)(*FAIL)|.
  <- <a>(?<m>(?R)|(?:.(?!<a>|<\/a>))*.)*?<\/a> (working) (PCRE)
flag: gms

sample text (case when <a> is nested)

this Test this
<a>this Test this
<a>this Test this</a>
this Test this</a>

this Test this
<a>this Test this
<a>this Test this</a>
this Test this</a>

this Test this
<a>this Test this
<a>this Test <a>this <em>Test</em> this</a>this</a> more <a>this Test this</a>
this Test this</a>this Test this

explain:
- (?<tagWrap><a>(?<m>(\g<tagWrap>)|.)*?<\/a>)(*SKIP)(*FAIL)|.
  -- matches every character, except (skips) the ones inside (?<tagWrap><a>(?<m>(\g<tagWrap>)|.)*?<\/a>)
- (?<tagWrap><a>(?<m>(\g<tagWrap>)|.)*?<\/a>)
  -- matches every <a>XXXXX</a>, including nested ones
- (?<m>(\g<tagWrap>)|.)*?
  -- matches the XXXXX inside <a>XXXXX</a>; (\g<tagWrap>) tries to recurse whenever possible

  -- i.e.:
  (\g<tagWrap>) tries to recurse and match an <a> whenever possible;
  if that succeeds, it goes into another recursion;
  if it fails, |. matches the current character, which is not the start of an <a>;

  *? makes sure that:
  1. the recursion (\g<tagWrap>) is attempted for every single character inside XXXXX;
  2. |. matches all the characters inside XXXXX (when no recursion is needed);
    (the XXXXX of <a>XXXXX</a>, excluding the tags <a> and </a> themselves:
    <a> is matched by the beginning of the next recursion,
    </a> is matched by the end of that recursion);
minor:
- the use of (.(?!<a>))* to match everything up to (breaking at) <a> is a good hint.
minor:
- simple example of recursion
- \((a|(?R))\)
  ((a)) / (((a))) / ((((a)))) (match these)
- Regular expression to match balanced parentheses
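
Python's built-in re module has no recursion and no (*SKIP)(*FAIL), so the pattern above cannot be used there as-is. A different technique that handles the same nested case is to split the text on <a> and </a> tags and count the nesting depth. This is a sketch of that swapped-in approach, not the regex from this answer, and the function name replace_outside_a is hypothetical:

```python
import re

# Opening or closing <a> tags; the capture group makes re.split keep them.
A_TAG = re.compile(r'(</?a\b[^>]*>)', re.IGNORECASE)

def replace_outside_a(text, word='Test', replacement='1234'):
    depth = 0
    out = []
    # With one capture group, odd indices are the tags, even indices the text.
    for i, part in enumerate(A_TAG.split(text)):
        if i % 2 == 1:
            if part.startswith('</'):
                depth = max(depth - 1, 0)  # tolerate a stray closing tag
            else:
                depth += 1
            out.append(part)
        elif depth == 0:
            out.append(part.replace(word, replacement))
        else:
            out.append(part)
    return ''.join(out)

nested = ('this Test this\n<a>this Test this\n<a>this Test this</a>\n'
          'this Test this</a>\nthis Test this')
fixed = replace_outside_a(nested)
print(fixed)
```

Only the first and last lines of the sample are outside every <a>, so only those two occurrences change. The sketch still replaces text inside other tags' attributes, so it shares the usual caveats about handling HTML without a parser.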

for excluding html comment `<!-- -->`

<< working (3rd one)

regex
- .(?!(.(?!<!--))*-->) (not_good)
- (?<!) (not_working)
- (?<!<!--(.(?!-->))*?). (working) (Javascript regex implementation)
flag: gms

sample text (include case when comment is nested & malformed)

this Test this
<!--this Test this
<!--this Test this-->
this Test this-->

this Test this
<!--this Test this
<!--this Test this-->
this Test this-->

explain:
- (?<!<!--(.(?!-->))*?).
  -- matches every character outside the html comment (but the <!-- openings still get matched...)
- <!--(.(?!-->))*? matches the 1st / 1st + 2nd / 1st + 2nd + 3rd / ... characters after <!--, stopping right before the closest -->.
