Replacing all matches except if surrounded by or only if surrounded by

Question

Given a text string (a markdown document), I need to achieve one of these two options:

to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax ![Blah theWord blah](url).
to replace all the matches of a particular expression ({{([^}}]+)}}\[\[[^\]\]]+\]\]) ONLY inside the markdown images, i.e. ![Blah {{theWord}}[[1234]] blah](url).

Both expressions currently match everywhere, whether inside the markdown image syntax or not, and I've already tried everything I could think of.

Here is an example of the first option

And here is an example of the second option

Any help and/or clue will be highly appreciated.

Thanks in advance!

A URL should not contain spaces, so the simplest way is to remove them from the second pattern. – splash58, May 19 '15 at 19:25
Actually, I don't need to do anything with the URL. The problem is to **avoid** the matches inside the `![...]` part for the first option, or to get **only** those inside that `![...]` part for the second one. I think it should be clear from the posted examples. – ala_747, May 19 '15 at 19:41

revo · Answer 1 · 2015-05-19T20:57:05.770

2

I modified the first expression a little, since I thought it had some unnecessary capturing groups, then made both expressions work by adding a lookahead trick:

-First one (Live demo):

\b(vitae)\b(?![^[]*]\s*\()

-Second one (Live demo):

{{([^}}]+)}}\[\[[^\]\]]+\]\](?=[^[]*]\s*\()

Lookahead part explanations:

(?!            # Start of a negative lookahead
    [^[]*]     # Any characters other than "[", up to the next "]"
    \s*        # Optional whitespace
    \(         # An opening parenthesis right after it
)              # End of lookahead: the match is rejected when the word sits inside ![...](

(?= starts a positive lookahead instead, so the second pattern matches only when that same `](` context follows.
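As a rough illustration of the two lookahead patterns above, here is a minimal Python sketch. It assumes Python's `re` module rather than PCRE, escapes the brackets inside the character classes for readability, and uses `vitae` as a stand-in for theWord as in the answer. The function names and the sample URL in the usage note are hypothetical.

```python
import re

# Option 1: match the word unless the text ahead reaches a "]" (without
# crossing a "[") that is followed by "(", i.e. unless we are inside ![...](
OUTSIDE_IMAGE = re.compile(r"\bvitae\b(?![^\[]*\]\s*\()", re.IGNORECASE)

# Option 2: match {{word}}[[digits]] only when that same "](" context follows.
INSIDE_IMAGE = re.compile(r"\{\{([^}]+)\}\}\[\[[^\]]+\]\](?=[^\[]*\]\s*\()")


def replace_outside(text, replacement):
    """Replace the word everywhere except inside image alt text."""
    return OUTSIDE_IMAGE.sub(replacement, text)


def replace_inside(text, replacement):
    """Replace {{...}}[[...]] tokens only inside image alt text."""
    return INSIDE_IMAGE.sub(replacement, text)
```

For example, `replace_outside("Lorem vitae ![Blah vitae blah](http://example.com/a.png)", "X")` leaves the occurrence inside the image untouched. As the comments below point out, a stray bracket in the surrounding text can defeat this kind of lookahead.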

edited May 19 '15 at 20:57

answered May 19 '15 at 20:46

revo


Demo 1 needs `i` modifier. Nice solutions though. – chris85 May 19 '15 at 20:56
These solutions are workarounds. If there is a stray square or round bracket, there might be a problem. – Wiktor Stribiżew May 19 '15 at 21:06
@stribizhev It's markdown! So it comes with standards. – revo May 19 '15 at 21:08
@stribizhev That's normal! Because you messed up the text with no reason and I can ruin your input as well. – revo May 19 '15 at 22:43

Federico Piazza · Answer 2 · 2015-05-19T21:19:01.627

1

You can leverage the discard technique, which is really useful in cases like this. It consists of a pattern of the following form:

patternToSkip1 (*SKIP)(*FAIL)|patternToSkip2 (*SKIP)(*FAIL)| MATCH THIS PATTERN

So, according to your needs:

to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax

You can easily achieve this in PCRE with the (*SKIP)(*FAIL) verbs, so for your case you can use a regex like this:

\[.*?\](*SKIP)(*FAIL)|\bTheWord\b

Or using your pattern:

\[.*?\](*SKIP)(*FAIL)|(\W)(theWord)(\W)

The idea behind this regex is to tell the regex engine to skip the content within [...] and match the word only in what remains.

Working demo
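Python's built-in `re` module does not support `(*SKIP)(*FAIL)`, so the sketch below swaps in a different but related technique: match the part to skip in a capturing group, match the word as a second alternative, and let a callback put the skipped part back unchanged. It skips whole images (`![...](...)`) rather than any `[...]`, which is an assumption on my part, and the function name is hypothetical.

```python
import re

# A whole markdown image: "![", alt text up to the first "](", then the URL.
IMAGE = r"!\[.*?\]\([^)]*\)"

# Alternative 1 consumes an image (captured in group 1); alternative 2 is the word.
PATTERN = re.compile("(" + IMAGE + r")|\btheWord\b")


def replace_outside_images(text, replacement):
    """Replace theWord everywhere except inside markdown images."""
    def keep_or_replace(match):
        # Group 1 is set only when an image was consumed: return it unchanged.
        if match.group(1) is not None:
            return match.group(1)
        return replacement

    return PATTERN.sub(keep_or_replace, text)
```

Because the image alternative comes first and consumes the whole image, the word alternative never gets a chance to match inside it, which is the same effect the discard technique achieves in PCRE.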

edited May 19 '15 at 21:19

answered May 19 '15 at 20:46

Federico Piazza


True, SKIP-FAIL is the correct trick for the 1st option. However, you should have used `\b` around `theWord` since the `(\W)` is redundant overhead. – Wiktor Stribiżew May 19 '15 at 21:05
@stribizhev FAIL is the easier-to-read alternative to a negative lookahead `(?!)` and not a magic – revo May 19 '15 at 21:12
@stribizhev You said *SKIP-FAIL is the correct trick* and I mentioned it's nothing but an alternative. – revo May 19 '15 at 21:16
@stribizhev Gosh I'm not comparing between alternatives and that's like those guys who can't stick with their girl friend: *-You know, she's a much better alternative.* – revo May 19 '15 at 21:27
@stribizhev **Please avoid extended discussions in comments.** Okay SO, I just wanna say one more thing... don't rationalize! – revo May 19 '15 at 21:34

score 0 · Answer 3 · edited May 23 '17 at 11:43

0

The first regex is easily fixed with a SKIP-FAIL trick:

\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b

Replace with the word of your choice. This is a perfectly valid way in PHP (PCRE) regex to match something outside some markers.

See Demo 1

As for the second one, it is harder, but achievable with \G, which ensures we match consecutively inside some markers:

(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))

To replace with $1$2{{NEW_REPLACED_TEXT}}[[NEW_DIGITS]]

See Demo 2

PHP:

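// Option 1: skip whole image tags, then match the word anywhere else (case-insensitive).
$re1 = '#\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b#i';
// Option 2: \G chains consecutive {{...}}[[...]] matches inside one image's alt text.
$re2 = '#(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))#i';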
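Python's `re` module has no `\G` anchor, so the second pattern cannot be ported directly. A different way to get a similar result, sketched here under that assumption, is a two-step replacement: first find each image's alt text, then replace the `{{...}}[[...]]` tokens only within it. The names `IMAGE_ALT`, `TOKEN` and `replace_in_image_alts` are hypothetical, and unlike the answer's pattern this sketch does not require the URL to start with `http`.

```python
import re

# Step 1: an image opener "![", the alt text (lazily, up to the first "]("), then "](".
IMAGE_ALT = re.compile(r"(!\[)(.*?)(\]\()")

# Step 2: a {{word}}[[digits]] token; group 1 is the word between the braces.
TOKEN = re.compile(r"\{\{([^}]+)\}\}\[\[[^\]]+\]\]")


def replace_in_image_alts(text, replacement):
    """Replace {{...}}[[...]] tokens only inside image alt text."""
    def fix_alt(match):
        opener, alt, closer = match.groups()
        # Tokens outside images are never seen by this inner substitution.
        return opener + TOKEN.sub(replacement, alt) + closer

    return IMAGE_ALT.sub(fix_alt, text)
```

Passing `r"\1"` as the replacement keeps just the word from each token, and several tokens in the same alt text are all handled by the inner `sub` call.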

edited May 23 '17 at 11:43

Community


answered May 19 '15 at 21:03

Wiktor Stribiżew


To be more accurate on patterns, to your second regex catastrophic backtracking is very likely to happen and I have a question: why atomic groups? Engine says *4 matches - 18720 steps*. Don't you think that you are killing the performance by the way? – revo May 19 '15 at 21:59
Once you come across catastrophic backtracking with this regex, please let me know. I am always more worried with accuracy and stability. What good is a quick regex that does not match what you need or overfire? – Wiktor Stribiżew May 19 '15 at 22:03
Like when there is a space between `] (`. Spaces are escaped within reserved symbols by the markdown parser and would you mind answering why you came with using atomic groups here? – revo May 19 '15 at 22:15
Actually, the `](` *is* part of the markdown. If there are spaces between, they must be added into the pattern. The regex will not fail in case there are stray brackets inside the *text*, not *markdown*. Also, atomic groupings are useful when we are not interested in [backtracking positions remembered by any tokens inside the group](http://www.regular-expressions.info/atomic.html). – Wiktor Stribiżew May 19 '15 at 22:28
Forgetting backtracking positions - as I'm aware - is useful when you have multiple piped sequences inside the atomic group, and you don't have any. You're just checking a single expression at a time. So it's totally useless. Isn't it? – revo May 19 '15 at 22:37
Tastes differ, like regex flavors. There is no difference here if I use a non-capturing group or atomic one. – Wiktor Stribiżew May 19 '15 at 22:47
Good night, I am off to bed – Wiktor Stribiżew May 19 '15 at 22:47
I got my answer! Good night! – revo May 19 '15 at 22:48
