Given a text string (a markdown document), I need to achieve one of these two options:

  • to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax ![Blah theWord blah](url).

  • to replace all the matches of a particular expression ({{([^}}]+)}}\[\[[^\]\]]+\]\]) ONLY inside the markdown images, i.e. ![Blah {{theWord}}[[1234]] blah](url).

Both expressions currently match everything, whether inside the markdown image syntax or not, and I've already tried everything I could think of.

Here is an example of the first option

And here is an example of the second option

Any help and/or clue will be highly appreciated.

Thanks in advance!

ala_747
  • Url should not contain spaces. So the simplest way is to remove them from the second pattern – splash58 May 19 '15 at 19:25
  • Actually, I need to do nothing with the url, the problem is to **avoid** the matches inside the `![...]` part for the first option... or get **only** those inside that `![...]` part for the second one. I think it will be clear from the posted examples. – ala_747 May 19 '15 at 19:41
  • Is the `![Blah` always starting its own line? – chris85 May 19 '15 at 20:19
  • Yes, we assume it is, for sure. – ala_747 May 19 '15 at 20:22

3 Answers

I simplified the first expression a little, since I thought it had some extra capturing groups, and then made both expressions work by adding a lookahead trick:

-First one (Live demo):

\b(vitae)\b(?![^[]*]\s*\()

-Second one (Live demo):

{{([^}}]+)}}\[\[[^\]\]]+\]\](?=[^[]*]\s*\()

Lookahead part explanations:

(?!            # Start of a negative lookahead
    [^[]*]     # Any characters other than "[" up to a closing "]"
    \s*        # Any whitespace
    \(         # Followed by an opening parenthesis
)              # End of lookahead, which confirms the match is not between the brackets of ![...](...)

In the second expression, (?= starts a positive lookahead, so the match succeeds only between the brackets.
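As a rough sketch of how the two lookahead patterns above could be applied with preg_replace: the sample text, the URLs, the replacement strings and the function names are all made up for illustration. Note that the lookahead only inspects what follows the match, so this sketch assumes one token per image and no other [...] spans between the match and the closing bracket.

```php
<?php
// Hypothetical sample; the word "vitae", the token and the URLs are placeholders.
$markdown = "Lorem vitae ipsum\n"
    . "![Blah vitae blah](http://example.com/a.png)\n"
    . "more vitae here {{theWord}}[[1234]]\n"
    . "![Blah {{theWord}}[[1234]] blah](http://example.com/b.png)";

// Option 1: replace the word everywhere except inside the [...] of an image.
// The negative lookahead rejects a match that is followed by "](" with no "[" in between.
function replaceOutsideImages(string $text): string
{
    return preg_replace('/\b(vitae)\b(?![^[]*]\s*\()/', 'XXX', $text);
}

// Option 2: replace the {{...}}[[...]] token only inside the [...] of an image.
// The positive lookahead requires the same "](" tail that option 1 rejects.
function replaceInsideImages(string $text): string
{
    return preg_replace('/{{([^}}]+)}}\[\[[^\]\]]+\]\](?=[^[]*]\s*\()/', 'REPLACED', $text);
}

echo replaceOutsideImages($markdown), "\n\n";
echo replaceInsideImages($markdown), "\n";
```

With this input, option 1 should leave the image on the second line untouched, and option 2 should change only the token in the last line.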

revo

You can leverage the discard technique, which is really useful for these cases. It consists of a pattern like the one below:

patternToSkip1 (*SKIP)(*FAIL)|patternToSkip2 (*SKIP)(*FAIL)| MATCH THIS PATTERN

So, according to your needs:

to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax

You can easily achieve this in PCRE with the (*SKIP)(*FAIL) verbs, so for your case you can use a regex like this:

\[.*?\](*SKIP)(*FAIL)|\bTheWord\b

Or using your pattern:

\[.*?\](*SKIP)(*FAIL)|(\W)(theWord)(\W)

The idea behind this regex is to tell the regex engine to skip the content within [...]

Working demo
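A minimal sketch of both variants above in PHP, assuming a made-up sample text and replacement string (the function names are hypothetical too). Keep in mind that this pattern skips every [...] span, including ordinary links, and that `.` does not cross line breaks unless the s modifier is added.

```php
<?php
// Hypothetical sample; "theWord" and the URL are placeholders.
$markdown = "Say theWord outside\n"
    . "![Blah theWord blah](http://example.com/a.png)\n"
    . "and theWord again";

// Left alternative: consume any [...] span, then (*SKIP)(*FAIL) discards it
// and resumes the search after the closing bracket.
// Right alternative: the text we actually want to replace.
function skipBracketsWordBoundary(string $text): string
{
    return preg_replace('/\[.*?\](*SKIP)(*FAIL)|\btheWord\b/', 'X', $text);
}

// Same idea with the (\W)(theWord)(\W) groups from the question: the surrounding
// non-word characters are consumed, so they are restored via ${1} and ${3}.
function skipBracketsCapturedNeighbours(string $text): string
{
    return preg_replace('/\[.*?\](*SKIP)(*FAIL)|(\W)(theWord)(\W)/', '${1}X${3}', $text);
}

echo skipBracketsWordBoundary($markdown), "\n\n";
echo skipBracketsCapturedNeighbours($markdown), "\n";
```

The (\W) variant cannot match theWord at the very start or end of the text, because it needs a non-word character on each side; the \b variant has no such restriction.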

Federico Piazza
  • True, SKIP-FAIL is the correct trick for the 1st option. However, you should have used `\b` around `theWord` since the `(\W)` is redundant overhead. – Wiktor Stribiżew May 19 '15 at 21:05
  • @stribizhev FAIL is the easier-to-read alternative to a negative lookahead `(?!)` and not a magic – revo May 19 '15 at 21:12
  • @stribizhev You said *SKIP-FAIL is the correct trick* and I mentioned it's nothing but an alternative. – revo May 19 '15 at 21:16
  • @stribizhev Gosh I'm not comparing between alternatives and that's like those guys who can't stick with their girl friend: *-You know, she's a much better alternative.* – revo May 19 '15 at 21:27
  • @stribizhev **Please avoid extended discussions in comments.** Okay SO, I just wanna say one more thing... don't rationalize! – revo May 19 '15 at 21:34

The first regex is easily fixed with a SKIP-FAIL trick:

\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b

Replace with the word of your choice. This is a perfectly valid way in PHP (PCRE) regex to match something outside of some markers.

See Demo 1

As for the second one, it is harder, but achievable with \G, which ensures we match consecutively inside some markers:

(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))

To replace with $1$2{{NEW_REPLACED_TEXT}}[[NEW_DIGITS]]

See Demo 2

PHP:

$re1 = "#\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b#i";
$re2 = "#(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))#i";
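To show how these two patterns might be used, here is a small sketch with preg_replace. The patterns are the ones above, written in single quotes so PHP leaves the backslashes alone; the sample text and the replacement values (XXX, NEW, 5678) are placeholders. The image deliberately contains two tokens so that the \G branch is exercised on the second one.

```php
<?php
// Patterns copied from the answer; the sample text and replacements are hypothetical.
$re1 = '#\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b#i';
$re2 = '#(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))#i';

$markdown = "{{z}}[[9]] and vitae outside\n"
    . "![A vitae {{x}}[[1]] B {{y}}[[2]] C](http://example.com/a.png)";

// Option 1: whole images are skipped, so only the "vitae" outside is replaced.
$first = preg_replace($re1, 'XXX', $markdown);

// Option 2: ${1} and ${2} put back the text consumed before each token,
// so only the tokens inside the image are rewritten.
$second = preg_replace($re2, '${1}${2}{{NEW}}[[5678]]', $markdown);

echo $first, "\n\n", $second, "\n";
```

The ${1} form is used instead of $1 only to keep the group references visually separate from the literal braces that follow.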
Wiktor Stribiżew
  • To be more accurate about the patterns: with your second regex, catastrophic backtracking is very likely to happen, and I have a question: why atomic groups? The engine reports *4 matches - 18720 steps*. Don't you think you are killing performance this way? – revo May 19 '15 at 21:59
  • Once you come across catastrophic backtracking with this regex, please let me know. I am always more worried with accuracy and stability. What good is a quick regex that does not match what you need or overfire? – Wiktor Stribiżew May 19 '15 at 22:03
  • Like when there is a space between `] (`. Spaces are escaped within reserved symbols by the markdown parser and would you mind answering why you came with using atomic groups here? – revo May 19 '15 at 22:15
  • Actually, the `](` *is* part of the markdown. If there are spaces between, they must be added into the pattern. The regex will not fail in case there are stray brackets inside the *text*, not *markdown*. Also, atomic groupings are useful when we are not interested in [backtracking positions remembered by any tokens inside the group](http://www.regular-expressions.info/atomic.html). – Wiktor Stribiżew May 19 '15 at 22:28
  • Forgetting backtracking positions - as I'm aware - is useful when you have multiple piped sequences inside the atomic group, and you don't have any. You're just checking a single expression at a time. So it's totally useless. Isn't it? – revo May 19 '15 at 22:37
  • Tastes differ, like regex flavors. There is no difference here if I use a non-capturing group or atomic one. – Wiktor Stribiżew May 19 '15 at 22:47
  • Good night, I am off to bed – Wiktor Stribiżew May 19 '15 at 22:47
  • I got my answer! Good night! – revo May 19 '15 at 22:48