I tried to combine two regexes with a logical AND, but my attempt failed.

  1. Pick up anything between '[[' and (']]' or '|') in direct succession:
(?<=(\[\[))(.*?)(?=(\||(\]\])))
  2. Doesn't contain 'http':
^(?:(?!http).)*$

My best try was

(?=(?<=(\[\[))(.*?)(?=(\||(\]\]))))(?=^(?:(?!http).)*$).*$ 

I was following https://stackoverflow.com/a/870506, but it is not working.

My goal is to get all the internal links in a DokuWiki page, typically 'my_page' and 'my_other_page' but not 'http://your_page', in:

[[my_page]]

[[my_other_page|this is my other page]]

[[http://your_page|this is your page]]
Wiktor Stribiżew
  • Try `\[\[(?!https?:\/\/)[^][|]*(?:\|[^][]*)?]]`, see [this regex demo](https://regex101.com/r/uEyJAP/1). – Wiktor Stribiżew Mar 24 '21 at 22:36
  • Can you just apply one regex after the next in your code? Sometimes that's easier than coming up with a Super RegEx... – xdhmoore Mar 24 '21 at 22:41
  • Thanks, but I need to get rid of the brackets as well: I want to match 'my_page' and 'my_other_page' only. – FractalCitta Mar 24 '21 at 22:41
  • Yes, I thought about applying another regex afterward. But the first result would be stored in an array, so I would have to loop over all the results and check them. A bit uglier, but I can do that. – FractalCitta Mar 24 '21 at 22:43
  • Is it OK for http to exist outside the delimiters? eg `[[my_other_page|foo http bar]]` should match `my_other_page`? Does matching input always start/end with delimiters? – Bohemian Mar 24 '21 at 22:47
  • To @Bohemian: yes, it is OK. In DokuWiki, every external link starts with http between '[[' and (']]' or '|'). If there is no http there, it is an internal link. So [[my_other_page|foo http bar]] should match my_other_page. – FractalCitta Mar 24 '21 at 22:52
  • What is the tool or language? – The fourth bird Mar 24 '21 at 23:03
  • @Thefourthbird PHP – FractalCitta Mar 24 '21 at 23:11
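
The two-step idea from the comments can be sketched in PHP as below. This is only an illustration, assuming the sample text from the question and assuming that "external" means the target starts with http:// or https://; the variable names are made up for the example, and the arrow function needs PHP 7.4 or later.

```php
<?php
// Sample text from the question, joined into one page (assumed input).
$page = "[[my_page]]\n[[my_other_page|this is my other page]]\n[[http://your_page|this is your page]]";

// Step 1: the first pattern from the question, slightly simplified:
// everything between "[[" and the next "|" or "]]".
preg_match_all('/(?<=\[\[)(.*?)(?=\||\]\])/', $page, $m);

// Step 2: drop targets that start with http:// or https://.
$internal = array_values(array_filter(
    $m[0],
    fn($target) => !preg_match('/^https?:\/\//', $target)
));

print_r($internal);
```

The loop over the results is hidden inside array_filter, so the second step stays a single expression.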

3 Answers

Then use

(?<=\[\[)(?!https?:\/\/)[^][|]+

or, to also require the closing ]] (after an optional |label part):

(?<=\[\[)(?!https?:\/\/)[^][|]+(?=(?:\|[^][]*)?]])

See the regex demo

Details:

  • (?<=\[\[) - a positive lookbehind that matches a location immediately preceded with [[
  • (?!https?:\/\/) - a negative lookahead that cancels the match if there is http:// or https:// immediately to the right of the current location
  • [^][|]+ - one or more chars other than ], [ and |
  • (?=(?:\|[^][]*)?]]) - a positive lookahead that requires the following sequence of patterns immediately to the right of the current location:
    • (?:\|[^][]*)? - an optional occurrence of a | and then any zero or more chars other than [ and ]
    • ]] - a ]] string.

NOTE: Depending on the regex flavor, you may need to escape the ] and/or [ chars in the character class, i.e. [^][] => [^\][] (JavaScript RegExp) or [^\]\[] (Java, Ruby).
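
Since the question is about PHP, here is a minimal sketch of how the second pattern might be used with preg_match_all. The sample page is the text from the question; the variable names are only illustrative.

```php
<?php
// Sample DokuWiki text taken from the question.
$page = <<<'TXT'
[[my_page]]

[[my_other_page|this is my other page]]

[[http://your_page|this is your page]]
TXT;

// Lookbehind for "[[", reject http(s):// targets, take the target up to
// "|", "[" or "]", then require an optional "|label" followed by "]]".
$re = '/(?<=\[\[)(?!https?:\/\/)[^][|]+(?=(?:\|[^][]*)?]])/';

preg_match_all($re, $page, $matches);
print_r($matches[0]);
```

With this input, $matches[0] holds my_page and my_other_page, without the brackets and without the label.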

Wiktor Stribiżew

As an alternative, you could make use of a SKIP FAIL approach:

\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+

The pattern matches:

  • \[\[https?:\/\/ Match [[http:// or [[https:// (the s is optional)
  • (*SKIP)(*FAIL) Consume the characters that you want to avoid: the match fails here and the next attempt starts after the skipped text
  • | Or
  • \[\[\K Match [[ and forget what is matched so far
  • [^][|]+ Match 1+ times any char except ] [ or |

Regex demo

$strings = [
    "[[my_page]]",
    "[[my_other_page|this is my other page]]",
    "[[http://your_page|this is your page]]",
];

$re = '/\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+/';

foreach ($strings as $s){
    if (preg_match($re, $s, $matches)) {
        var_dump($matches[0]);
    }    
}

Output

string(7) "my_page"
string(13) "my_other_page"

To verify the optional part with | and the closing ]], you can use a positive lookahead

\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+(?=(?:\|[^][]*)?]])

Regex demo

Or, if the last part (after the |) can also contain ] or [

\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+(?=(?:\|.*?)?]])

Regex demo
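
The difference between the last two patterns can be sketched as follows. The input is a made-up page (the bracketed label and the example.org link are assumptions for illustration, not from the question), and $strict / $lenient are just illustrative names.

```php
<?php
// Hypothetical page: the first link has a label that itself contains brackets.
$page = "[[my_page|see [note] here]]\n[[https://example.org|external]]\n[[my_other_page]]";

// Label may not contain "[" or "]".
$strict  = '/\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+(?=(?:\|[^][]*)?]])/';
// Label may contain anything up to the first "]]" on the same line.
$lenient = '/\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+(?=(?:\|.*?)?]])/';

preg_match_all($strict, $page, $m1);
preg_match_all($lenient, $page, $m2);

print_r($m1[0]); // only my_other_page: the bracketed label blocks the first link
print_r($m2[0]); // my_page and my_other_page
```

Note that . does not match a newline by default, so the lenient lookahead cannot run on into the next line.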

The fourth bird

Here is the regex pattern, with some adjustments to Wiktor Stribiżew's pattern from the comment section. By using a lookahead and a lookbehind, you can keep the brackets out of the match.

(?<=\[\[)(?!https?:\/\/)[^][|]*(?:\|[^][]*)?(?=]])
  • Almost perfect, thank you very much. It just also grabs '| this is my other page', and I would like to not have it. The goal is to have only the first part of what follows '[[', and not what comes after '|' if there is one. I have to admit that I don't understand much about all of this regex, but thank you so much. – FractalCitta Mar 24 '21 at 22:56
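
One way to leave the label out, sketched here as an assumption rather than as part of the answer above, is to match the whole link and capture only the target in a group. The pattern is the one from Wiktor Stribiżew's comment with a capturing group added; the sample page is the text from the question.

```php
<?php
// Sample text from the question, joined into one page.
$page = "[[my_page]]\n[[my_other_page|this is my other page]]\n[[http://your_page|this is your page]]";

// Group 1 captures only the link target; the optional "|label" part and the
// closing "]]" are matched but stay outside the group.
$re = '/\[\[(?!https?:\/\/)([^][|]+)(?:\|[^][]*)?]]/';

preg_match_all($re, $page, $matches);
print_r($matches[1]);
```

Here $matches[0] still contains the full links with brackets, while $matches[1] contains only my_page and my_other_page.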