5

We are stuck over a regex issue.

Here is the problem. Consider the following two patterns:

1) [hello] [world]

2) [hello [world]]

We need to write a regex able to match only [world] in the first one and the entire pattern ([hello [world]]) in the second.

By using the negative lookahead, I wrote the following regex which solves part of the problem:

\[[^\[\]]+\](?!.*\[[^\[\]]+\])

This regex matches pattern 1) as we want, but does not work for pattern 2).

halfer
  • 19,824
  • 17
  • 99
  • 186
Enrico Massone
  • 6,464
  • 1
  • 28
  • 56
  • Which language/tool do you use? Recursion (if supported) is sually the key to matching nested brackets. – Sebastian Proske Oct 20 '17 at 13:05
  • Hi Sebastian and thanks for helping. We are using .NET. – Enrico Massone Oct 20 '17 at 13:06
  • Make a group with both patterns in it: (pattern|pattern) – Jacob Bruinsma Oct 20 '17 at 13:06
  • Possible duplicate of [How to get text between nested parentheses?](https://stackoverflow.com/questions/19693622/how-to-get-text-between-nested-parentheses) – Sebastian Proske Oct 20 '17 at 13:09
  • Actually the words "hello" and "world" are just examples words. The problem is general and the point is being able to handle nested square bracktes in such a scenario. – Enrico Massone Oct 20 '17 at 13:11
  • Are brackets always balanced? – Casimir et Hippolyte Oct 20 '17 at 13:18
  • Hi @CasimiretHippolyte and thanks for replying. No, the brackets must not be balanced. For instance a pattern like `[Hello world\]]` is a valid text for a markdown link and we should try to support it. – Enrico Massone Oct 20 '17 at 13:45
  • Something like this: [demo](http://regexstorm.net/tester?p=%5c%5b%28%3f%3e%5b%5e%5d%5b%5c%5c%5d%2b%7c%5c%5c.%7c%28%3f%3cc%3e%29%5c%5b%7c%28%3f%3c-c%3e%29%5d%29*%5d%28%3f%28c%29%28%3f!%29%29%28%3f!%5b%5e%5b%5cn%5d*%5c%5b%28%3f%3e%5b%5e%5d%5cn%5c%5c%5d%2b%7c%5c%5c.%29*%5d%29&i=%5bhello%5d+%5bworld%5d%0d%0a%5bhello+%5bworld%5d%5d%0d%0a%5bhello+%5bworld%5d%5d+%5bhello+%5bworld%5d%5d+%5bhello+%5bworld%5d%5d%0d%0a%5bHello+world%5c%5d%5d+%5bHello+world%5c%5d%5d%0d%0a%5bhello%5d+%5bworld%5d+%5bhello+%5bworld%5d%5d+%5b%5b%5b%5d%0d%0a%5bhello%5d+%5bworld%5d+%5bhello+%5bworld%5d%5d+%5b+%5b+%5b%0d%0a) – Casimir et Hippolyte Oct 20 '17 at 14:52

3 Answers3

2

In .NET regex, you may use balanced groups to match nested balanced parentheses. So, to match the last [...] substring (with nested parentheses) on a line you need quite a long pattern like

\[(?:[^][]+|(?<c>)\[|(?<-c>)])*(?(c)(?!))](?!.*\[(?:[^][]+|(?<d>)\[|(?<-d>)])*(?(d)(?!))])

See the regex demo at RegexStorm.net.

Details

  • \[(?:[^][]+|(?<c>)\[|(?<-c>)])*(?(c)(?!))] - a [...] substring with nested brackets:
    • \[ - a [ char
    • (?:[^][]+|(?<c>)\[|(?<-c>)])* - zero or more occurrences of:
      • [^][]+| - 1 or more chars other than ] and [ or
      • (?<c>)\[| - empty value added to Group "c" and a [ is matched
      • (?<-c>)] - empty value is subtracted from Group "c" stack and a ] is matched
    • (?(c)(?!)) - a conditional that fails the match if Group "c" stack is not empty
    • ] - a ] char
  • (?!.*\[(?:[^][]+|(?<d>)\[|(?<-d>)])*(?(d)(?!))]) - not followed with any 0+ chars other than newline symbols followed with the same pattern as the one above.
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thanks for quickly replying our question. We don't know the whole idea of Balanced Group so, at the least, your hint could help us to improve our regex skills. We are not sure to follow this idea to solve our problem, it seems that it is computational expensive and we need to parse a whole text looking for markdown links in order to replace them. Such a text could be quite long. Moreover, it is possible that we have to deal with more complex patterns, such as [Hello [world] or [Hello world]] (scenarios where the number of square brackets is odd. – Enrico Massone Oct 20 '17 at 13:27
  • Your lookahead is complicated for nothing, to check if there isn't one other balanced group before the end of the string, you only have to check it with `(?![^[]*\[[^]]*])`. You don't need a detailed description. – Casimir et Hippolyte Oct 20 '17 at 13:29
  • @CasimiretHippolyte That is not generic enough as it only allows 1 nested level. See [your suggestion demo](http://regexstorm.net/tester?p=%5c%5b%28%3f%3a%5b%5e%5d%5b%5d%2b%7c%28%3f%3cc%3e%29%5c%5b%7c%28%3f%3c-c%3e%29%5d%29*%28%3f%28c%29%28%3f!%29%29%5d%28%3f!.*%28%3f!%5b%5e%5b%5d*%5c%5b%5b%5e%5d%5d*%5d%29%29&i=%5bhello%5d+%5bworld%5d%0d%0a%5bhello+%5bworld%5d%5d). – Wiktor Stribiżew Oct 20 '17 at 13:30
  • You don't need to describe other nesting levels in the lookahead. only one suffice as a proof. test it. – Casimir et Hippolyte Oct 20 '17 at 13:31
  • @CasimiretHippolyte when you test your solution, and are sure it works for all the test cases, you may update this wiki answer. I tested and it does not work (see my link above). – Wiktor Stribiżew Oct 20 '17 at 13:33
  • @WiktorStribiżew: no, you haven't understood, [this is my "suggestion demo"](http://regexstorm.net/tester?p=%5c%5b%28%3f%3e%5b%5e%5d%5b%5d%2b%7c%28%3f%3cc%3e%29%5c%5b%7c%28%3f%3c-c%3e%29%5d%29*%5d%28%3f%28c%29%28%3f!%29%29%28%3f!%5b%5e%5b%5cn%5d*%5c%5b%5b%5e%5d%5cn%5d*%5d%29&i=%5bhello%5d+%5bworld%5d%0d%0a%5bhello+%5bworld%5d%5d) – Casimir et Hippolyte Oct 20 '17 at 13:38
  • Ok, then it makes sense. – Wiktor Stribiżew Oct 20 '17 at 13:40
0

Here another possible solution to match all markdown links if "correctly" escaped.

Here the regex:

\[(?<text>(?:[^\[\]]|\\\[|\\\])+?)\]\((?<link>.+?)\)

See regex 101 demo.

Note that this not support NOT escaped brackets inside links:

[link number \[2]](http://myurl.com)
[link number [2\]](http://myurl.com)

It may also not support other edge cases...

Davide Icardi
  • 11,919
  • 8
  • 56
  • 77
0

A more simple way to find the last balanced square brackets part in a string with the .net regex engine is to search the string from right to left using the Regex.RightToLeft property. This way you avoid:

  • to search all the string for nothing
  • to check the end of the string with a lookahead since the pattern returns the first match on the right.

code:

string input = @"[hello] [world] [hello [world\]] ]";
string rtlPattern = @"(?(c)(?!))\[(?>\\.|(?<!\\)[^][]+|(?<-c>)\[|(?<c>)])*]";
Match m;

m = Regex.Match(input, rtlPattern, RegexOptions.RightToLeft);

if (m.Success)
    Console.WriteLine("Result: {0}", m.Groups[0].Value);

demo

Note that to well understand what happens you also need to read parts of the pattern from right to left. Details:

]  # a literal closing square bracket

(?> # open an atomic group (*)
    \\.         # any escaped character with a backslash
  |
    [^][]+  # all that isn't a square bracket
    (?<!\\) # not preceded by a backslash
  |
    (?<-c>) \[  # decrement the c stack for an opening bracket
  |
    (?<c>)   ]  # increment the c stack for a closing bracket
)* # repeat zero or more times

\[  # a literal square opening bracket

(?(c) # conditional statement: true if c isn't empty
    (?!) # always failing pattern: "not followed by nothing"
)

(*) Note that using an atomic group is mandatory here to avoid an eventual catastrophic backtracking since the group contains an item with a + quantifier and is itself repeated. You can learn more about this problem here.

This pattern already deals with escaped nested brackets and you can also add the Regex.Singleline property if you want to match a part that includes the newline character.

Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125