I'm looking for a regex to extract the content between [ and ] that may include escaped characters. Some examples are given below.

+------------------+----------------+
|      Input       |     Output     |
+------------------+----------------+
| [B]              | B              |
| [B][C][D]        | B              |
| [hello\t\tworld] | hello\t\tworld |
| [hello\n\nworld] | hello\n\nworld |
| [\\]]            | \\]            |
| [\\\\]           | \\\\           |
| [x[y\\]z][foo]   | x[y\\]z        |
+------------------+----------------+

For strings like [B][C][D], returning the smallest match is fine, since the desired pattern p will be matched iteratively as p+. This seems to call for some kind of negative lookbehind, but I don't know whether one can express it (i.e., consume until you see a ] that is not preceded by one or more \\).

Abhijit Sarkar
  • Or [from this answer](https://stackoverflow.com/a/21062778/5527985): [`\[((?:(?:\\\\)?.)*?)\]`](https://regex101.com/r/4PFyCy/1) – bobble bubble Jun 18 '23 at 20:07
  • Using [atomic group](https://www.regular-expressions.info/atomic.html): [`\[([^\\\]]*(?>\\\\?.\\?[^\\\]]*)*)\]`](https://regex101.com/r/bmoTob/1) – bobble bubble Jun 18 '23 at 21:15

3 Answers

The regex you need is mostly the same as this one.

(?<=\[)            # Match something that follows a '[',
(?:                # which is either
  \\{2}            # an escaped backslash
  (?:[^\\]|\\{2})  # followed by a non-backslash or an escaped backslash
|                  # or
  [^\]]            # a non-closing-bracket
)+                 # 1 or more times,
(?=\])             # and precedes a ']'

Try it on regex101.com.
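
As a quick check, here is a minimal sketch that runs the pattern above against the inputs from the question. Python's `re` module is an assumption (the question does not name a language), and `BRACKET_CONTENT` and `first_bracket_content` are hypothetical names used only for illustration.

```python
import re

# The pattern from this answer, compiled with re.VERBOSE so the comments stay inline.
BRACKET_CONTENT = re.compile(r"""
    (?<=\[)              # position right after a '['
    (?:
        \\{2}            # two literal backslashes (the escape prefix)
        (?:[^\\]|\\{2})  # then one non-backslash, or two more backslashes
      |
        [^\]]            # or any single character other than ']'
    )+
    (?=\])               # position right before a ']'
""", re.VERBOSE)


def first_bracket_content(text):
    """Return the first bracketed content in text, or None if there is none."""
    match = BRACKET_CONTENT.search(text)
    return match.group(0) if match else None


# Raw strings keep every backslash literal, mirroring the table in the question.
for sample in [r"[B]", r"[B][C][D]", r"[hello\t\tworld]",
               r"[\\]]", r"[\\\\]", r"[x[y\\]z][foo]"]:
    print(sample, "->", first_bracket_content(sample))
```

Because the brackets are matched by lookarounds rather than consumed, the whole match (`group(0)`) is already the content, so no capture group is needed.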

InSync

Here's a relatively short pattern that avoids the use of alternation:

\[(.+?)(?<!(?<!\\)\\{2})\]

https://regex101.com/r/nUu2tr/1

Edit:

Copied from a comment for completeness' sake.

The nested negative lookbehind in this pattern applies to the final \], which matches a literal ]. The outer lookbehind says that a ] must not be accepted as the closing bracket if it is preceded by two backslashes. The inner lookbehind says that this rule does not apply if those two backslashes are themselves preceded by another backslash.
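
The two rules described above can be seen in a small sketch. It assumes Python's `re` module, which accepts a zero-width lookbehind nested inside a fixed-width lookbehind; `CLOSING_BRACKET` and `extract` are hypothetical names for illustration only.

```python
import re

# The pattern from this answer: group 1 is the bracket content.
CLOSING_BRACKET = re.compile(r"\[(.+?)(?<!(?<!\\)\\{2})\]")


def extract(text):
    """Return the content of the first bracket pair, or None if nothing matches."""
    match = CLOSING_BRACKET.search(text)
    return match.group(1) if match else None


# Outer rule: the first ']' follows two backslashes, so it is skipped
# and the second ']' closes the match.
print(extract(r"[\\]]"))

# Inner rule: the two backslashes before ']' are themselves preceded by
# a backslash, so the outer rule is switched off and this ']' closes the match.
print(extract(r"[\\\\]"))

# No backslashes involved: the lazy .+? stops at the first ']'.
print(extract(r"[B][C][D]"))
```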

Abhijit Sarkar
CAustin
  • Would you mind breaking the pattern down, and elaborating a bit? I see that there's a nested negative lookbehind, but don't quite understand how's it working. – Abhijit Sarkar Jun 18 '23 at 02:01
  • The nested negative lookbehind in this pattern operates on the final `\]`, which is a string literal `]`. The outer part of the block specifies that the matched `]` should not match if it's preceded by two back slashes `\`. The inner part specifies that this rule should not apply if these two back slashes are themselves preceded by another back slash. – CAustin Jun 18 '23 at 04:14
  • CAustin, got it, thanks, and upvoted already. I’ve accepted the answer from InSync because he got here earlier :) – Abhijit Sarkar Jun 18 '23 at 04:36
You can use the following.

\[(.+?(?<!\\)|\\+)\]

I was unsure how to handle the [\\\\] case, so I added \\+ as a separate alternative that accepts content made up entirely of backslashes.

Output

B
B
hello\t\tworld
hello\n\nworld
\\]
\\\\
x[y\\]z
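
For reference, a minimal sketch of how the output above could be reproduced, and how the pattern behaves when applied repeatedly as the question describes. Python's `re` module is an assumption, and `PATTERN`, `first_content`, and `all_contents` are hypothetical names.

```python
import re

# The pattern from this answer: group 1 is the bracket content.
PATTERN = re.compile(r"\[(.+?(?<!\\)|\\+)\]")


def first_content(text):
    """Return the content of the first bracket pair, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


def all_contents(text):
    """Return the content of every bracket pair, scanning left to right."""
    return [m.group(1) for m in PATTERN.finditer(text)]


# Raw strings keep every backslash literal, mirroring the table in the question.
samples = [r"[B]", r"[B][C][D]", r"[hello\t\tworld]", r"[hello\n\nworld]",
           r"[\\]]", r"[\\\\]", r"[x[y\\]z][foo]"]
for sample in samples:
    print(first_content(sample))

# Repeated matching picks up each bracket pair in turn.
print(all_contents(r"[B][C][D]"))
```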
Reilas