
I'm trying to write a regex that can find & symbols that are not inside brackets. For example:

gotcha & symbol [but not that & one [even that & one] and not this & one] but this  is & ok

I ended up with this regex, but I can't figure out how to handle nested brackets properly:

&(?![^[]*])


2 Answers


I wouldn't use regex for nested constructs. Some regex flavours might be able to do it (see this post for details), but in your case, if you're already using Python, a simple loop over your string will do:

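input_str = "gotcha & symbol [but not that & one [even that & one] and not this & one] but this  is & ok"
bracket_stack = 0  # current nesting depth; 0 means we are outside all brackets
found_symbols = []
for i, c in enumerate(input_str):
    if c == '[':
        bracket_stack += 1
    elif c == ']':
        bracket_stack -= 1
        if bracket_stack < 0:
            # a closing bracket with no matching opening bracket
            print('Unbalanced brackets! Check input string!')
            break
    elif c == '&' and bracket_stack == 0:
        found_symbols.append((i, c))

if bracket_stack > 0:
    # an opening bracket was never closed
    print('Unbalanced brackets! Check input string!')

for i, c in found_symbols:
    print(f'Found symbol {c} at position {i}')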

Output:

Found symbol & at position 7
Found symbol & at position 87

This can easily be generalized to more kinds of brackets/parentheses and more symbols.
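As a rough sketch of that generalization, the counter can be swapped for a real stack of expected closing brackets, so that mixed bracket types are also checked for correct pairing. The function name find_top_level_symbols and its parameters symbols and pairs are made up for illustration, and raising ValueError on unbalanced input is just one possible choice:

```python
def find_top_level_symbols(text, symbols="&", pairs=None):
    """Return (index, char) for every symbol that sits outside all brackets.

    The function and parameter names are hypothetical, chosen for this sketch.
    """
    if pairs is None:
        # assumed default set of bracket pairs; adjust as needed
        pairs = {"[": "]", "(": ")", "{": "}"}
    closers = set(pairs.values())
    stack = []  # expected closing brackets, innermost last
    found = []
    for i, c in enumerate(text):
        if c in pairs:
            # opening bracket: remember which closer we now expect
            stack.append(pairs[c])
        elif c in closers:
            # closing bracket: it must match the innermost open bracket
            if not stack or stack.pop() != c:
                raise ValueError(f"Unbalanced bracket {c!r} at position {i}")
        elif c in symbols and not stack:
            # a wanted symbol while no bracket is open
            found.append((i, c))
    if stack:
        raise ValueError(f"Missing closing bracket {stack[-1]!r}")
    return found


input_str = "gotcha & symbol [but not that & one [even that & one] and not this & one] but this  is & ok"
print(find_top_level_symbols(input_str))  # [(7, '&'), (87, '&')]
```

Passing symbols="&|" would collect both characters, and a custom pairs dict restricts or extends which brackets count.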

Tranbi

First off – @Tranbi is correct. You should not use regular expressions for this. It is much easier to understand and maintain using the method that they have provided.

With that disclaimer out of the way, it is possible to do this using features borrowed from PCRE and similar engines (recursion and backtracking control verbs), which are available in the third-party regex module. It is not in the standard library, so you'd need to install it.

The technique in the linked post for matching balanced brackets gets you part of the way, but doesn't cover the fact that you're actually trying to match the parts of the string outside the brackets. This requires some trickery with backtracking control verbs:

import regex as re

input_str = 'gotcha & symbol [but not that & one [even that & one] and not this & one] but this  is & ok'

for match in re.finditer(r'&|(\[(?:[^\[\]]+|(?1))*\])(*SKIP)(*FAIL)', input_str):
    print(f"Found symbol {match.group(0)} at position {match.span()}")

Output:

Found symbol & at position (7, 8)
Found symbol & at position (87, 88)

We can unpack the pattern a bit:

r'''(?x)
 &            # The pattern that we're looking for - just the ampersand
|             # ... or ...
 (            # Capturing group #1, which matches a balanced bracket group
  \[          # which consists of a square bracket ...
   (?:
    [^\[\]]+  # ... followed by a run of one or more non-bracket characters ...
    | (?1)    # ... or a balanced bracket group (i.e. recurse, to match group #1) ...
   )*
  \]          # ... and then the matching end bracket.
 )            # End of capturing group #1.
              # BUT we don't want to match anything between brackets, so ...
 (*SKIP)      # ... tell the engine not to retry from inside the text just matched ...
 (*FAIL)      # ... and force a failure, so the bracket group itself is never reported.
'''

So there you go! Fancy patterns for pattern matching abuse. Once again – a little hand-rolled parser is by far the better way to solve this problem, but it is fun to check in every few years on what you can do with the regex module.

motto