UPDATE

This is not a complete solution yet. It only works when the value ends in a repeated closing character (e.g. )), ]], }}). I'm still looking for a way to capture enclosed contents in general and will update this.

Code:

>>> import re
>>> re.search(r'(\(.+?[?<!)]\))', '((x(y)z))', re.DOTALL).groups()
('((x(y)z))',)

Details:

r'(\(.+?[?<!)]\))'
  • ( ) - The capturing group.
  • \( and \) - The opening and closing characters (e.g. ', ", (), {}, [])
  • .+? - Lazily match any characters (use with the re.DOTALL flag so that newlines are included)
  • [?<!)] - Intended as a negative lookbehind for ), but as written it is a character class that matches one of ?, <, ! or ). Followed by \), it requires two closing characters in a row, which is why the pattern only handles repeated closing characters. An actual negative lookbehind is written (?<!\)) and consumes no characters (more info here).
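To check what the bracketed part really does, here is a small sketch using only the standard re module. The sample strings and variable names are made up for illustration; the first pattern is the one from the update above.

```python
import re

text = '((x(y)z))'

# [?<!)] is a character class: it consumes one of '?', '<', '!' or ')'.
as_written = re.search(r'(\(.+?[?<!)]\))', text, re.DOTALL)
print(as_written.group(1))

# Because it is a class, a '?' before the closing ')' also satisfies it.
question_mark = re.search(r'(\(.+?[?<!)]\))', '(a?) b', re.DOTALL)
print(question_mark.group(1))

# A real negative lookbehind is spelled (?<!...) and consumes nothing.
# Here a ')' is accepted only when the character before it is not ')'.
lookbehind = re.search(r'(\(.+?(?<!\))\))', text, re.DOTALL)
print(lookbehind.group(1))
```

The first search returns the whole string only because the string happens to end in "))"; the lookbehind version stops at the first ")" and returns ((x(y), so neither form balances nested parentheses.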

I was trying to parse something like a variable assignment statement for a lexer I'm working on, just to learn the basic logic behind interpreters/compilers.

Here are the basic assignment statements and literals I'm dealing with:

az = none
az_ = true
az09 = false
az09_ = +0.9
az_09 = 'az09_'
_az09 = "az09_"
_az = [
  "az",
  0.9
]
_09 = {
  0: az
  1: 0.9
}
_ = (
  true
)

I managed to parse the simple assignments such as none, true, false, and numeric literals. Here's where I'm currently stuck:

import sys
import re

# validate command-line arguments
if (len(sys.argv) != 2): raise ValueError('usage: parse <script>')

# parse the variable name and its value
def handle_assignment(index, source):
    # TODO: handle quoted, bracketed, braced, and parenthesized values
    # group 1 is the variable name, group 2 is the literal value
    variable = re.search(r'(\w+)\s+?=\s+?(none|true|false|[-+]?\d+(?:\.\d+)?|[\'\"].*[\'\"])', source[index:])
    if variable is not None:
        print('{}={}'.format(variable.group(1), variable.group(2)))
        # jump to where the value starts
        index += variable.start(2)
    return index

# parse through the source element by element
with open(sys.argv[1]) as file:
    source = file.read()
    index = 0
    while index < len(source):
        # checks if the line matches a variable assignment statement
        if re.match(r'[\S\D][\w]+\s+?=', source[index:]):
            index = handle_assignment(index, source)
        index += 1

I'm looking for a way to capture the values enclosed in quotation marks, brackets, braces, and parentheses.

I will update this post if I find an answer.

Toto
Küroro
2 Answers

1

Use a regexp with multiple alternatives for each matching pair.

re.match(r'\'.*?\'|".*?"|\(.*?\)|\[.*?\]|\{.*?\}', s)

Note, however, that if there are nested brackets, this will match the first ending bracket, e.g. if the input is

(words (and some more words))

the result will be

(words (and some more words)
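The behaviour described above can be reproduced with a short sketch. The pattern is the one from this answer; the name PAIRS and the sample strings are only illustrative.

```python
import re

# One lazy alternative per pair of delimiters, as in the answer.
PAIRS = re.compile(r'\'.*?\'|".*?"|\(.*?\)|\[.*?\]|\{.*?\}')

# Nested input: the lazy match stops at the first closing parenthesis.
nested = '(words (and some more words))'
print(PAIRS.match(nested).group(0))

# Flat input: each delimited value is found separately.
flat = "'foo' (abc) [1, 2]"
print(PAIRS.findall(flat))
```

Without re.DOTALL the dot does not cross newlines, so multi-line values like the ones in the question would presumably need that flag as well.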

Regular expressions are not appropriate for matching nested structures; you should use a more powerful parsing technique.
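One such technique is a small hand-written scanner that keeps the expected closing characters on a stack. The sketch below is only an illustration, not code from this answer: read_enclosed, PAIRS, CLOSERS and QUOTES are hypothetical names, and treating a backslash as an escape character inside quotes is an assumption about the language being parsed.

```python
PAIRS = {'(': ')', '[': ']', '{': '}'}
CLOSERS = set(PAIRS.values())
QUOTES = '\'"'

def read_enclosed(source, start=0):
    """Return (text, end) for the enclosed value that begins at source[start]."""
    opener = source[start]
    if opener in QUOTES:
        # String literal: scan to the matching, unescaped quote.
        i = start + 1
        while i < len(source):
            if source[i] == '\\':      # assumed escape: skip the next character
                i += 2
                continue
            if source[i] == opener:
                return source[start:i + 1], i + 1
            i += 1
        raise ValueError('unterminated string literal')
    if opener not in PAIRS:
        raise ValueError('not an opening character: {!r}'.format(opener))
    stack = []                         # closers still expected, innermost last
    i = start
    while i < len(source):
        ch = source[i]
        if ch in QUOTES:
            # Skip quoted text whole so brackets inside it are ignored.
            _, i = read_enclosed(source, i)
            continue
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in CLOSERS:
            if stack.pop() != ch:
                raise ValueError('mismatched {!r} at {}'.format(ch, i))
            if not stack:              # outermost pair just closed
                return source[start:i + 1], i + 1
        i += 1
    raise ValueError('unbalanced input')

print(read_enclosed('(words (and some more words)) tail')[0])
```

The returned end index lets a caller continue scanning after the value, which fits the index-based loop in the question.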

Barmar
  • I was thinking if it's possible to just create a set for both the open and closing characters and just _anything_ in the middle? Something like: `r"[\'\"\[\{\(]ANYTHING[\'\"\]\}\)]"` – Küroro Jan 14 '20 at 04:30
  • That will not necessarily match the corresponding end bracket for a starting bracket. So it could match `'XXX)` – Barmar Jan 14 '20 at 16:08
  • I think a simple `.*` instead of `.*?` for each pattern is what you would want? Am I missing something? – Nick Crews Feb 09 '22 at 20:42
  • If you use a greedy regexp, it can match too much. If the string is `'foo' (abc) 'bar'`, the first alternative will match the entire thing instead of just `'foo'`. – Barmar Feb 09 '22 at 20:45
0

A solution for the nested brackets @Barmar mentions, using the third-party regex module:

pip install regex
python3
>>> import regex
>>> recurParentheses = regex.compile(r'[(](?:[^()]|(?R))*[)]')
>>> recurParentheses.findall('(z(x(y)z)x) ((x)(y)(z))')
['(z(x(y)z)x)', '((x)(y)(z))']
>>> recurCurlyBraces = regex.compile(r'[{](?:[^{}]|(?R))*[}]')
>>> recurCurlyBraces.findall('{z{x{y}z}x} {{x}{y}{z}}')
['{z{x{y}z}x}', '{{x}{y}{z}}']
>>> recurSquareBrackets = regex.compile(r'[[](?:[^][]|(?R))*[]]')
>>> recurSquareBrackets.findall('[z[x[y]z]x] [[x][y][z]]')
['[z[x[y]z]x]', '[[x][y][z]]']

For string literals, I suggest taking a look at this.
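The linked material is not reproduced here. As a rough sketch of one common approach, string literals can be matched with the standard re module, since quotes do not nest. Backslash escaping is an assumption about the language, and DOUBLE, SINGLE and STRING are hypothetical names.

```python
import re

# A quote, then any run of non-quote/non-backslash characters or
# backslash-escaped characters, then the same kind of quote.
DOUBLE = r'"(?:[^"\\]|\\.)*"'
SINGLE = r"'(?:[^'\\]|\\.)*'"
STRING = re.compile(DOUBLE + '|' + SINGLE, re.DOTALL)

sample = r'''a = "say \"hi\"" b = 'it\'s' end'''
print(STRING.findall(sample))
```

Each alternative only closes on its own quote character, so a value opened with ' cannot be ended by ".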

Küroro
  • This only matches `()`, can you expand it to handle all the different bracketing pairs? – Barmar Jan 14 '20 at 16:09
  • @Barmar Yes. All you have to do is replace the escaped characters `\(` and `\)`, as well as the matching characters within the set `[^()]`. Here's another example: `regex.findall(r'(\{(?>[^{}]+|(?R))*\})', '{z{x{y}z}x} {{x}{y}{z}}')` – Küroro Jan 15 '20 at 03:02