UPDATE
This is still not a complete solution. It only handles values that end in repeated closing characters (e.g. `))`, `]]`, `}}`). I'm still looking for a way to capture enclosed contents and will update this.
Code:
>>> import re
>>> re.search(r'(\(.+?[?<!)]\))', '((x(y)z))', re.DOTALL).groups()
('((x(y)z))',)
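Details:
`r'(\(.+?[?<!)]\))'`
`()` - Capturing group.
`\(` and `\)` - The escaped opening and closing characters (replace them with the pair you need, e.g. `'`, `"`, `()`, `{}`, `[]`).
`.+?` - Lazily match any character content (use with the `re.DOTALL` flag so newlines are included).
`[?<!)]` - Meant as a negative lookbehind for the character `)` (replace this with the matching closing character). As written, however, the square brackets make it a character class that matches one of `?`, `<`, `!` or `)`, so the pattern stops at the first `)` that directly follows one of those characters. That is why it only works for repeated closing characters. A real negative lookbehind, which accepts a `)` only when another `)` does not precede it, is written `(?<!\))` (more info here).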
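To see how Python's `re` module reads the two spellings, here is a small sketch. The variable names are only for illustration. It compares the bracketed form used above with the lookbehind syntax `(?<!\))` on the same input.

```python
import re

text = '((x(y)z))'

# Square brackets always build a character class: this matches ONE
# character that is '?', '<', '!' or ')'.
class_hits = re.findall(r'[?<!)]', 'a?b<c!d)')
print(class_hits)

# The pattern from the post: lazy content, then one character from the
# class, then a literal ')'. It stops at the first "))" it can reach.
with_class = re.search(r'(\(.+?[?<!)]\))', text, re.DOTALL).group(1)
print(with_class)

# A real negative lookbehind: the closing ')' is accepted only if the
# character right before it is NOT another ')'.
with_lookbehind = re.search(r'(\(.+?(?<!\))\))', text, re.DOTALL).group(1)
print(with_lookbehind)
```

The character-class version returns the whole string here only because the input happens to end in `))`. The lookbehind version stops at the first `)` that follows a non-`)` character, so neither form tracks nesting.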
I was trying to parse something like a variable assignment statement for this lexer thing I'm working with, just trying to get the basic logic behind interpreters/compilers.
Here are the basic assignment statements and literals I'm dealing with:
az = none
az_ = true
az09 = false
az09_ = +0.9
az_09 = 'az09_'
_az09 = "az09_"
_az = [
    "az",
    0.9
]
_09 = {
    0: az
    1: 0.9
}
_ = (
    true
)
Somehow, I managed to parse the simple assignments like `none`, `true`, `false`, and numeric literals. Here's where I'm currently stuck:
import sys
import re

# validate command-line arguments
if len(sys.argv) != 2:
    raise ValueError('usage: parse <script>')

# parse the variable name and its value
def handle_assignment(index, source):
    # TODO: handle quotations, brackets, braces, and parenthesis values
    variable = re.search(r'[\S\D]([\w]+)\s+?=\s+?(none|true|false|[-+]?\d+\.?\d+|[\'\"].*[\'\"])', source[index:])
    if variable is not None:
        print('{}={}'.format(variable.group(1), variable.group(2)))
        index += source[index:].index(variable.group(2))
    return index

# parse through the source element by element
with open(sys.argv[1]) as file:
    source = file.read()
    index = 0
    while index < len(source):
        # checks if the line matches a variable assignment statement
        if re.match(r'[\S\D][\w]+\s+?=', source[index:]):
            index = handle_assignment(index, source)
        index += 1
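The pattern inside `handle_assignment` can also be tried on an in-memory string, without a script file. The sketch below is only a quick check, and the names `ASSIGNMENT`, `sample` and `pairs` are made up for it. The sample text begins with a newline because the pattern expects one character in front of every variable name.

```python
import re

# Same pattern as in handle_assignment, compiled once for reuse.
ASSIGNMENT = re.compile(
    r'[\S\D]([\w]+)\s+?=\s+?(none|true|false|[-+]?\d+\.?\d+|[\'\"].*[\'\"])'
)

# Leading newline: [\S\D] consumes one character before each name.
sample = (
    "\n"
    "az = none\n"
    "az_ = true\n"
    "az09 = false\n"
    "az09_ = +0.9\n"
    "az_09 = 'az09_'\n"
    "_az09 = \"az09_\"\n"
)

# findall returns one (name, value) tuple per assignment it recognises.
pairs = ASSIGNMENT.findall(sample)
for name, value in pairs:
    print('{}={}'.format(name, value))
```

Only the single-line literals are picked up this way. The bracket, brace and parenthesis values that span several lines are not covered by this pattern.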
I was looking for a way to capture values enclosed in quotation marks, brackets, braces, and parentheses.
I'll probably update this post if I find an answer.
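One possible direction, sketched here under assumptions rather than as a finished answer: Python's built-in `re` has no recursive patterns, so nested values can be captured with a small hand-written scanner that keeps a stack of expected closing characters. The function name `capture_enclosed` is hypothetical, and the sketch assumes quoted strings contain no escaped quotes.

```python
# Opening characters mapped to the closing character each one expects.
PAIRS = {'(': ')', '[': ']', '{': '}'}
QUOTES = ('"', "'")


def capture_enclosed(source, start):
    """Return (text, end) for the enclosed value beginning at source[start].

    `end` is the index just past the closing character.
    Assumption: quoted strings contain no escaped quotes.
    """
    opener = source[start]
    if opener in QUOTES:
        # A quoted value ends at the next identical quote character.
        close = source.find(opener, start + 1)
        if close == -1:
            raise ValueError('unterminated quote')
        return source[start:close + 1], close + 1
    if opener not in PAIRS:
        raise ValueError('not an opening character: {!r}'.format(opener))

    stack = []  # closing characters still waiting to be seen
    i = start
    while i < len(source):
        ch = source[i]
        if ch in QUOTES:
            # Jump over quoted text so brackets inside strings are ignored.
            close = source.find(ch, i + 1)
            if close == -1:
                raise ValueError('unterminated quote')
            i = close
        elif ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch == stack[-1]:
            stack.pop()
            if not stack:
                return source[start:i + 1], i + 1
        elif ch in PAIRS.values():
            raise ValueError('mismatched closing character at {}'.format(i))
        i += 1
    raise ValueError('unterminated enclosure')


source = '_az = [\n"az",\n0.9\n]\nrest'
text, end = capture_enclosed(source, source.index('['))
print(repr(text))
```

In the loop from the question, this could be called once the assignment pattern has consumed the `=`, with `start` pointing at the first non-space character of the value, and `end` used as the next value of `index`.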