To emphasize, I do not want to "parse using a regex" - I want to "parse a regex into a symbolic tree." (Searching has only brought up the former...)
My use case: to speed up a regex search over a database, I'd like to parse a regex like (foo|bar)baz+(bat)* and pull out all substrings that MUST appear in a match. (In this case, it's just baz, because foo and bar are alternatives and bat can appear 0 times.)
To do this, I need some understanding of regex operators/semantics. re.DEBUG comes closest:
In [7]: re.compile('(foo|bar)baz+(bat)', re.DEBUG)
subpattern 1
  branch
    literal 102
    literal 111
    literal 111
  or
    literal 98
    literal 97
    literal 114
literal 98
literal 97
max_repeat 1 4294967295
  literal 122
subpattern 2
  literal 98
  literal 97
  literal 116
However, it only prints the tree, and as far as I can tell the C implementation doesn't preserve the structure afterwards. Any ideas on how I can parse this out without writing my own parser?
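To make the goal concrete, here is a rough sketch of the extraction I'm after. It assumes the parser module that re uses internally (sre_parse, or re._parser on Python 3.11+) hands back a nested sequence of (op, argument) tuples. That module is an undocumented implementation detail rather than a public API, so its tuple layout may differ between versions, and required_literals is just a hypothetical helper name. The sketch is deliberately conservative: it ignores alternations entirely and does not account for flags such as re.IGNORECASE.

```python
# Assumption: these internal modules are not a public API and may change.
try:
    from re import _parser as sre_parse, _constants as sre_constants  # Python 3.11+
except ImportError:
    import sre_parse
    import sre_constants


def required_literals(pattern):
    """Return literal substrings that must appear in any match of `pattern`.

    Conservative sketch: anything that is not a plain literal, a group, or a
    repeat with a minimum of at least 1 simply ends the current literal run.
    """
    found = []
    run = []

    def flush():
        # Close the current run of adjacent mandatory literals, if any.
        if run:
            found.append("".join(run))
            del run[:]

    def walk(items):
        for op, av in items:
            if op == sre_constants.LITERAL:
                run.append(chr(av))
            elif op == sre_constants.SUBPATTERN:
                # The group body is the last element of the argument tuple;
                # its contents are mandatory and adjacent to their neighbours.
                walk(av[-1])
            elif op in (sre_constants.MAX_REPEAT, sre_constants.MIN_REPEAT):
                lo, _hi, sub = av
                if lo >= 1:
                    # The first iteration always directly follows what came before.
                    walk(sub)
                # Later iterations may or may not occur, so adjacency ends here.
                flush()
            else:
                # Branches, character classes, '.', anchors, lookarounds, etc.
                flush()

    walk(sre_parse.parse(pattern))
    flush()
    return found


print(required_literals("(foo|bar)baz+(bat)*"))  # ['baz']
```

A fuller version could also intersect the required substrings of each branch alternative instead of discarding them, but this shows the tree walk I have in mind.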