17

To emphasize, I do not want to "parse using a regex" - I want to "parse a regex into a symbolic tree." (Searching has only brought up the former...)

My use case: to speed up a regex search over a database, I'd like to parse a regex like (foo|bar)baz+(bat)* and pull out all substrings that MUST appear in a match. (In this case, it's just baz because foo/bar are alternations and bat can appear 0 times.)

To do this, I need some understanding of regex operators/semantics. re.DEBUG comes closest:

In [7]: re.compile('(foo|bar)baz+(bat)', re.DEBUG)
subpattern 1
  branch
    literal 102
    literal 111
    literal 111
  or
    literal 98
    literal 97
    literal 114
literal 98
literal 97
max_repeat 1 4294967295
  literal 122
subpattern 2
  literal 98
  literal 97
  literal 116

However, it's just printing out, and the c-implementation doesn't preserve the structure afterwards as far as I can tell. Any ideas on how I can parse this out without writing my owner parser?

munchybunch
  • 6,033
  • 11
  • 48
  • 62
  • 2
    how about using a regex over the regeg pattern? – Netwave Dec 30 '15 at 08:18
  • 5
    @DanielSanchez You can't parse regular expressions with a regular expression. – BlackJack Dec 30 '15 at 13:34
  • @BlackJack, you can regex the regex string, I mean if i have "1|2" for my regex y can regex that string. – Netwave Dec 30 '15 at 13:43
  • 1
    @DanielSanchez You can do that for `1|2` but not for arbitrary regular expressions. You can't turn a regular expression into a symbolic tree like the question asks, you need a parser for a context free grammar lika Ira Baxter's answer explains. – BlackJack Dec 30 '15 at 13:52
  • @BlackJack, yeah, I mean, I knew is way more complicated than that, I was thinking about a regex introspection or something that would give you info about what it can parse. It's hard to explain, but yes, this is just my head imaging stuff, not a solution, thats why I used the comments hehe. Thanks anyway :) – Netwave Dec 30 '15 at 14:00
  • 2
    https://xkcd.com/1313/ ? – Peter Schneider Dec 30 '15 at 15:28

2 Answers2

7

You can maybe just use this:

import sre_parse
sre_parse.parse(r'(\d+)foo(.*)')
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
boxed
  • 3,895
  • 2
  • 24
  • 26
2

You can only specify a (classic) regex using a context free grammar:

 regex = { alternatives };
 alternatives =  primitive { '|' alternatives } ;
 primitive = '(' regex ')' | '[' character_set ']' | ...

This means you can't parse a regex using a regex (Perl is an exception, but then its "regexes" are extended way beyond "classic").

So, to parse a regex, you'll need to build your own parser and constructs some kind of tree (re.Debug comes pretty close) or that magic library you are hoping for.

I suspect this is the easy part. This isn't terribly hard to do yourself; see Is there an alternative for flex/bison that is usable on 8-bit embedded systems? for a straightforward scheme for building such parsers.

To understand the semantics of the regex (e.g., to figure out "necessary substrings"), you might be able to get away with building an analyzer the walks over the parse tree, and for each subtree (bottom up), computes the common-string. Failing that you may have to implement the classic NDFA construction and then walk over it, or implement the NDFA to DFA construction and walk over the DFA. Real regexes tend to contain lots of messy complications such as built-in character sets, capture groups, etc.

The "common string" might not be just a contiguous sequence of characters although you could define it narrowly as such. It might include several constant substrings separated by fixed or variable length gaps of characters, e.g., your necessary substring might always itself be expressible as a "simple regex" of the form:

   (<character>+ ?+) <character>+
Community
  • 1
  • 1
Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • Yeah, I was hoping there was some regex library that let me walk over the NDFA or parse tree; I've used ANTLR and the like a few times and don't miss it at all... re: the "simple regex", you hit complications with examples like `(ab+)*`, where there are no required substrings at the end of the day. Anyhow, thanks for the perspective, this is useful (although going to keep question open in case anyone has ideas to save me from parsing myself) – munchybunch Dec 30 '15 at 22:07