
Lots of questions on how to tokenize some string using regexp.

I'm looking, however, to tokenize the regexp pattern itself. I'm sure there are posts on the subject, but I cannot find them.

Examples:

^\w$                    -> ['^', '\w', '$']
[3-7]*                  -> ['[3-7]*']
\w+\s\w+                -> ['\w+', '\s', '\w+']
(xyz)*\s[a-zA-Z]+[0-9]? -> ['(xyz)*','\s','[a-zA-Z]+','[0-9]?']

I'm assuming this kind of work is done in Python under the hood whenever a regexp function is called.
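For reference, here is a minimal sketch of such a tokenizer, written as a regex over the pattern itself. It is an assumption-laden illustration, not a full regex parser: it handles the four examples above, but ignores nested groups, alternation, and character classes that contain a literal `]`.

```python
import re

# Sketch only: one "token" is an atom (escape, character class,
# non-nested group, or single character) plus an optional quantifier.
# Known gaps: nested (...) groups, classes like []], alternation with |.
TOKEN = re.compile(r"""
    (?: \\.            # an escape such as \w or \s
      | \[[^\]]*\]     # a character class such as [a-zA-Z] or [3-7]
      | \([^)]*\)      # a non-nested group such as (xyz)
      | .              # any other single character, e.g. ^ or $
    )
    [*+?]?             # an optional quantifier attached to the atom
""", re.VERBOSE)

def tokenize(pattern):
    """Split a regex pattern into atom+quantifier tokens, left to right."""
    return TOKEN.findall(pattern)

print(tokenize(r'(xyz)*\s[a-zA-Z]+[0-9]?'))
# -> ['(xyz)*', '\\s', '[a-zA-Z]+', '[0-9]?']
```

A strict left-to-right scan like this also copes with the tricky cases mentioned in the comments below, such as `[-z]`, `[*]*` and `[[]`, because the class rule consumes everything up to the first `]` after the opening bracket.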

idanshmu
    You can think of regular expressions as shorthand for finite state machines: http://stackoverflow.com/questions/525004/short-example-of-regular-expression-converted-to-a-state-machine – Andrei Savin Feb 02 '17 at 16:44
  • One of the problems is constructions such as `[-z]` vs. `[a-z]` – meta codes change their meaning, depending on their position. A few other ones: `[*]*` and `[[]`. I think strict left-to-right parsing is the only way to do so. – Jongware Feb 02 '17 at 16:50
  • BTW Python uses [boost](http://www.boost.org/) under the hood. – Jongware Feb 02 '17 at 16:51
  • From the question linked by @AndreiSavin, [this answer](http://stackoverflow.com/a/525068/2877364). Also, [the parser](https://github.com/kkos/oniguruma/blob/master/src/regparse.c) from [oniguruma](https://github.com/kkos/oniguruma) ... but that's not exactly implemented in python :) . – cxw Feb 02 '17 at 16:59

1 Answer


One place to start: the PyPy project has an implementation of Python in (mostly) Python. The re.py in the source distribution calls sre-compile.c:_compile() to do the work. You might be able to hack that to provide the output form you want.
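As a quick way to see what that compilation step produces, CPython itself ships a pure-Python regex parser, `sre_parse` (exposed as `re._parser` on 3.11+, where the old name is deprecated). It does not return the token strings asked for above, but its parse tree shows the same atom/quantifier structure:

```python
import sre_parse  # CPython's own regex parser; re._parser on Python 3.11+

# Parse a pattern into (opcode, argument) nodes instead of executing it.
for op, arg in sre_parse.parse(r'\w+\s'):
    print(op, arg)
# \w+ becomes one MAX_REPEAT node, \s one IN node
```

Walking that tree and slicing the original pattern at each node's boundaries would be one way to recover the token list in the question.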

Edit Also, the JavaScript XRegExp library parses regexes in an extended syntax and renders them to standard syntax. The parser routine may help you.

cxw