Recursively capture patterns in regex - Python

Question

Given the solution in "How do i extract a list of elements encased in quotation marks bounded by <> and delimited by commas - python, regex?", I was able to capture the prefix and the values of the desired pattern: a CAPITALIZED.PREFIX followed by values within angle brackets, < "value1" , "value2", ... >

"""calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""

However, I run into problems when I have strings like the one above. The desired output would be:

('ORTH.FOO', ['cali.ber,kl','calf','done'])
('ORHT2BAR', ['what so ever >', 'this that mess < up'])
('JOKE', ['whatthe ', 'what'])

I have tried the following, but it only gives me the first tuple. How do I get all possible tuples, as in the desired output?

import re
intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up">\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30" ]."""
pattern = re.compile(r'.*?([A-Z0-9\.]*) < ([^>]*) >.*', flags=re.DOTALL)
f, v = pattern.match(intext).groups()
names = re.findall('[\'"](.*?)["\']', v)
print f, names

Regular expressions **cannot** capture information recursively. You'll have to parse the content twice instead. — Martijn Pieters, Aug 12 '13 at 09:18
so i have to parse till i read the character index of the first capture and then reparse from that index to the end of the string. and do it recursively till my `groups()` returns `None`? — alvas, Aug 12 '13 at 09:20
As Martijn said, your input isn't a regular language so you can't use regular expressions. Just write a small state machine for parsing the input, shouldn't be more than 20something lines... — l4mpi, Aug 12 '13 at 09:21
I'm not sure why `re.findall` is not capturing everything on my machine, but [this regex](http://www.regex101.com/r/jR8uX1) is working on regex101. Otherwise, `re.findall` is extracting the first two parts of your desired output on my machine. — Jerry, Aug 12 '13 at 10:35
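
The small state machine suggested in the comments could look roughly like the sketch below. It is not from the thread: the function name `parse_fields`, the `NAME_CHARS` set, and the rule that a field name is the last run of capitals, digits, and dots before a `<` are all assumptions drawn from the sample input. It is written for Python 3.

```python
import string

# Characters allowed in a field name such as ORTH.FOO (assumed from the examples).
NAME_CHARS = set(string.ascii_uppercase + string.digits + ".")


def parse_fields(text):
    """Return (name, values) for every NAME < "v1", 'v2', ... > block."""
    fields = []
    name, fresh = "", False      # latest candidate name; is it still growing?
    current, values = "", None   # field being filled while inside < ... >
    quote, buf = None, []        # open quote character and its contents so far
    for ch in text:
        if quote is not None:              # inside a quoted string
            if ch != quote:
                buf.append(ch)
                continue
            if values is not None:         # keep only strings inside < ... >
                values.append("".join(buf))
            quote, buf = None, []
        elif ch in "\"'":                  # a quoted string opens
            quote, name, fresh = ch, "", False
        elif values is not None:           # between values inside < ... >
            if ch == ">":
                fields.append((current, values))
                values = None
        elif ch == "<":                    # a block opens if a name precedes it
            if name:
                current, values = name, []
            name, fresh = "", False
        elif ch in NAME_CHARS:
            name = name + ch if fresh else ch
            fresh = True
        elif ch.isspace():
            fresh = False                  # name is complete but still valid
        else:
            name, fresh = "", False        # any other character discards it
    return fields


intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""

parsed = parse_fields(intext)
for field in parsed:
    print(field)
```

Because quotes are tracked explicitly, a `<` or `>` inside a quoted string (as in `"<20>"` or `"what so ever >"`) never opens or closes a block. Escaped quotes inside values are not handled.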

score 1 · Answer 1 · answered Aug 12 '13 at 09:28

Regular expressions do not support 'recursive' parsing. Process the group between the < and > characters after capturing it with a regular expression.

The shlex module would do nicely here to parse your quoted strings:

import shlex
import re

intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up">\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30" ]."""
pattern = re.compile(r'.*?([A-Z0-9\.]*) < ([^>]*) >.*', flags=re.DOTALL)
f, v = pattern.match(intext).groups()

parser = shlex.shlex(v, posix=True)
parser.whitespace += ','
names = list(parser)

print f, names

output:

ORTH.FOO ['cali.ber,kl', 'calf', 'done']
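
To collect every block rather than only the first, one option is to iterate with `finditer` and hand each captured group to `shlex`. The sketch below is not part of the original answer: it replaces the `[^>]*` pattern with a quote-aware one (the names `FIELD` and `split_values` are hypothetical) so that a `>` inside quotes does not end the block early, and it uses Python 3 `print()`.

```python
import re
import shlex

intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""

# Group 1: the capitalized prefix. Group 2: everything up to the closing ">",
# where quoted strings are consumed whole so they may contain "<" or ">".
FIELD = re.compile(r'''([A-Z0-9.]+)\s*<((?:"[^"]*"|'[^']*'|[^>"'])*)>''')


def split_values(body):
    """Split a comma-separated run of quoted strings, as in the answer above."""
    parser = shlex.shlex(body, posix=True)
    parser.whitespace += ','
    return list(parser)


fields = [(m.group(1), split_values(m.group(2))) for m in FIELD.finditer(intext)]
for field in fields:
    print(field)
```

This still scans the text only once per level: the regular expression finds the blocks, and `shlex` splits the values inside each one. A quoted value that itself contains something shaped like `NAME <...>` could still produce a spurious match, so treat the pattern as a starting point.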

You can use the recursive pattern in the [regex](https://pypi.python.org/pypi/regex) module (it's not supported in re), see http://stackoverflow.com/q/26385984/1240268... though I'm not sure if it helps in this (confusing) example of splitting. — Andy Hayden, Oct 15 '14 at 22:42

score 1 · Accepted Answer · answered Aug 12 '13 at 11:57

Huh silly me. Somehow, I wasn't testing the whole string on my machine ^^;

Anyway, I used this regex and it works; you get the results you were looking for in a list, which I guess is okay. I'm not too good at Python and don't know how to transform this list into an array or tuple:

>>> import re
>>> intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""
>>> results = re.findall('\\n .*?([A-Z0-9\.]*) < *((?:[^>\n]|>")*) *>.*?(?:\\n|$)', intext)
>>> print results
[('ORTH.FOO', '"cali.ber,kl", \'calf\' , "done" '), ('ORHT2BAR', '"what so ever >", "this that mess < up"'), ('JOKE', '\'whatthe \', "what" ')]

The parentheses mark the first-level elements (one tuple per match), and the single-quoted strings inside each tuple are the second-level elements.
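
Turning that list into the `(prefix, [values])` tuples from the question takes one more pass over the second element of each tuple. The sketch below is an illustration, not part of the original answer: `OUTER` is the answer's pattern rewritten as a raw string (which should behave the same), while `QUOTED` and `pairs` are hypothetical names. It uses Python 3 `print()`.

```python
import re

intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""

# The answer's pattern, written as a raw string so the backslashes reach `re` intact.
OUTER = r'\n .*?([A-Z0-9.]*) < *((?:[^>\n]|>")*) *>.*?(?:\n|$)'

# A quoted string: the backreference \1 requires the closing quote to match
# the opening one, so an apostrophe inside "..." does not end the value.
QUOTED = re.compile(r'''(["'])(.*?)\1''')

pairs = [
    (name, [m.group(2) for m in QUOTED.finditer(body)])
    for name, body in re.findall(OUTER, intext)
]
for pair in pairs:
    print(pair)
```

Each element of `pairs` is a tuple whose second item is a list, which matches the shape of the desired output in the question.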

answered Aug 12 '13 at 11:57

Jerry

interesting that you don't need the `re.DOTALL` flag, because you put the `\n` into the regex. – alvas Aug 12 '13 at 12:09
@2er0 Well, it seemed that `\n` is not inserting newlines in the `intext`, so I matched the literal `\n` instead. And the `\n` was actually when I was testing some stuff out and forgot to remove it, oops! Not that it hinders the regex in any way though. I was trying the regex on an `intext` with `\n` as true new lines when I put the `\n` there. – Jerry Aug 12 '13 at 12:29
The regex without the extra `\n`: `\\n .*?([A-Z0-9\.]*) *< *((?:[^>]|>")*) *>.*?(?:\\n|$)` and the [demo](http://www.regex101.com/r/oY2pS6). Also, maybe worth noting that I'm explicitly allowing `>"` within the `< ... >` part. Not sure if this might cause a problem, but the pattern seems to be that the 'true' `>` is followed shortly by a comma (with optional space in between). – Jerry Aug 12 '13 at 12:47
yeah, the "true" `>` is signalled by a comma (w|w/o a space). Care to explain the part on the 'non-capturing group', `.*?(?:\\n|$)` ? – alvas Aug 12 '13 at 16:01
@2er0 Sure. You can remove it on regex101 (the link in my previous comment named 'demo') and see what happens. It is basically there to ensure that the stuff being matched is between `\n` (or at the end of the string, since there would be no `\n` there). You'll observe that without it, `<20>",\nLOOSE.SCREW ">` is considered as one match (see `([A-Z0-9\.]*)`, which can be absent too). Actually, I just found a shortcoming which might or might not cause problems. Do the things you are matching alternate? See [this edited regex](http://www.regex101.com/r/gG9jS6). – Jerry Aug 12 '13 at 16:32
