How do I extract a list of elements encased in quotation marks bounded by <> and delimited by commas - python, regex?

Question

Given a string like this:

ORTH < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",

With regex, how do I get a tuple that looks like the following:

('ORTH', ['cali.ber,kl','calf','done'])

I've been doing it as such:

txt = '''ORTH < "cali.ber,kl", 'calf' , "done" >,'''
e1 = txt.partition(" ")[0]
vs = re.search(r"<([A-Za-z0-9_]+)>", txt)
v = vs.group(1)
v1 = [i[1:-1] for i in vs.strip().strip("<>").split(",")]
print v1

But I'm getting None from re.search(), so .group(1) fails. How should it be done to get the desired output?

Lukas Graf · Accepted Answer · 2013-08-07T12:51:02.437

The reason you get None is that your regex doesn't match the sample:

r"<([A-Za-z0-9_]+)>" is missing the comma, the period, both quotation marks and the space character, all of which can occur inside the < > according to your sample.

This one would match:

re.search(r"< ([A-Za-z0-9_.,\"' ]+) >", txt)
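To see the difference, here is a minimal check against the sample string from the question. The variable names narrow and wide are only illustrative:

```python
import re

txt = '''ORTH < "cali.ber,kl", 'calf' , "done" >,'''

# Original character class: no comma, period, quotes or space allowed,
# and no room for the spaces next to < and >.
narrow = re.search(r"<([A-Za-z0-9_]+)>", txt)

# Widened character class, with the spaces after < and before > spelled out.
wide = re.search(r"< ([A-Za-z0-9_.,\"' ]+) >", txt)

print(narrow)         # None, so narrow.group(1) would raise AttributeError
print(wide.group(1))  # the raw text between "< " and " >"
```

Note that group(1) is still one raw string here; it has not been split into fields yet.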

What may also trip you up is that the list of names is delimited by commas, and a comma can itself be part of a value, unescaped.

That means you can't just split that string by ',', but instead need to consider the two different quotation characters (' and ") in order to separate the fields.
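A quick sketch of what goes wrong with a plain split, assuming names_str already holds the text between the angle brackets (the names names_str and naive are just for illustration):

```python
names_str = '''"cali.ber,kl", 'calf' , "done"'''

# Naive approach: split on every comma, then trim whitespace and quotes.
naive = [part.strip().strip("'\"") for part in names_str.split(",")]

print(naive)       # the first value is cut in two at its embedded comma
print(len(naive))  # 4 pieces instead of the 3 intended values
```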

So I'd use this approach:

Use re.match() to split the string into PREFIX < NAMES > parts, and discard the rest.
Use re.findall() to split the names into fields according to the quotation marks.

Edit:

1) According to your first comment, your data can also contain a preamble before the prefix that contains newlines. The default behavior for . is to match everything except newlines.

From the Python re docs:

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

So you need to construct that regex with the re.DOTALL flag. You do this by compiling it first and passing the flag (several flags can be combined with |):

re.compile(pattern, flags=re.DOTALL)
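A small illustration of the effect, using a shortened, made-up input with a newline in the preamble (the text and the pattern .*ORTH are simplified for the demonstration, not taken from your data):

```python
import re

text = "preamble &\n [ ORTH < 'calf' >"

# Without DOTALL, '.' stops at the newline, so a match anchored at the
# start of the string can never reach the part after it.
plain = re.match(r".*ORTH", text)

# With DOTALL, '.' also consumes the newline.
dotall = re.compile(r".*ORTH", flags=re.DOTALL).match(text)

print(plain)                   # None
print(repr(dotall.group(0)))   # everything up to and including ORTH
```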

2) If you include the space character before PREFIX in the regex, it will only match data that actually contains that space, and no longer your first piece of example data. So I use .*?([A-Z\.]*)... to cover both cases. The ? makes the .* non-greedy, so it matches the shortest possible string instead of the longest.
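The non-greedy part matters for what ends up in the prefix group. A minimal sketch on a shortened, made-up input (the names greedy and lazy are only illustrative):

```python
import re

text = "[ ORTH < 'calf' >"

# Greedy: '.*' grabs as much as it can and only gives back what it must,
# so it swallows ORTH and leaves the prefix group empty.
greedy = re.match(r".*([A-Z\.]*) < ", text)

# Non-greedy: '.*?' grows one character at a time and stops as soon as
# the rest of the pattern fits, leaving ORTH for the prefix group.
lazy = re.match(r".*?([A-Z\.]*) < ", text)

print(repr(greedy.group(1)))  # empty string
print(repr(lazy.group(1)))    # the prefix
```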

3) To cover PREFIX.FOO just extend the pattern for the prefix to ([A-Z\.]*) by including the . character and escaping it.

Updated example covering all the cases you mentioned:

import re

TEST_VALUES = [
    """ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",""",
    """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel","""
]

EXPECTED = ('ORTH.FOO', ['cali.ber,kl','calf','done'])


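# DOTALL lets '.' match newlines in the preamble; '.*?' skips the preamble lazily.
pattern = re.compile(r'.*?([A-Z\.]*) < (.*) >.*', flags=re.DOTALL)


for value in TEST_VALUES:
    # Step 1: split into PREFIX and the raw text between < and >.
    prefix, names_str = pattern.match(value).groups()
    # Step 2: pull out each quoted field, whichever quote character it uses.
    names = re.findall('[\'"](.*?)["\']', names_str)

    result = prefix, names
    assert result == EXPECTED

print(result)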
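If you want this as a reusable function, one possible variation is sketched below. It is not part of the approach above as written: the function name extract_prefix_and_names is hypothetical, returning None for non-matching input is an assumption about what you want, and the backreference \1 is a swapped-in refinement that assumes every value closes with the same quote character that opened it (so a value like "it's" survives).

```python
import re

# Same outer pattern as in the example above.
OUTER = re.compile(r'.*?([A-Z\.]*) < (.*) >.*', flags=re.DOTALL)
# Assumption: a value always closes with the quote character that opened it.
QUOTED = re.compile(r'''(['"])(.*?)\1''')


def extract_prefix_and_names(text):
    """Return (prefix, [names]), or None if text has no PREFIX < ... > part."""
    match = OUTER.match(text)
    if match is None:
        return None
    prefix, names_str = match.groups()
    # findall returns (quote, value) tuples because QUOTED has two groups.
    names = [value for _quote, value in QUOTED.findall(names_str)]
    return prefix, names


print(extract_prefix_and_names("""[ ORTH < "it's", 'calf' >,"""))
print(extract_prefix_and_names("no angle brackets here"))
```

Checking for None before calling .groups() avoids the AttributeError you would otherwise get on input without a PREFIX < ... > part.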

Thanks, the regex works for the given string above but it's not working for my data because I had some other chars before the prefix, e.g. `"""calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel","""`. I've tried to catch the pattern with `r'.* ([A-Z]*) < (.*) >.*'` but it isn't working either =( — alvas, Aug 07 '13 at 08:58
Sometimes my data looks like `"""calf_n1 := n_-_c_le & n_-_pn_le &\n [ PHON.POO "whatever",\n ORTH < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel","""` too... — alvas, Aug 07 '13 at 09:02
I don't mind catching `('PHON.POO', 'whatever')` in the `(prefix, name_str)` but I can't seem to get the `CAPS.PREFIX < 'x', "y">` part — alvas, Aug 07 '13 at 09:04
@2er0 updated my answer to cover the additional cases you provided. — Lukas Graf, Aug 07 '13 at 12:53
