How to get a submatch of a match in Python regex

Question

I have a string like this:

str='<TOPICS><D>cocoa</D></TOPICS><PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>'

I want to get the strings between <D> and </D> that sit inside <PLACES> and </PLACES>. I already know the following:

import re
p1=re.compile(r'(?<=<PLACES>)(.*?)(?=</PLACES>)')
p2=re.compile(r'(?<=<D>)(.*?)(?=</D>)')

With p1 and p2, I can get el-salvador, usa, uruguay. But how can I get the same result with only a single pattern?

[a must-read](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — shx2, Jan 03 '14 at 17:53

Jerry · Answer 1 · 2014-01-04T08:49:12.183

You can use a regex like this one:

(?<=<D>)([^<>]*)(?=</D>)(?=(?:(?!<PLACES>).)*</PLACES>)

regex101 demo

Where the positive lookahead (?=(?:(?!<PLACES>).)*</PLACES>) makes sure there's a </PLACES> somewhere ahead, without any opening <PLACES> in between what is matched and that closing tag.
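As a minimal sketch of how that pattern could be applied in Python (the variable names text, pattern and places are only illustrative, and the input is assumed to be the single-line string from the question):

```python
import re

# Sample input from the question, without the stray spaces inside the tags.
text = ('<TOPICS><D>cocoa</D></TOPICS>'
        '<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>')

pattern = re.compile(
    r'(?<=<D>)([^<>]*)(?=</D>)'          # text directly between <D> and </D>
    r'(?=(?:(?!<PLACES>).)*</PLACES>)'   # a </PLACES> ahead, with no <PLACES> before it
)

# findall returns the contents of the single capture group for each match.
places = pattern.findall(text)
print(places)  # ['el-salvador', 'usa', 'uruguay']
```

Note that . does not match newlines by default, so if the tags were spread over several lines the pattern would need the re.DOTALL flag.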

But you really should consider using a proper parser, such as BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> text = '<TOPICS><D>cocoa</D></TOPICS><PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>'
>>> soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
>>> for m in soup.find_all('d'):
...     if m.parent.name == 'places':
...         print(''.join(m))
...
el-salvador
usa
uruguay

EDIT: As suggested by JonClements in the comments, you can also use:

>>> for m in soup.select('places d'):
...     print(''.join(m))
...
el-salvador
usa
uruguay

edited Jan 04 '14 at 08:49

answered Jan 03 '14 at 18:13

Alternatively, use select... `for m in soup.select('places d')` – Jon Clements Jan 03 '14 at 18:17
@JonClements Awesome! – Jerry Jan 03 '14 at 18:19
I don't want to parse the whole XML, only the text inside some specific tags. I think regex is faster. Thanks for your advice. :) – Lambda Jan 05 '14 at 03:43
