I have a string like this:

str='<TOPICS><D>cocoa</D></TOPICS><PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>'

I want to get the text between <D> and </D>, but only for the <D> elements inside <PLACES> and </PLACES>. So far I have the following:

p1=re.compile(r'(?<=<PLACES>)(.*?)(?=</PLACES>)')
p2=re.compile(r'(?<=<D>)(.*?)(?=</D>)')

With p1 and p2 together, I can get el-salvador, usa, and uruguay. But how can I get the same result with a single pattern?
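For reference, the two-step approach might look roughly like the sketch below. It assumes the tags contain no spaces (as in the patterns p1 and p2); the names text, block, and places are only illustrative.

```python
import re

# Assumed input: the same data as above, with no spaces inside the tags.
text = ('<TOPICS><D>cocoa</D></TOPICS>'
        '<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>')

p1 = re.compile(r'(?<=<PLACES>)(.*?)(?=</PLACES>)')
p2 = re.compile(r'(?<=<D>)(.*?)(?=</D>)')

# Step 1: isolate the content of the <PLACES> block.
block = p1.search(text)

# Step 2: pull every <D>...</D> value out of that block only.
places = p2.findall(block.group(1)) if block else []
print(places)
```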

Lambda

1 Answer

You can use a regex like this one:

(?<=<D>)([^<>]*)(?=</D>)(?=(?:(?!<PLACES>).)*</PLACES>)

Here the positive lookahead (?=(?:(?!<PLACES>).)*</PLACES>) makes sure there is a </PLACES> somewhere ahead, with no opening <PLACES> between the match and that closing tag. In other words, the matched <D> element must lie inside a <PLACES> block.
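As a quick check, the pattern can be run with re.findall on the sample string. This is only a sketch: the variable names are illustrative, and it assumes the whole input sits on one line, since . does not match newlines unless re.DOTALL is passed.

```python
import re

text = ('<TOPICS><D>cocoa</D></TOPICS>'
        '<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>')

# Single pattern: a <D> value that is followed by </PLACES>
# before any new <PLACES> opens.
p = re.compile(r'(?<=<D>)([^<>]*)(?=</D>)(?=(?:(?!<PLACES>).)*</PLACES>)')

# With one capture group, findall returns the captured strings.
places = p.findall(text)
print(places)
```

The cocoa entry is skipped because an opening <PLACES> appears between it and the next </PLACES>.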

But you really should consider using a proper parser, such as BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> text = '<TOPICS><D>cocoa</D></TOPICS><PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>'
>>> soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly to avoid a warning
>>> for m in soup.find_all('d'):
...     if m.parent.name == 'places':
...         print(''.join(m))
...
el-salvador
usa
uruguay

EDIT: As suggested by JonClements in the comments, you can also use:

>>> for m in soup.select('places d'):
...     print(''.join(m))
...
el-salvador
usa
uruguay
Jerry