Using re to find nested results in strings

Question

Hi all, I have the string below:

test = '<tr> <stuff1> <tr><stuff2> </tr> </tr>'

and I would like Python to return the following:

result=['<tr><stuff1><tr><stuff2></tr></tr>','<tr><stuff2></tr>']

I've tried re.findall('<tr>.+</tr>', test), but that just returns the entire string...

Thanks

Don't parse [html/xml](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) with regexes... Regexes aren't the right tool for it. — Willem Van Onsem, Feb 22 '15 at 20:27
@user5061: now imagine you are parsing a table with ``. The rules of html/xhtml are extremely complicated... — Willem Van Onsem, Feb 22 '15 at 20:43
Using `.+?` instead of `.+` would match "as few as possible", probably solving your problem. _Note: Regular expressions **should not** be used for your problem._ — user, Feb 22 '15 at 20:46
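
To see how the two quantifiers behave on the string from the question, here is a quick check (a sketch only; the names `greedy` and `lazy` are chosen for illustration). The greedy pattern returns the whole string as a single match, while the lazy one stops at the first `</tr>`, so neither produces the nested pair the question asks for.

```python
import re

test = '<tr> <stuff1> <tr><stuff2> </tr> </tr>'

# Greedy: .+ runs to the end and backtracks to the LAST </tr>.
greedy = re.findall('<tr>.+</tr>', test)

# Lazy: .+? stops at the FIRST </tr>, which closes the inner <tr>,
# so the match starts at the outer tag but ends at the inner one.
lazy = re.findall('<tr>.+?</tr>', test)

print(greedy)  # one match: the entire string
print(lazy)    # one match: '<tr> <stuff1> <tr><stuff2> </tr>'
```

Because `re.findall` only returns non-overlapping matches, the inner `<tr>` can never be reported separately once it has been consumed as part of a larger match.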

score 2 · Accepted Answer · answered Feb 22 '15 at 20:26


You should use an HTML parser to parse HTML:

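from bs4 import BeautifulSoup

html = """<tr> <stuff1> <tr><stuff2> </tr> </tr>"""
# Name the parser explicitly; otherwise bs4 picks whichever is installed and warns.
soup = BeautifulSoup(html, "html.parser")

# find_all searches all descendants, so nested <tr> elements are returned too.
print(soup.find_all("tr"))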
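
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser` module. This is a minimal illustration, not part of the original answer: the names `NestedTagCollector` and `find_nested` are hypothetical, and the sketch assumes well-formed, properly closed tags. It keeps a stack of the offsets of open `<tr>` tags and slices the source text whenever a matching `</tr>` arrives, so whitespace is preserved as it appears in the input.

```python
from html.parser import HTMLParser


class NestedTagCollector(HTMLParser):
    """Collect the source text of every <tag>...</tag> pair, nested or not."""

    def __init__(self, source, tag="tr"):
        super().__init__()
        self.source = source
        self.tag = tag
        self.open_offsets = []  # stack of start offsets of still-open tags
        self.found = []         # (start offset, matched text) pairs
        # Index at which each line begins, used to turn getpos() into an offset.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]

    def _offset(self):
        lineno, col = self.getpos()  # lineno is 1-based, col is 0-based
        return self.line_starts[lineno - 1] + col

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.open_offsets.append(self._offset())

    def handle_endtag(self, tag):
        if tag == self.tag and self.open_offsets:
            start = self.open_offsets.pop()
            # The closing tag ends at the first '>' at or after its start.
            end = self.source.index(">", self._offset()) + 1
            self.found.append((start, self.source[start:end]))


def find_nested(source, tag="tr"):
    parser = NestedTagCollector(source, tag)
    parser.feed(source)
    parser.close()
    # Sort by start offset so outer elements are listed before inner ones.
    return [text for _, text in sorted(parser.found)]


test = '<tr> <stuff1> <tr><stuff2> </tr> </tr>'
print(find_nested(test))
```

Each inner element is closed before its parent, so results are recorded innermost first; sorting by start offset puts the outer `<tr>` first, matching the order requested in the question.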

