0

Hi all, I have the string below:

test = '<tr> <stuff1> <tr><stuff2> </tr> </tr>'

and I would like python to return the following:

result=['<tr><stuff1><tr><stuff2></tr></tr>','<tr><stuff2></tr>']

I've tried re.findall('<tr>.+</tr>', test), but that just returns the entire string...

Thanks

user1745713
  • 2
    Don't parse [html/xml](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) with regexes... Regexes aren't the right tool for it. – Willem Van Onsem Feb 22 '15 at 20:27
  • @user5061: now imagine you are parsing a table with ``. The rules of html/xhtml are extremely complicated... – Willem Van Onsem Feb 22 '15 at 20:43
  • Using `.+?` instead of `.+` would match "as few as possible", probably solving your problem. _Note: Regular expressions **should not** be used for your problem._ – user Feb 22 '15 at 20:46
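A quick check of what the two quantifiers do with the string from the question, using only the standard `re` module (the variable names `greedy` and `lazy` are just for illustration):

```python
import re

test = '<tr> <stuff1> <tr><stuff2> </tr> </tr>'

# Greedy: .+ runs to the end of the string, then backs off to the LAST </tr>.
greedy = re.findall(r'<tr>.+</tr>', test)

# Lazy: .+? stops at the FIRST </tr> it can reach.
lazy = re.findall(r'<tr>.+?</tr>', test)

print(greedy)  # ['<tr> <stuff1> <tr><stuff2> </tr> </tr>']
print(lazy)    # ['<tr> <stuff1> <tr><stuff2> </tr>']
```

Neither pattern returns the inner row as a separate item: `re.findall` yields non-overlapping matches, and the pattern has no way to pair each `<tr>` with its own `</tr>`.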

1 Answer

2

You should use an HTML parser to parse HTML:

from bs4 import BeautifulSoup

html = """<tr> <stuff1> <tr><stuff2> </tr> </tr>"""
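# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")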

print(soup.find_all("tr"))
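
If installing bs4 is not an option, the same idea can be roughly sketched with the standard library's `html.parser`. This is only a minimal sketch, not what the code above does internally. The class name `TrCollector` and its attributes `rows` and `open_rows` are made up for illustration. It rebuilds each row's markup from parser events, so it assumes simple input with no comments, entities, or self-closing tags:

```python
from html.parser import HTMLParser


class TrCollector(HTMLParser):
    """Hypothetical helper: collect the markup of every <tr>, nested ones too."""

    def __init__(self):
        super().__init__()
        self.open_rows = []  # stack of (result_index, text_pieces) for open rows
        self.rows = []       # finished rows, in the order their <tr> was opened

    def _emit(self, text):
        # Every currently open <tr> contains this piece of markup.
        for _, pieces in self.open_rows:
            pieces.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Reserve a slot now so outer rows are listed before inner ones.
            self.rows.append(None)
            self.open_rows.append((len(self.rows) - 1, []))
        self._emit(self.get_starttag_text())

    def handle_data(self, data):
        self._emit(data)

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if tag == "tr" and self.open_rows:
            index, pieces = self.open_rows.pop()
            self.rows[index] = "".join(pieces)


test = '<tr> <stuff1> <tr><stuff2> </tr> </tr>'
collector = TrCollector()
collector.feed(test)
collector.close()
print(collector.rows)
# ['<tr> <stuff1> <tr><stuff2> </tr> </tr>', '<tr><stuff2> </tr>']
```

Whitespace between tags is kept as it appears in the input, so the rows differ from the list in the question only in spacing. A row that is never closed stays `None` in `rows`.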
Padraic Cunningham