Firstly, you need to make your pattern match non-greedy (switch .* to .*?). You can read more about that in the examples given in the Python docs (they even use HTML tags as an example!).
Secondly, the /? part should be at the start of the pattern, right after the opening <, rather than after the tag name \w+.
Also, the second \s* is redundant, since .* will match whitespace as well.
import re
s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
pat = r'</?\s*(\w+).*?>'
tags = re.findall(pat, s)
print(tags)
Output:
['p', 'a', 'a', 'p']
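To see why the non-greedy quantifier matters, here is a small comparison sketch on the same input. The variable names greedy and lazy are only illustrative; the greedy pattern is the one above with .*? changed back to .*.

```python
import re

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'

# Greedy: .* runs to the last '>' in the string, so the whole input is one match
greedy = re.findall(r'</?\s*(\w+).*>', s)

# Non-greedy: .*? stops at the first '>' after each tag name
lazy = re.findall(r'</?\s*(\w+).*?>', s)

print(greedy)  # ['p']
print(lazy)    # ['p', 'a', 'a', 'p']
```

With the greedy version, only the first tag name is captured because a single match swallows everything up to the final >.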
For a much more general solution, consider using BeautifulSoup or HTMLParser instead:
from html.parser import HTMLParser

class HTMLTagParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        tags.append(tag)

    def handle_endtag(self, tag):
        tags.append(tag)

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
tags = []
parser = HTMLTagParser()
parser.feed(s)
print(tags)
Output:
['p', 'a', 'a', 'p']
This approach will work for arbitrary HTML, whereas a regex becomes messy as you make fewer assumptions about the input. Note that, for start tags, the attrs argument in handle_starttag can also be used to retrieve the attributes of the tag, should you need them.
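As a rough sketch of how attrs could be used: HTMLParser passes it as a list of (name, value) pairs, so it converts directly to a dict. The class name HTMLAttrParser and the found list are made up for this illustration.

```python
from html.parser import HTMLParser

class HTMLAttrParser(HTMLParser):
    """Collects (tag, attributes) for every start tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; store it as a dict
        self.found.append((tag, dict(attrs)))

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
attr_parser = HTMLAttrParser()
attr_parser.feed(s)
print(attr_parser.found)
# [('p', {}), ('a', {'href': 'http://www.quackit.com/html/tutorial/html_links.cfm'})]
```

Keeping the results on the parser instance, rather than in a module-level list, makes it easier to reuse the class for several documents.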