1

I'm somewhat stuck with this and didn't find a similar issue here.

I want to get a list of all the tag names in the string, e.g. <a> -> a or </b> -> b.

import re

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
pat = r'<\s*(\w+)/?\s*.*>'
tags = re.findall(pat, s)
print(tags)

Here I get ['p'] as a result. If I change the \w+ to [a-d]+ I just get ['a'] as a result.

I'd expect as result ['p', 'a', 'a', 'p'] or at least all the distinct tag values.

What did I do wrong here? Thank you!

Using Python 3.x

costaparas
Matt444

2 Answers

3

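Firstly, you need to make your pattern non-greedy (switch .* to .*?). A greedy .* consumes everything up to the last > in the string, so the whole input becomes a single match, which is why you only get ['p']. You can read more about that in the examples given in the Python docs (they even use HTML tags as an example!).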

Secondly, the /? part should come right after the <, before the tag name \w+, so that closing tags such as </a> are matched too.

Also, the second \s* is redundant, since .* matches whitespace as well.

import re

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
pat = r'</?\s*(\w+).*?>'
tags = re.findall(pat, s)
print(tags)

Output:

['p', 'a', 'a', 'p']
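To make the greedy versus non-greedy difference concrete, here is a minimal sketch that runs both variants on the same string. The variable names greedy and lazy are only illustrative and are not part of the original code.

```python
import re

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'

# Greedy: .* runs to the last '>' in the string, so the whole input is one match.
greedy = re.findall(r'<.*>', s)

# Non-greedy: .*? stops at the first '>' it can, giving one match per tag.
lazy = re.findall(r'<.*?>', s)

print(len(greedy))  # 1
print(len(lazy))    # 4
print(lazy[0], lazy[-1])  # <p> </p>
```

With the greedy pattern there is only one match, so a capture group placed after the first < can only ever see the first tag name.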

For a much more general solution, consider using BeautifulSoup or HTMLParser instead:

from html.parser import HTMLParser

class HTMLTagParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        tags.append(tag)

    def handle_endtag(self, tag):
        tags.append(tag)

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
tags = []
parser = HTMLTagParser()
parser.feed(s)
print(tags)

Output:

['p', 'a', 'a', 'p']

This approach will work for arbitrary HTML, whereas a regex becomes messy as you relax the assumptions it makes. Note that for start tags, the attrs argument of handle_starttag can also be used to retrieve the tag's attributes, should you need them.
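As a rough sketch of how the attrs argument might be used, the variant below keeps its results on the parser instance instead of in a global list. The class name TagCollector and the attributes list are hypothetical names chosen for illustration; only HTMLParser and its handler methods come from the standard library.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects tag names and, for start tags, their attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []        # every start and end tag name, in document order
        self.attributes = []  # (tag, {attribute: value}) for start tags with attributes

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if attrs:
            # attrs arrives as a list of (name, value) pairs
            self.attributes.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tags.append(tag)


s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'
collector = TagCollector()
collector.feed(s)
collector.close()

print(collector.tags)        # ['p', 'a', 'a', 'p']
print(collector.attributes)  # [('a', {'href': 'http://www.quackit.com/html/tutorial/html_links.cfm'})]
```

Keeping the lists on the instance means several strings can be parsed independently without sharing state.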

costaparas
  • Thanks, that worked! The string is just an example, there can also be '/' after the tag characters if these are self-closing tags. One question: why the .*? - I thought .* matches 0 or more of any character. I don't really understand the ? here. – Matt444 Jan 09 '21 at 04:10
  • Great, I also updated my answer to include an alternative using Python modules, which is a more general and reliable solution than using regex. – costaparas Jan 09 '21 at 04:14
  • @matt444 It's regex syntax for making the `.*` non-greedy. You can read more about that [here](https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy) in the Python docs -- they even use HTML tags as an example. – costaparas Jan 09 '21 at 04:15
0

Use the or (|) operator: write one pattern for opening tags and one for closing tags, separated by |, and it should work.

Refer to this: How is the AND/OR operator represented as in Regular Expressions?
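One possible reading of this suggestion, sketched under the assumption that one alternative handles opening tags and the other handles closing tags. The pattern and the names pairs and tags are illustrative, not taken from the answer. Note that with two capture groups, re.findall returns a tuple per match, with an empty string for the group that did not participate.

```python
import re

s = '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>'

# First alternative: opening tag with optional attributes.
# Second alternative: closing tag.
pat = r'<(\w+)[^>]*>|</(\w+)\s*>'

pairs = re.findall(pat, s)
print(pairs)  # [('p', ''), ('a', ''), ('', 'a'), ('', 'p')]

# Keep whichever group actually matched in each tuple.
tags = [opening or closing for opening, closing in pairs]
print(tags)  # ['p', 'a', 'a', 'p']
```

This gives the same result as the single pattern with an optional /, at the cost of an extra step to flatten the tuples.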

Yash